Extract HTML rows with headers specified

kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to extract HTML table rows with the following code:

use strict;
use warnings;
use HTML::TableExtract;
use Data::Dumper;

my $html = qq{
<HTML>
  <BODY>
    <table border="1">
      <tr><td align="center" nowrap><font size="2"><u>Activity #</u><td align="center"><font size="2">Some&nbsp;ID<br>/Debit&nbsp;ID</font></td></tr>
      <tr><td align="right"><font size="2">588476377</font></td><td><font size="2"><a href="/cgi-bin/page?id=1275591">1275591</a></font></td></tr>
      <tr><td align="right"><font size="2">588484813</font></td><td><font size="2"><a href="/cgi-bin/page?id=1210540">1210540</a></font></td></tr>
    </table>
  </BODY>
</HTML>
};

my $te = HTML::TableExtract->new( headers => ['Some ID'] );
$te->parse($html);

eval {
    $te->rows;
};

if ( $@ ) {
    print "No rows found\n";
}

print Dumper($te->rows);

When I try to extract using a table header like 'Invoice ID', where the '&nbsp;' is not visible on the rendered web page, the code prints 'No rows found'. How can I extract the data even when the headers contain spaces, '/', or other characters?


Replies are listed 'Best First'.
Re: Extract HTML rows with headers specified by wfsp (Abbot) on Jan 29, 2010 at 08:49 UTC
Changing `['Some ID']` to `['Some ID']` and it works ok here.

Update: No it doesn't :-( But it's because `Some ID` isn't the same as `Some ID` (although it may look the same in the browser): the `&nbsp;` in the HTML decodes to a no-break space (character 0xA0), not a plain space.

Update2:

my $header = q{Some} . chr(0x0A0) . q{ID};
my $te = HTML::TableExtract->new( headers => [$header] );
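The mismatch described in this reply can be reproduced without the table at all. The following is a minimal sketch using only core Perl; `normalize_spaces` and `$either_space` are hypothetical names introduced for illustration, not part of HTML::TableExtract.

```perl
use strict;
use warnings;

# What was typed in the headers list: a plain ASCII space (0x20).
my $plain = 'Some ID';

# What the decoded page text contains: &nbsp; becomes a no-break space (0xA0).
my $decoded = 'Some' . chr(0xA0) . 'ID';

# Hypothetical helper: map no-break spaces to plain spaces before comparing.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;
    return $text;
}

# A pattern that accepts either kind of space between the two words.
my $either_space = qr/Some[ \x{A0}]ID/i;

print "raw strings equal: ", ( $plain eq $decoded ? 'yes' : 'no' ), "\n";
print "after normalizing: ", ( $plain eq normalize_spaces($decoded) ? 'yes' : 'no' ), "\n";
print "pattern matches:   ", ( $decoded =~ $either_space ? 'yes' : 'no' ), "\n";
```

If header strings are applied as regular expressions, as I understand the HTML::TableExtract documentation to say, a character class like the one in `$either_space` may be a way to write a header that tolerates both kinds of space. Treat that as an assumption to verify against your installed version.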
Re: Extract HTML rows with headers specified by steve (Deacon) on Jan 29, 2010 at 19:27 UTC
The HTML::TableExtract documentation describes a "decode" constructor attribute as follows:

"Automatically decode retrieved text with HTML::Entities::decode_entities(). Enabled by default. Has no effect if keep_html was specified or if extracting into an element tree structure."

The following works for me:

my $html = qq{
<HTML>
  <BODY>
    <table border="1">
      <tr><td align="center" nowrap><font size="2"><u>Activity #</u><td align="center"><font size="2">Some&nbsp;ID<br>/Debit&nbsp;ID</font></td></tr>
      <tr><td align="right"><font size="2">588476377</font></td><td><font size="2"><a href="/cgi-bin/page?id=1275591">1275591</a></font></td></tr>
      <tr><td align="right"><font size="2">588484813</font></td><td><font size="2"><a href="/cgi-bin/page?id=1210540">1210540</a></font></td></tr>
    </table>
  </BODY>
</HTML>
};

my $te = HTML::TableExtract->new( headers => ['Some&nbsp;ID'], decode => 0 );
$te->parse($html);

eval {
    $te->rows;
};

if ( $@ ) {
    print "No rows found\n";
}

print Dumper($te->rows);
Re: Extract HTML rows with headers specified by Anonymous Monk on Jan 29, 2010 at 08:58 UTC
Try turning on debugging.
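One way to check whether the data is what it appears to be, independent of the module's own debugging output, is to print each character's code point. This is a minimal core-Perl sketch; `code_points` is a hypothetical helper, and the two sample strings stand in for the typed header and the decoded cell text.

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as U+XXXX.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

my $typed   = 'Some ID';                     # plain space
my $decoded = 'Some' . chr(0xA0) . 'ID';     # no-break space, as &nbsp; decodes

print "typed:   ", code_points($typed),   "\n";
print "decoded: ", code_points($decoded), "\n";
```

Two strings that look identical on screen show up as different here: the space position reads U+0020 in one and U+00A0 in the other.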

