PerlMonks  

Extract HTML rows with headers specified

by kalyanrajsista (Scribe)
on Jan 29, 2010 at 05:10 UTC ( [id://820306]=perlquestion )

kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to extract HTML table rows with the following code

use strict;
use warnings;
use HTML::TableExtract;
use Data::Dumper;

my $html = qq{
<HTML>
<BODY>
<table border="1">
<tr><td align="center" nowrap><font size="2"><u>Activity #</u><td align="center"><font size="2">Some&nbsp;ID<br>/Debit&nbsp;ID</font></td></tr>
<tr><td align="right"><font size="2">588476377</font></td><td><font size="2"><a href="/cgi-bin/page?id=1275591">1275591</a></font></td></tr>
<tr><td align="right"><font size="2">588484813</font></td><td><font size="2"><a href="/cgi-bin/page?id=1210540">1210540</a></font></td></tr>
</table>
</BODY>
</HTML>
};

my $te = HTML::TableExtract->new( headers => ['Some ID'] );
$te->parse($html);

eval { $te->rows; };
if ( $@ ) {
    print "No rows found\n";
}
print Dumper($te->rows);

When I try to extract using a table header like 'Invoice ID', whose '&nbsp;' is not visible as such in the rendered web page, the code prints 'No rows found'. How can I extract the data even when the headers contain spaces, '/' or other characters?

Replies are listed 'Best First'.
Re: Extract HTML rows with headers specified
by wfsp (Abbot) on Jan 29, 2010 at 08:49 UTC
    Change
    ['Some ID']
    to
    ['Some&nbsp;ID']
    and it works ok here.

    Update: No it doesn't :-(
    But it's because Some ID isn't the same as Some&nbsp;ID (although it may look the same in the browser).

    Update2:

    my $header = q{Some} . chr(0x0A0) . q{ID};
    my $te = HTML::TableExtract->new( headers => [$header] );
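A minimal sketch of why the plain-space header fails and the chr(0x0A0) one matches, assuming HTML::Entities is installed (HTML::TableExtract's decode option is described further down as using its decode_entities()). The variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# What the header cell text looks like once entities have been decoded.
my $decoded = decode_entities('Some&nbsp;ID');

# The character between "Some" and "ID" is U+00A0 (no-break space),
# not the ASCII space U+0020.
my $nbsp_code = ord( substr( $decoded, 4, 1 ) );
printf "code point between the words: 0x%02X\n", $nbsp_code;

my $plain  = 'Some ID';                       # header from the question
my $manual = q{Some} . chr(0x0A0) . q{ID};    # header from Update2

print "plain space equal: ", ( $decoded eq $plain  ? "yes" : "no" ), "\n";
print "chr(0xA0) equal:   ", ( $decoded eq $manual ? "yes" : "no" ), "\n";
```

The two strings look identical in a browser, but only the second one is character-for-character the same as the decoded cell text.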
Re: Extract HTML rows with headers specified
by steve (Deacon) on Jan 29, 2010 at 19:27 UTC
    HTML::TableExtract indicates that there is a "decode" constructor attribute that is described as follows:
    Automatically decode retrieved text with HTML::Entities::decode_entities(). Enabled by default. Has no effect if keep_html was specified or if extracting into an element tree structure.

    The following works for me:
    use strict;
    use warnings;
    use HTML::TableExtract;
    use Data::Dumper;

    my $html = qq{
    <HTML>
    <BODY>
    <table border="1">
    <tr><td align="center" nowrap><font size="2"><u>Activity #</u><td align="center"><font size="2">Some&nbsp;ID<br>/Debit&nbsp;ID</font></td></tr>
    <tr><td align="right"><font size="2">588476377</font></td><td><font size="2"><a href="/cgi-bin/page?id=1275591">1275591</a></font></td></tr>
    <tr><td align="right"><font size="2">588484813</font></td><td><font size="2"><a href="/cgi-bin/page?id=1210540">1210540</a></font></td></tr>
    </table>
    </BODY>
    </HTML>
    };

    # decode => 0 leaves entities undecoded, so the header can be matched as written
    my $te = HTML::TableExtract->new( headers => ['Some&nbsp;ID'], decode => 0 );
    $te->parse($html);

    eval { $te->rows; };
    if ( $@ ) {
        print "No rows found\n";
    }
    print Dumper($te->rows);
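Another option that keeps decoding enabled, sketched here under one assumption: as I read the HTML::TableExtract documentation, header entries may be qr// objects (plain strings are themselves applied as non-anchored, case-insensitive patterns). If that holds, a pattern that tolerates any non-word character between the words covers an ASCII space, a decoded non-breaking space, or a slash. The table below is a cut-down, illustrative copy of the one in the question.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Simplified version of the table from the question (illustrative only).
my $html = <<'HTML';
<table border="1">
<tr><td>Activity #</td><td>Some&nbsp;ID<br>/Debit&nbsp;ID</td></tr>
<tr><td>588476377</td><td><a href="/cgi-bin/page?id=1275591">1275591</a></td></tr>
<tr><td>588484813</td><td><a href="/cgi-bin/page?id=1210540">1210540</a></td></tr>
</table>
HTML

# \W+ means "one or more non-word characters", so the header matches
# whether the separator is a space, U+00A0, '/', and so on.
my $te = HTML::TableExtract->new( headers => [ qr/Some\W+ID/i ] );
$te->parse($html);

# Collect rows from every matching table; tables() returns an empty
# list when nothing matched, so no eval is needed to detect that case.
my @rows;
for my $ts ( $te->tables ) {
    push @rows, $ts->rows;
}

if (@rows) {
    print join( ',', @$_ ), "\n" for @rows;
}
else {
    print "No rows found\n";
}
```

Looping over tables() rather than calling rows() directly on the extractor avoids the exception the original code relies on when no table matches.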
Re: Extract HTML rows with headers specified
by Anonymous Monk on Jan 29, 2010 at 08:58 UTC
    Try turning on debugging.
