http://qs321.pair.com?node_id=814093

kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to extract table rows from the following HTML. My code produces the desired output, but is there another way to map the table rows to key-value pairs, or to an array of arrays holding the td elements under each tr?

<html><head><title>Person Profile</title></head>
<center>
<font size=5><b>Profile</b></font>
<table cellspacing="1" cellpadding="1">
<tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
<tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
<tr> <td class="rlab">Company:</td> <td class="l">Idea</a></td> </tr>
</tr>
<tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
</table>
</body></html>

I'm trying the following code

use strict;
use warnings;
use HTML::TreeBuilder;

# Parse html content using html-treebuilder:
my $root = HTML::TreeBuilder->new();
$root->parse($html);
$root->eof();

my @tables = $root->look_down(_tag => 'table');
while (@tables) {
    my $node = shift @tables;
    if (ref $node) {
        unshift @tables, $node->content_list;
    }
    else {
        print $node, "\n";
    }
}
$root = $root->delete;

OUTPUT is

---------- Perl ----------
Short Name:
John
Long Name:
John Abraham
Company:
Idea
Currency:
EUR

Output completed (0 sec consumed) - Normal Termination

Replies are listed 'Best First'.
Re: Extract HTML Table rows
by bobf (Monsignor) on Dec 23, 2009 at 14:44 UTC

    I have found HTML::TableExtract to be easy to use in simple cases:

    use strict;
    use warnings;
    use HTML::TableExtract;

    my $content;
    {
        local $/ = undef;    # slurp mode
        $content = <DATA>;
    }

    my $te = HTML::TableExtract->new();
    $te->parse( $content );

    foreach my $ts ( $te->tables() ) {
        foreach my $row ( $ts->rows() ) {
            print join ( "\t", @$row ), "\n";
        }
    }

    __DATA__
    <html><head><title>Person Profile</title></head>
    <center>
    <font size=5><b>Profile</b></font>
    <table cellspacing="1" cellpadding="1">
    <tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
    <tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
    <tr> <td class="rlab">Company:</td> <td class="l">Idea</a></td> </tr>
    </tr>
    <tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
    </table>
    </body></html>
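
Since the question asks for key-value pairs or an array of arrays, here is a minimal sketch of how the rows returned by HTML::TableExtract could be collected into both. It assumes every row of interest has a label in its first cell and a value in its second; the names @rows and %profile and the shortened sample markup are only for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Shortened, tidied copy of the sample table (illustrative data only).
my $content = <<'HTML';
<table cellspacing="1" cellpadding="1">
<tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
<tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
<tr> <td class="rlab">Company:</td> <td class="l">Idea</td> </tr>
<tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
</table>
HTML

my $te = HTML::TableExtract->new();
$te->parse($content);

my @rows;       # array of arrays, one inner array per table row
my %profile;    # label => value, assuming two cells per row

for my $ts ( $te->tables ) {
    for my $row ( $ts->rows ) {
        # Copy the cells, mapping undef to '' and trimming whitespace.
        my @cells = map {
            my $cell = defined $_ ? $_ : '';
            $cell =~ s/^\s+|\s+$//g;
            $cell;
        } @$row;

        push @rows, \@cells;

        # Assumption: first cell is the key, second cell is the value.
        $profile{ $cells[0] } = $cells[1] if @cells >= 2;
    }
}

# @rows keeps the table order; %profile gives lookup by label.
print join( "\t", @$_ ), "\n" for @rows;
print "Currency is $profile{'Currency:'}\n" if exists $profile{'Currency:'};
```

Keeping both structures is a design choice: the array of arrays preserves row order, while the hash is convenient for lookups but has no defined order.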

Re: Extract HTML Table rows
by suaveant (Parson) on Dec 23, 2009 at 14:40 UTC
    There are modules written specifically to handle HTML tables:

    HTML::TableExtract
    HTML::TableParser

                    - Ant

Re: Extract HTML Table rows
by wfsp (Abbot) on Dec 23, 2009 at 15:16 UTC
    The modules recommended by suaveant and bobf are a good bet. If you want to use HTML::TreeBuilder, the following is one way to do it.

    #! /usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    $Data::Dumper::Indent=1;
    use HTML::TreeBuilder;

    my $t = HTML::TreeBuilder->new_from_file(*DATA);
    my ($table) = $t->look_down(_tag => q{table});
    my @rows = $table->look_down(_tag => q{tr});

    my %db;
    for my $row (@rows){
      my $key   = $row->look_down(class => q{rlab})->as_text;
      my $value = $row->look_down(class => q{l})->as_text;
      $db{$key} = $value;
    }

    for my $key (keys %db){
      printf qq{%s -> %s\n}, $key, $db{$key};
    }

    __DATA__
    <html><head><title>Person Profile</title></head>
    <center>
    <font size=5><b>Profile</b></font>
    <table cellspacing="1" cellpadding="1">
    <tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
    <tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
    <tr> <td class="rlab">Company:</td> <td class="l">Idea</a></td> </tr>
    </tr>
    <tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
    </table>
    </body></html>

    Company: -> Idea
    Long Name: -> John Abraham
    Currency: -> EUR
    Short Name: -> John
    I've assumed that
    • there is one table,
    • each row has two columns, each with a class as in your sample data.
    You would probably want to include some error checking to confirm those assumptions though.
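
One way such error checking might look is sketched below. It confirms that exactly one table was found and skips any row that lacks a cell with class rlab or class l. It also keeps the pairs in an array of arrays so that the table order survives, since iterating over keys %db returns keys in no particular order. The names @pairs and $html and the shortened sample markup are illustrative assumptions, not part of the original code.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened, tidied copy of the sample page (illustrative data only).
my $html = <<'HTML';
<html><head><title>Person Profile</title></head><body>
<table cellspacing="1" cellpadding="1">
<tr> <td class="rlab">Short Name:</td> <td class="l">John</td> </tr>
<tr> <td class="rlab">Long Name:</td> <td class="l">John Abraham</td> </tr>
<tr> <td class="rlab">Company:</td> <td class="l">Idea</td> </tr>
<tr> <td class="rlab">Currency:</td> <td class="l">EUR</td> </tr>
</table>
</body></html>
HTML

my $t = HTML::TreeBuilder->new_from_content($html);

# Assumption 1: there is exactly one table.
my @tables = $t->look_down( _tag => 'table' );
die "expected exactly one table, found " . @tables . "\n"
    unless @tables == 1;

my @pairs;    # [ key, value ] in table order
my %db;       # key => value for lookups

for my $row ( $tables[0]->look_down( _tag => 'tr' ) ) {
    # Assumption 2: each row has one td of class rlab and one of class l.
    my $label = $row->look_down( _tag => 'td', class => 'rlab' );
    my $value = $row->look_down( _tag => 'td', class => 'l' );
    unless ( $label && $value ) {
        warn "skipping row without both rlab and l cells\n";
        next;
    }

    my ( $k, $v ) = ( $label->as_trimmed_text, $value->as_trimmed_text );
    push @pairs, [ $k, $v ];
    $db{$k} = $v;
}

# Print in table order rather than hash order.
printf "%s -> %s\n", @$_ for @pairs;

$t->delete;    # free the parse tree
```

Whether a missing cell should be skipped with a warning, as here, or treated as a fatal error depends on how much you trust the input.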