Extract HTML Table rows

kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to extract table rows from the following HTML. My code produces the desired output, but is there another way to map the table rows to key-value pairs, or to an array of arrays holding the td elements under each tr?

<html><head><title>Person Profile</title></head>
<center>
<font size=5><b>Profile</b></font>

<table cellspacing="1" cellpadding="1">
<tr>
    <td class="rlab">Short Name:</td>
    <td class="l">John</td>
</tr>
<tr>
    <td class="rlab">Long Name:</td>
    <td class="l">John Abraham</td>
</tr>
<tr>
    <td class="rlab">Company:</td>
    <td class="l">Idea</a></td>
</tr>
</tr>
<tr>
    <td class="rlab">Currency:</td>
    <td class="l">EUR</td>
</tr>
</table>
</body></html>

I'm trying the following code

use strict;
use warnings;
use HTML::TreeBuilder;

# $html is assumed to hold the HTML document shown above.
# Parse the HTML content using HTML::TreeBuilder:
my $root = HTML::TreeBuilder->new();
$root->parse($html);
$root->eof();

my @tables = $root->look_down(_tag => 'table');
while (@tables) {
    my $node = shift @tables;
    if (ref $node) {
        unshift @tables, $node->content_list;
    }
    else {
        print $node,"\n";
    }
}
$root = $root->delete;

OUTPUT is

---------- Perl ----------
Short Name:
John
Long Name:
John Abraham
Company:
Idea
Currency:
EUR

Output completed (0 sec consumed) - Normal Termination

Re: Extract HTML Table rows by bobf (Monsignor) on Dec 23, 2009 at 14:44 UTC
I have found HTML::TableExtract to be easy to use in simple cases:

use strict;
use warnings;
use HTML::TableExtract;

my $content;
{
    local $/ = undef;    # slurp mode
    $content = <DATA>;
}

my $te = HTML::TableExtract->new();
$te->parse( $content );

foreach my $ts ( $te->tables() ) {
    foreach my $row ( $ts->rows() ) {
        print join ( "\t", @$row ), "\n";
    }
}

__DATA__
<html><head><title>Person Profile</title></head>
<center>
<font size=5><b>Profile</b></font>

<table cellspacing="1" cellpadding="1">
<tr>
    <td class="rlab">Short Name:</td>
    <td class="l">John</td>
</tr>
<tr>
    <td class="rlab">Long Name:</td>
    <td class="l">John Abraham</td>
</tr>
<tr>
    <td class="rlab">Company:</td>
    <td class="l">Idea</a></td>
</tr>
</tr>
<tr>
    <td class="rlab">Currency:</td>
    <td class="l">EUR</td>
</tr>
</table>
</body></html>
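The question asks for an array of arrays or key-value pairs rather than printed lines. A minimal sketch of how the HTML::TableExtract rows could be collected into both shapes follows. The helper names table_rows and rows_to_hash, and the trimmed two-row sample, are illustrative assumptions and not part of the module.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample: a trimmed, well-formed copy of the table in the question.
my $sample_html = <<'HTML';
<table cellspacing="1" cellpadding="1">
<tr>
    <td class="rlab">Short Name:</td>
    <td class="l">John</td>
</tr>
<tr>
    <td class="rlab">Long Name:</td>
    <td class="l">John Abraham</td>
</tr>
</table>
HTML

# Collect every row of every table as an array of arrays of cell text.
sub table_rows {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    $te->eof;
    my @rows;
    for my $ts ( $te->tables ) {
        for my $row ( $ts->rows ) {
            # Treat missing cells as empty strings and trim surrounding whitespace.
            push @rows, [
                map {
                    my $cell = defined $_ ? $_ : '';
                    $cell =~ s/^\s+|\s+$//g;
                    $cell;
                } @$row
            ];
        }
    }
    return \@rows;
}

# Turn two-column rows into a hash, dropping the trailing colon from the label.
sub rows_to_hash {
    my ($rows) = @_;
    my %pairs;
    for my $row (@$rows) {
        next unless @$row >= 2;    # skip rows without a label/value pair
        ( my $key = $row->[0] ) =~ s/:\s*$//;
        $pairs{$key} = $row->[1];
    }
    return \%pairs;
}

my $rows  = table_rows($sample_html);
my $pairs = rows_to_hash($rows);
print "$_ => $pairs->{$_}\n" for sort keys %$pairs;
```

Keeping the array of arrays as the intermediate form preserves row order, which a plain hash does not.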
Re: Extract HTML Table rows by suaveant (Parson) on Dec 23, 2009 at 14:40 UTC
There are modules specifically for handling HTML tables:

HTML::TableExtract
HTML::TableParser

- Ant
Re: Extract HTML Table rows by wfsp (Abbot) on Dec 23, 2009 at 15:16 UTC
The modules recommended by suaveant and bobf are a good bet. If you wanted to use HTML::TreeBuilder, the following would be one way to do it.

#! /usr/bin/perl
use strict;
use warnings;

use Data::Dumper;
$Data::Dumper::Indent=1;

use HTML::TreeBuilder;

my $t = HTML::TreeBuilder->new_from_file(*DATA);

my ($table) = $t->look_down(_tag => q{table});
my @rows = $table->look_down(_tag => q{tr});

my %db;
for my $row (@rows){
    my $key   = $row->look_down(class => q{rlab})->as_text;
    my $value = $row->look_down(class => q{l})->as_text;
    $db{$key} = $value;
}

for my $key (keys %db){
    printf qq{%s -> %s\n}, $key, $db{$key};
}

__DATA__
<html><head><title>Person Profile</title></head>
<center>
<font size=5><b>Profile</b></font>

<table cellspacing="1" cellpadding="1">
<tr>
    <td class="rlab">Short Name:</td>
    <td class="l">John</td>
</tr>
<tr>
    <td class="rlab">Long Name:</td>
    <td class="l">John Abraham</td>
</tr>
<tr>
    <td class="rlab">Company:</td>
    <td class="l">Idea</a></td>
</tr>
</tr>
<tr>
    <td class="rlab">Currency:</td>
    <td class="l">EUR</td>
</tr>
</table>
</body></html>

Output:

Company: -> Idea
Long Name: -> John Abraham
Currency: -> EUR
Short Name: -> John

I've assumed that there is one table and that each row has two columns, each with a class as in your sample data. You would probably want to include some error checking to confirm those assumptions, though.
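The error checking suggested above might look something like the following sketch. It assumes the same class names as the sample data (rlab and l); the function name profile_from_html and the trimmed sample are hypothetical, and the choice to die on a wrong table count but skip incomplete rows is just one possible policy.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample: a trimmed, well-formed copy of the table in the question.
my $sample = <<'HTML';
<html><head><title>Person Profile</title></head><body>
<table cellspacing="1" cellpadding="1">
<tr>
    <td class="rlab">Short Name:</td>
    <td class="l">John</td>
</tr>
<tr>
    <td class="rlab">Currency:</td>
    <td class="l">EUR</td>
</tr>
</table>
</body></html>
HTML

# Build a label => value hash, checking the assumptions instead of trusting them.
sub profile_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Assumption 1: the page holds exactly one table.
    my @tables = $tree->look_down( _tag => 'table' );
    if ( @tables != 1 ) {
        my $found = scalar @tables;
        $tree->delete;
        die "Expected exactly one table, found $found\n";
    }

    my %db;
    for my $row ( $tables[0]->look_down( _tag => 'tr' ) ) {
        # Assumption 2: each row has a label cell and a value cell.
        my $label = $row->look_down( _tag => 'td', class => 'rlab' );
        my $value = $row->look_down( _tag => 'td', class => 'l' );
        next unless defined $label && defined $value;    # skip incomplete rows

        my ( $key, $val ) = ( $label->as_text, $value->as_text );
        s/^\s+|\s+$//g for $key, $val;    # trim surrounding whitespace
        $key =~ s/:$//;                   # drop the trailing colon
        $db{$key} = $val;
    }

    $tree->delete;    # free the parse tree
    return \%db;
}

my $profile = profile_from_html($sample);
printf qq{%s -> %s\n}, $_, $profile->{$_} for sort keys %$profile;
```

Sorting the keys before printing gives a stable order, since Perl hashes do not preserve insertion order.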
