Problem getting fields out of an XPath node list

by ejc1 (Novice)
on Mar 28, 2016 at 18:57 UTC (#1158973)

ejc1 has asked for the wisdom of the Perl Monks concerning the following question:

I have a piece of HTML::TreeBuilder::XPath code that sort of works, but isn't populating the node values properly. Here is what I am doing:
my @nodes = $tree->findnodes('//tr');
for my $node (@nodes) {
    my @text = $node->findvalue('td') or next;
    print Dumper \@text;
    next;
    my @node_list = $node->findvalues('td/tr');
    last;
}
When I print Dumper \@text (I get the same result using a scalar instead of an array), it just prints all of the values in the node run together, without splitting them into array elements. No matter what I do, I can't get anything into @node_list, even though each Dumper line tells me it found a node. Here is the content I am trying to parse:
#(reference, not parsing)
<tr><th>Firstseen (UTC)</th><th>Version</th><th>Feodo C&amp;C</th><th>Status</th><th>SBL</th><th>ASN</th><th>Country</th><th>Lastseen (UTC)</th></tr>
#parsing the below
<tr bgcolor="#9d9595" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#9d9595';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/83.172.215.87/" target="_parent" title="Show more information about this Feodo C&amp;C">83.172.215.87</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#bc5959"><a href="http://www.spamhaus.org/sbl/sbl.lasso?query=SBL290535" target="_blank" title="Spamhaus SBL: SBL290535">SBL290535</a></td><td>AS12651 IPWORLDCOM</td><td><img src="images/flags/ch.gif" alt="-" title="CH (CH)" width="16" height="10" /> CH</td><td>never</td></tr>
<tr bgcolor="#837b7b" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#837b7b';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/98.23.159.86/" target="_parent" title="Show more information about this Feodo C&amp;C">98.23.159.86</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#4f883f">Not listed</td><td>AS7029 WINDSTREAM</td><td><img src="images/flags/us.gif" alt="-" title="US (US)" width="16" height="10" /> US</td><td>never</td></tr>
<tr bgcolor="#9d9595" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#9d9595';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/178.188.14.86/" target="_parent" title="Show more information about this Feodo C&amp;C">178.188.14.86</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#4f883f">Not listed</td><td>AS8447 TELEKOM-AT</td><td><img src="images/flags/at.gif" alt="-" title="AT (AT)" width="16" height="10" /> AT</td><td>2016-03-24 01:19:50</td></tr>
Thank you for your assistance! I am trying to break out the first-seen time, the IP address, and the offline or online status. At the moment, I can't even get these values into @text as an array so that I can push them into a hash.

Replies are listed 'Best First'.
Re: Problem getting fields out of an XPath node list
by Corion (Patriarch) on Mar 28, 2016 at 19:07 UTC

    You're using ->findvalue('td'), which returns the text of all matching nodes joined into a single string. I recommend calling ->findnodes() a second time, and then using ->as_text on each node:

    use strict;
    use HTML::TreeBuilder::XPath;

    my $html = <<'HTML';
    <html><body>
    <table>
    <tr><th>Firstseen (UTC)</th><th>Version</th><th>Feodo C&amp;C</th><th>Status</th><th>SBL</th><th>ASN</th><th>Country</th><th>Lastseen (UTC)</th></tr>
    #parsing the below
    <tr bgcolor="#9d9595" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#9d9595';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/83.172.215.87/" target="_parent" title="Show more information about this Feodo C&amp;C">83.172.215.87</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#bc5959"><a href="http://www.spamhaus.org/sbl/sbl.lasso?query=SBL290535" target="_blank" title="Spamhaus SBL: SBL290535">SBL290535</a></td><td>AS12651 IPWORLDCOM</td><td><img src="images/flags/ch.gif" alt="-" title="CH (CH)" width="16" height="10" /> CH</td><td>never</td></tr>
    <tr bgcolor="#837b7b" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#837b7b';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/98.23.159.86/" target="_parent" title="Show more information about this Feodo C&amp;C">98.23.159.86</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#4f883f">Not listed</td><td>AS7029 WINDSTREAM</td><td><img src="images/flags/us.gif" alt="-" title="US (US)" width="16" height="10" /> US</td><td>never</td></tr>
    <tr bgcolor="#9d9595" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#9d9595';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/178.188.14.86/" target="_parent" title="Show more information about this Feodo C&amp;C">178.188.14.86</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#4f883f">Not listed</td><td>AS8447 TELEKOM-AT</td><td><img src="images/flags/at.gif" alt="-" title="AT (AT)" width="16" height="10" /> AT</td><td>2016-03-24 01:19:50</td></tr>
    </table>
    </body></html>
    HTML

    my $p = HTML::TreeBuilder->new;
    my $tree = $p->parse($html);
    my @nodes = $tree->findnodes('//tr');
    use Data::Dumper;
    for my $node (@nodes) {
        # one element per <td>; rows without any <td> (the header row) are skipped
        my @text = $node->findnodes('td') or next;
        for (@text) {
            # as_text gives the text content of a single cell
            print $_->as_text, "\n";
        };
    }

    Maybe you want to be more specific with your XPath expressions to extract the cells directly, for example //tr/td[1] for the first-seen column, and so on. Also see HTML::TableExtract.
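
A minimal sketch of that per-cell approach, using a positional predicate for each wanted column. The HTML is a trimmed stand-in for the table in the question (attributes and some columns removed), and the hash keys first_seen, ip and status are names chosen here for illustration, not anything from the original page:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed stand-in for the table in the question (assumption: same column order).
my $html = <<'HTML';
<html><body><table>
<tr><th>Firstseen (UTC)</th><th>Version</th><th>Feodo C&amp;C</th><th>Status</th></tr>
<tr><td>2016-03-19 23:44:36</td><td><strong>D</strong></td><td><a href="/host/83.172.215.87/">83.172.215.87</a></td><td>offline</td></tr>
<tr><td>2016-03-19 23:44:36</td><td><strong>D</strong></td><td><a href="/host/98.23.159.86/">98.23.159.86</a></td><td>offline</td></tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @records;
# //tr[td] keeps only rows that have <td> children, so the <th> header row is skipped.
for my $row ( $tree->findnodes('//tr[td]') ) {
    # td[N] is relative to the current row and selects its Nth <td> child.
    my $first_seen = $row->findvalue('td[1]');
    my $ip         = $row->findvalue('td[3]');
    my $status     = $row->findvalue('td[4]');
    push @records, { first_seen => $first_seen, ip => $ip, status => $status };
}

printf "%s  %s  %s\n", @{$_}{qw(first_seen ip status)} for @records;

# Release the tree; older HTML::Element versions need this to free memory.
$tree->delete;
```

Because each findvalue call here matches exactly one cell, the concatenation behaviour of findvalue does no harm, and each row ends up as one hash reference.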

Re: Problem getting fields out of an XPath node list
by CountZero (Bishop) on Mar 29, 2016 at 07:23 UTC
    Your program never reaches the my @node_list = $node->findvalues('td/tr') due to the next on the preceding line, so @node_list never gets populated.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      I managed to get the values out, badly, with:
      my @nodes = $tree->findnodes('//tr');
      for my $node (@nodes) {
          my @text = $node->findvalues('td') or next;
          print Dumper \@text;
      }
      It is bad in that I still have no clue what XPath is doing, despite reading the documentation on it. It only works because, as far as I can see, there is only one table. I am trying to get the following data parsed:
      <ul><li>There was registered attempt to establish connection with the remote host. The connection details are:</li></ul>
      <p><table class="tbl" cellpadding="5" cellspacing="0">
      <tr><td class="cell_1_h">Remote Host</td><td class="cell_2_h">Port Number</td></tr>
      <tr><td class="cell_1">192.5.5.241<td class="cell_2">8091</td><tr>
      </table></p>
      but am having zero luck. I am trying:
      my @nodes = $tree->findnodes('//ul');
      for my $node (@nodes) {
          my $text2 = $node->findvalue('li') or next;
          if ($text2 =~ m/connection details are/) {
              print "$text2\n";
              my @text = $node->findvalues('/tr/td');
              print Dumper \@text;
          }
      }
      The problem is that it clearly finds the li node and matches it, and then tries to run findvalues against /tr/td. This doesn't work at all. I have tried '/tr/td', '//tr/td', and 'td', and can't get any of them to work. The overall layout of the section, condensed from the above, is:
      <ul><li>....connection...</li></ul>
      <p><table>
      <tr><td>stuff</td><td>stuff2</td></tr>
      . . .
      </table>
      </p>
      What the heck is the XPath of the items below that section? Is it even possible to match this? I really don't understand XPath at all.

        Why do you keep using ->findvalues? Simply retrieve the nodes, or select the text within the nodes explicitly:

        //tr/td/text()

        Personally, I simply find the nodes and then use their ->as_text() method to get at their textual content.

        Finding the value of the list element is not really helping you as the table is not an element of the list. If you know there is only one table, this verbose example may help:
        # get all the tables
        my @tables = $tree->findnodes('//table');
        # get the first table
        my $table = $tables[0];
        # get all the rows of first table
        my @rows = $table->findnodes('tr');
        # loop through the rows
        for my $row ( @rows ) {
            # get all the cells
            my @cells = $row->findnodes('td');
            # loop through the cells
            for my $cell ( @cells ) {
                print $cell->as_text, "\n";
            }
        }

        Output:
        Remote Host
        Port Number
        192.5.5.241
        8091
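
If there is more than one table and the one wanted is the table that comes after a particular list item, the standard XPath 1.0 following axis is one way to express "the next table after this node". The sketch below is an illustration under assumptions: the HTML is a cleaned-up copy of the snippet in the question (the unclosed td and stray tr are repaired), and it relies on the XPath engine behind HTML::TreeBuilder::XPath handling the following axis, which has not been checked against every version of the module:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Cleaned-up copy of the snippet from the question (assumption: well-formed rows).
my $html = <<'HTML';
<html><body>
<ul><li>There was registered attempt to establish connection with the remote host. The connection details are:</li></ul>
<p><table class="tbl" cellpadding="5" cellspacing="0">
<tr><td class="cell_1_h">Remote Host</td><td class="cell_2_h">Port Number</td></tr>
<tr><td class="cell_1">192.5.5.241</td><td class="cell_2">8091</td></tr>
</table></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @cells;
for my $ul ( $tree->findnodes('//ul') ) {
    # Only the list whose item mentions the connection details is of interest.
    next unless $ul->findvalue('li') =~ /connection details are/;

    # The table is not a child of the <ul>, so 'td' or 'tr/td' relative to the
    # <ul> cannot match. following::table[1] selects the first <table> that
    # starts after the <ul> in document order, wherever it is nested.
    my ($table) = $ul->findnodes('following::table[1]') or next;

    # .//td is relative to the table and collects every cell below it.
    push @cells, map { $_->as_text } $table->findnodes('.//td');
}

print "$_\n" for @cells;
```

A leading / in an expression such as /tr/td makes the path absolute, starting at the document root, so it is evaluated the same way no matter which node findnodes is called on. That is why none of the paths tried from the li or ul node found the cells.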
Re: Problem getting fields out of an XPath node list
by Gangabass (Vicar) on Apr 03, 2016 at 14:07 UTC
    HTML::TreeBuilder::XPath is too slow, leaks memory, and is buggy, so I recommend using HTML::TreeBuilder::LibXML instead:
    use strict;
    use HTML::TreeBuilder::LibXML;
    use Data::Dumper;

    my $html = <<'HTML';
    <html><body>
    <table>
    <tr><th>Firstseen (UTC)</th><th>Version</th><th>Feodo C&amp;C</th><th>Status</th><th>SBL</th><th>ASN</th><th>Country</th><th>Lastseen (UTC)</th></tr>
    #parsing the below
    <tr bgcolor="#9d9595" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#9d9595';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/83.172.215.87/" target="_parent" title="Show more information about this Feodo C&amp;C">83.172.215.87</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#bc5959"><a href="http://www.spamhaus.org/sbl/sbl.lasso?query=SBL290535" target="_blank" title="Spamhaus SBL: SBL290535">SBL290535</a></td><td>AS12651 IPWORLDCOM</td><td><img src="images/flags/ch.gif" alt="-" title="CH (CH)" width="16" height="10" /> CH</td><td>never</td></tr>
    <tr bgcolor="#837b7b" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#837b7b';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/98.23.159.86/" target="_parent" title="Show more information about this Feodo C&amp;C">98.23.159.86</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#4f883f">Not listed</td><td>AS7029 WINDSTREAM</td><td><img src="images/flags/us.gif" alt="-" title="US (US)" width="16" height="10" /> US</td><td>never</td></tr>
    <tr bgcolor="#9d9595" onmouseover="this.style.backgroundColor='#FFA200';" onmouseout="this.style.backgroundColor='#9d9595';"><td>2016-03-19 23:44:36</td><td bgcolor="#58D3F7" align="center"><strong>D</strong></td><td><a href="/host/178.188.14.86/" target="_parent" title="Show more information about this Feodo C&amp;C">178.188.14.86</a></td><td bgcolor="#4f883f">offline</td><td bgcolor="#4f883f">Not listed</td><td>AS8447 TELEKOM-AT</td><td><img src="images/flags/at.gif" alt="-" title="AT (AT)" width="16" height="10" /> AT</td><td>2016-03-24 01:19:50</td></tr>
    </table>
    </body></html>
    HTML

    my $tree = HTML::TreeBuilder::LibXML->new;
    $tree->parse($html);
    $tree->eof;

    # //tr[td] selects only rows that contain <td> cells, skipping the header row
    my @tr_nodes = $tree->findnodes('//tr[td]');
    foreach my $tr_node (@tr_nodes) {
        my @text = $tr_node->findvalues('td');
        #my @text = $tr_node->findvalue('td'); #compare with this one! findvalue will concatenate all nodes for you
        print Dumper( \@text );
        #do something with @text...
    }
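
The difference between findvalue and findvalues that the code comment above points at can be seen in isolation with a tiny table. This is a sketch written against HTML::TreeBuilder::XPath, the module from the original question, with made-up cell contents a, b and c:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse('<table><tr><td>a</td><td>b</td><td>c</td></tr></table>');
$tree->eof;

my ($row) = $tree->findnodes('//tr');

# findvalue: a single string built from all matching nodes.
my $joined = $row->findvalue('td');

# findvalues: a list with one string per matching node.
my @each = $row->findvalues('td');

# findnodes + as_text: the same list, going through the element objects.
my @texts = map { $_->as_text } $row->findnodes('td');

print "findvalue:  $joined\n";
print "findvalues: @each\n";
print "as_text:    @texts\n";
```

Assigning the result of findvalue to an array, as in the original question, therefore still yields a one-element array holding the joined string.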
