http://qs321.pair.com?node_id=1159020


in reply to Re: Problem getting fields out of an XPath node list
in thread Problem getting fields out of an XPath node list

I managed to get it out badly with
my @nodes = $tree->findnodes('//tr');
for my $node (@nodes) {
    my @text = $node->findvalues('td') or next;
    print Dumper \@text;
}
It is bad, in that I still have no clue what XPath is doing, despite reading the documentation on it. It only works because, as far as I can see, there is only one table... I am trying to get the following data parsed:
<ul><li>There was registered attempt to establish connection with the remote host. The connection details are:</li></ul>
<p><table class="tbl" cellpadding="5" cellspacing="0">
<tr><td class="cell_1_h">Remote Host</td><td class="cell_2_h">Port Number</td></tr>
<tr><td class="cell_1">192.5.5.241<td class="cell_2">8091</td><tr>
</table></p>
but am having zero luck. I am trying:
my @nodes = $tree->findnodes('//ul');
for my $node (@nodes) {
    my $text2 = $node->findvalue('li') or next;
    if ($text2 =~ m/connection details are/) {
        print "$text2\n";
        my @text = $node->findvalues('/tr/td');
        print Dumper \@text;
    }
}
The problem is, it clearly finds the li node, matches it, and then tries to run findvalues against /tr/td. This totally doesn't work.... I have tried '/tr/td', '//tr/td', and 'td', and can't get any of them to work at all. The overall structure of the section, as pasted above, is:
<ul><li>....connection...</li></ul>
<p><table>
<tr><td>stuff</td><td>stuff2</td></tr>
.
.
.
</table>
</p>
What the heck is the XPath of the items below that section? Is it even possible to match this? I totally don't understand XPath at all....
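For what it's worth, the table is not inside the `<ul>`, so no child path such as 'td' can reach it from the list node, and a path with a leading slash such as '/tr/td' is evaluated from the document root rather than from the node. One way to get from the list item to "the table that comes after it" is the XPath `following::` axis. The sketch below is only an illustration under stated assumptions: it uses a cleaned-up copy of the fragment (closing tags repaired), adds an invented, unrelated table to show that the match is scoped, and assumes HTML::TreeBuilder::XPath is the parser in use.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Cleaned-up copy of the fragment from the post (closing tags repaired).
# The first table is invented here, only to show that it is not matched.
my $html = <<'HTML';
<table><tr><td>unrelated</td><td>table</td></tr></table>
<ul><li>There was registered attempt to establish connection with the remote host. The connection details are:</li></ul>
<p><table class="tbl" cellpadding="5" cellspacing="0">
<tr><td class="cell_1_h">Remote Host</td><td class="cell_2_h">Port Number</td></tr>
<tr><td class="cell_1">192.5.5.241</td><td class="cell_2">8091</td></tr>
</table></p>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Find the <ul> whose <li> mentions the connection details, then step to
# the first <table> that follows it in document order.
my ($table) = $tree->findnodes(
    q{//ul[li[contains(., 'connection details are')]]/following::table[1]}
);

# Collect the text of each two-cell row of that one table.
my @pairs;
if ($table) {
    for my $row ($table->findnodes('.//tr')) {
        my @cells = map { $_->as_text } $row->findnodes('td');
        next unless @cells == 2;    # skip empty or malformed rows
        push @pairs, \@cells;
    }
}

print join(' : ', @$_), "\n" for @pairs;
$tree->delete;
```

The first entry of `@pairs` is the header row; `shift @pairs` would drop it if only the host and port values are wanted.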

Re^3: Problem getting fields out of an XPath node list
by Corion (Patriarch) on Mar 29, 2016 at 14:32 UTC

    Why do you keep using ->findvalues ? Simply retrieve the nodes or find the text within the nodes explicitly:

    /tr/td/text()

    Personally, I simply find the nodes and then use their ->as_text() method to get at their textual content.

      I've tried to look at the raw nodes with Dumper, and I can't make any sense of it. The document is very complex (see http://www.threatexpert.com/report.aspx?md5=2aafcad88572d98c154ab7d80cbafc02) and, as I mentioned, I have zero understanding of XPath. I looked at as_text, but the problem is that I just don't understand the XPath format well enough to even attempt to scope my node elements to just the one section I mentioned. If I understood how the nodes were built, I think I could be OK, but to be honest, I just totally don't get this at all. When I do '//tr/td', I get _all_ of the td elements in one giant array, instead of narrowing the damn thing to the one section I tried to match against in my post. :(
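A note on why '//tr/td' returns every cell: in XPath, a path that starts with `/` or `//` is anchored at the document root, even when the method is called on an inner node. Prefixing the path with `.` anchors it at the node instead. A minimal sketch, assuming HTML::TreeBuilder::XPath and two small tables invented purely for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Two small tables invented for illustration; the ids are hypothetical.
my $html = <<'HTML';
<table id="first"><tr><td>a</td><td>b</td></tr></table>
<table id="second"><tr><td>c</td><td>d</td></tr></table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my ($second) = $tree->findnodes('//table[@id="second"]');

# '//td' is anchored at the document root, even when called on $second.
my @everywhere = map { $_->as_text } $second->findnodes('//td');

# './/td' is anchored at $second, so only its own cells match.
my @scoped = map { $_->as_text } $second->findnodes('.//td');

print "absolute: @everywhere\n";
print "relative: @scoped\n";
$tree->delete;
```

So once the right table (or row) node is in hand, relative paths such as 'tr', 'td', or './/td' keep the search inside it.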

        I recommend that you learn XPath.

        There are also browser plugins that show you the XPath to a node if you click on its HTML element.

        If XPath feels too complex for you to tackle but HTML / CSS selectors feel more accessible to you, you can easily convert most CSS selectors to XPath by using HTML::Selector::XPath.
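As a rough sketch of that last suggestion, HTML::Selector::XPath exports `selector_to_xpath`, which turns a CSS selector into an XPath string that can be passed to `findnodes` or `findvalues`. The HTML below is a cleaned-up copy of the table from the original post, and the selector 'table.tbl td.cell_1' is just one plausible choice for picking the host cells, not something taken from the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Cleaned-up copy of the table from the original post.
my $html = <<'HTML';
<table class="tbl">
<tr><td class="cell_1_h">Remote Host</td><td class="cell_2_h">Port Number</td></tr>
<tr><td class="cell_1">192.5.5.241</td><td class="cell_2">8091</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Translate a CSS selector into an XPath expression string.
my $xpath = selector_to_xpath('table.tbl td.cell_1');
print "XPath: $xpath\n";

# Use the generated expression like any hand-written XPath.
my @hosts = $tree->findvalues($xpath);
print "host: $_\n" for @hosts;
$tree->delete;
```

Printing the generated expression is also a handy way to see how a familiar CSS selector reads in XPath.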

Re^3: Problem getting fields out of an XPath node list
by tangent (Parson) on Mar 29, 2016 at 15:51 UTC
    Finding the value of the list element is not really helping you as the table is not an element of the list. If you know there is only one table, this verbose example may help:
    # get all the tables
    my @tables = $tree->findnodes('//table');
    # get the first table
    my $table = $tables[0];
    # get all the rows of first table
    my @rows = $table->findnodes('tr');
    # loop through the rows
    for my $row ( @rows ) {
        # get all the cells
        my @cells = $row->findnodes('td');
        # loop through the cells
        for my $cell ( @cells ) {
            print $cell->as_text, "\n";
        }
    }

    Output:
    Remote Host
    Port Number
    192.5.5.241
    8091
      So, I don't understand why $cell->as_text gives the data, when Dumper \@cells prints a giant ton of garbage. Also, even though I have specified the table element as
      my @tables = $tree->findnodes('//table'); my $table = $tables[12];
      I can't reference this directly. Printing @cells[2]->as_text fails outright with "can't call method 'as_text' on an undefined value". It is clearly in there as
      my @cells = $row->findnodes('td')
      .... Anything I do to @cells flat out fails except for looking through with the mentioned
      for my $cell (@cells)... print $cell->as_text
      At this juncture, I am about to totally give up on this, since I do not understand this at all and have no other way to parse it. Since as_text dumps this one entry at a time, I was hoping to process the even elements of @cells as host/IP address and the odd ones as the previous element's port. But I just don't get this at all.
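A likely explanation for both symptoms, offered as a guess from the code shown: each element of @cells is an HTML::Element object, so Dumper prints the object's internals (including links back to its parent) rather than its text, and @cells is rebuilt for every row, so with two cells per row only indices 0 and 1 exist and index 2 is undefined. The sketch below assumes HTML::TreeBuilder::XPath and a cleaned-up two-row table, and pairs host and port per row instead of by even and odd positions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Cleaned-up copy of the table from the original post.
my $html = <<'HTML';
<table>
<tr><td>Remote Host</td><td>Port Number</td></tr>
<tr><td>192.5.5.241</td><td>8091</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my @rows = $tree->findnodes('//table//tr');

my @seen;
for my $row (@rows) {
    # @cells holds the cells of this row only, as HTML::Element objects.
    my @cells = $row->findnodes('td');

    # Two cells per row: valid indices are 0 and 1; $cells[2] is undef.
    next unless @cells >= 2;

    # Convert the objects to plain text before printing or dumping them.
    my ($host, $port) = map { $_->as_text } @cells[0, 1];
    push @seen, "$host:$port";
}

print "$_\n" for @seen;
$tree->delete;
```

The `next unless @cells >= 2` guard also skips any empty row the parser may create from the stray `<tr>` in the original markup.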
        Note that the HTML block you provide is not valid (missing </td>, and a <tr> instead of </tr>). It would help if you showed us your desired output. It may be better to build up a Perl data structure first and then extract the values you need:
        # ... as before
        my @aoa;
        for my $row ( @rows ) {
            my @cells = $row->findnodes('td');
            my @ary = map { $_->as_text } @cells;
            push( @aoa, \@ary );
        }
        print Dumper( \@aoa );

        print "Headers:\n";
        my $headers = shift @aoa;
        print "$headers->[0], $headers->[1]\n";
        print "Rows:\n";
        for my $ary ( @aoa ) {
            print "$headers->[0]: $ary->[0], $headers->[1]: $ary->[1]\n";
        }

        Output:
        $VAR1 = [
                  [
                    'Remote Host',
                    'Port Number'
                  ],
                  [
                    '192.5.5.241',
                    '8091'
                  ]
                ];
        Headers:
        Remote Host, Port Number
        Rows:
        Remote Host: 192.5.5.241, Port Number: 8091