question about lookaheads and threatexpert/html parsing

ejc1 has asked for the wisdom of the Perl Monks concerning the following question:

OK, so messing around with what tangent provided, I was able to get this to work for most of the fields. However, I ran Dumper \@nodes on a section of it and don't understand what I am looking at. I have this:

<ul><li>The following Internet Connections were established:</li></ul>
<p><table class="tbl" cellpadding="5" cellspacing="0">
<tr><td class="cell_1_h">Server Name</td><td class="cell_1_h">Server Port</td><td class="cell_1_h">Connect as User</td><td class="cell_2_h">Connection Password</td></tr>
<tr><td class="cell_1">127.0.0.2</td><td class="cell_1">80</td><td class="cell_1">127.0.0.2</td><td class="cell_2">127.0.0.2</td></tr>
<tr><td class="cell_1">127.0.0.3</td><td class="cell_1">80</td><td class="cell_1">127.0.0.3</td><td class="cell_2">127.0.0.3</td></tr>
<tr><td class="cell_1">127.0.0.4</td><td class="cell_1">80</td><td class="cell_1">127.0.0.4</td><td class="cell_2">127.0.0.4</td></tr>

but I can't seem to get it. I am doing:

$node->findvalue('tr') or next;
$text =~ m/^The following Internet Connections were established/ or next;
my @array = $node->findvalues('td/tr');

How do I interpret what Dumper \@nodes is telling me to make a proper node statement? Also, what is wrong with the node path I specified? Thanks!


Replies are listed 'Best First'.
Re: question about lookaheads and threatexpert/html parsing by afoken (Chancellor) on Mar 23, 2016 at 21:08 UTC
Don't even think of using RegExps. It won't work reliably. (Sure, if you generate the HTML yourself, you can write it in a way that can be "parsed" by RegExps. But then you would simply generate the data in a format that does not need a complex parser.)

CPAN has several HTML parsers. One that is not that obvious is XML::LibXML. Its main purpose is parsing and generating XML, but it can also parse (and to some extent, generate) HTML. It supports XPath, which makes tasks like "find all LI elements inside UL elements" easy. From there, extracting the text from the LI elements is trivial.

Alexander
-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
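To make the XML::LibXML suggestion concrete, here is a minimal sketch, not a definitive implementation. It assumes the markup has the shape of the sample posted in this thread (a list of addresses nested directly inside an outer list); the second address and the `@hosts` variable are made up for illustration, and the XPath expression would need adjusting for the full report page.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup in the shape posted in this thread; the second
# address is invented for the example.
my $html = <<'HTML';
<ul><li>The following Host Names were requested from a host database:</li>
<ul>
<li>192.5.5.241</li>
<li>192.5.5.242</li>
</ul></ul>
HTML

# recover => 2 asks the parser to repair sloppy HTML quietly
# instead of dying or printing warnings.
my $doc = XML::LibXML->load_html(
    string  => $html,
    recover => 2,
);

# "All LI elements inside a UL that is itself inside a UL".
my @hosts = map { $_->textContent } $doc->findnodes('//ul/ul/li');

print "$_\n" for @hosts;
```

The XPath expression does the selection; Perl only has to pull the text out of each matching node with `textContent`.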
Re: question about lookaheads and threatexpert/html parsing by GrandFather (Saint) on Mar 23, 2016 at 20:02 UTC
See the link "Markup in the Monastery" given at the bottom of the node editing page you used to enter your question text, and you will see that `<code>...</code>` tags can be used to wrap code and other stuff you don't want interpreted as HTML text.

How about showing us 10 or so "lines" of real data and the code you have tried so far to solve the problem? See "I know what I mean. Why don't you?" for tips about how you should present that.

My first take is that you should be using something like HTML::TreeBuilder to wrangle the input data.

Premature optimization is the root of all job security
Re^2: question about lookaheads and threatexpert/html parsing by Anonymous Monk on Mar 23, 2016 at 22:05 UTC
Actually, that _was_ the real data....

<ul><li>The following Host Name was requested from a host database:</li>
<ul>
<li>192.5.5.241</li>
</ul></ul>

Everything between the first `<ul>` after the host down to the first ul pairing at `</ul></ul>`. There is an unknown number of li line elements between these two ul statements. The precursor to that data chunk is the bit where it comments about "Host Name".

view-source:http://www.threatexpert.com/report.aspx?md5=ab41b1e2db77cebd9e2779110ee3915d

The above is a sample of the raw html file to be parsed.
Re: question about lookaheads and threatexpert/html parsing by tangent (Parson) on Mar 23, 2016 at 23:49 UTC
This is how you might do it with HTML::TreeBuilder::XPath:

use Data::Dumper;
use HTML::TreeBuilder::XPath;

my $html = q|
<ul><li>The following Host Names were requested from a host database:</li>
<ul>
<li>192.5.5.241</li>
<li>192.5.5.242</li>
</ul></ul>
|;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @wanted;
my @nodes = $tree->findnodes('//ul');
for my $node ( @nodes ) {
    my $text = $node->findvalue('li') or next;
    $text =~ m/^The following Host Name/ or next;
    @wanted = $node->findvalues('ul/li');
    last;
}
print Dumper \@wanted;

Output:

$VAR1 = [
          '192.5.5.241',
          '192.5.5.242'
        ];
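For the Internet Connections table in the original question, the same module can be applied along the lines of the sketch below. It is an illustration under stated assumptions, not tangent's code. Three things in the question's snippet stand out: the result of `findvalue` is never assigned to `$text`; the table comes after the closing `</ul>`, so a path relative to the `ul` node cannot reach it; and `td/tr` is inverted, since rows contain cells (`tr/td`). The sketch assumes the table can be recognised by its `class="tbl"` attribute and its first header cell, "Server Name". The closing `</table></p>` tags and the `@rows` variable are additions for the example; the posted fragment was cut off before them.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Fragment from the question, shortened to two data rows.
# The closing </table></p> is an assumption added for the example.
my $html = <<'HTML';
<ul><li>The following Internet Connections were established:</li></ul>
<p><table class="tbl" cellpadding="5" cellspacing="0">
<tr><td class="cell_1_h">Server Name</td><td class="cell_1_h">Server Port</td><td class="cell_1_h">Connect as User</td><td class="cell_2_h">Connection Password</td></tr>
<tr><td class="cell_1">127.0.0.2</td><td class="cell_1">80</td><td class="cell_1">127.0.0.2</td><td class="cell_2">127.0.0.2</td></tr>
<tr><td class="cell_1">127.0.0.3</td><td class="cell_1">80</td><td class="cell_1">127.0.0.3</td><td class="cell_2">127.0.0.3</td></tr>
</table></p>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @rows;    # one array ref of cell values per data row
for my $table ( $tree->findnodes('//table[@class="tbl"]') ) {
    my @tr = $table->findnodes('.//tr');

    # Identify the wanted table by its first header cell.
    my @header = @tr ? $tr[0]->findvalues('./td') : ();
    next unless @header && $header[0] eq 'Server Name';

    shift @tr;    # drop the header row
    # Rows contain cells, so the step is tr -> td, not td/tr.
    push @rows, [ $_->findvalues('./td') ] for @tr;
    last;
}
$tree->delete;

print join(', ', @$_), "\n" for @rows;
```

On reading the Dumper output: dumping element objects with Data::Dumper prints their internal hash (keys such as `_tag`, `_content` and `_parent`), which is hard to read. Calling `$node->dump` or `$node->as_HTML` on each node gives a more readable outline of where each element sits in the tree.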
Re: question about lookaheads and threatexpert/html parsing by Anonymous Monk on Mar 23, 2016 at 21:50 UTC
$ cat junk.html
<ul><li>The following Host Names were requested from a host database:</li>
<ul>
<li>192.5.5.241</li>
. . .
</ul></ul>

$ cat jonk.xsh
open --format html "junk.html";
# ls --indent /;
for //ul {
    pwd;
    for ./li {
        pwd;
        print text();
    };
    echo;
};
echo;

$ xsh -q jonk.xsh
/html/body/ul
/html/body/ul/li
The following Host Names were requested from a host database:
/html/body/ul/ul
/html/body/ul/ul/li
192.5.5.241

See also xpather.pl/htmltreexpather.pl, which can give you paths to start with, and all the links in "Re: Retrieve select information from HTML"; they're examples (for tree-xpath and others), walkthroughs, and tutorials. XML::XSH2: https://metacpan.org/pod/distribution/XML-XSH2/XSH2.pod#open
Re^2: question about lookaheads and threatexpert/html parsing by Anonymous Monk on Mar 23, 2016 at 22:07 UTC
Thanks! I will look at this tomorrow when I get back to work!
Re: question about lookaheads and threatexpert/html parsing by ejc1 (Novice) on Mar 23, 2016 at 19:23 UTC
Ok, it totally ate the formatting.

<ul><li>The following Host Names were requested from a host database:</li>
<ul>
<li>192.5.5.241</li>
. . .
</ul></ul>

