PerlMonks  

question about lookaheads and threatexpert/html parsing

by ejc1 (Novice)
on Mar 23, 2016 at 19:04 UTC ( [id://1158645]=perlquestion )

ejc1 has asked for the wisdom of the Perl Monks concerning the following question:

Ok, so messing around with what Tangent provided, I was able to get this to work for most of the fields. However, I ran Dumper \@nodes on a section of it and don't understand what I am looking at. I have this:

    <ul><li>The following Internet Connections were established:</li></ul>
    <p><table class="tbl" cellpadding="5" cellspacing="0">
    <tr><td class="cell_1_h">Server Name</td><td class="cell_1_h">Server Port</td><td class="cell_1_h">Connect as User</td><td class="cell_2_h">Connection Password</td></tr>
    <tr><td class="cell_1">127.0.0.2</td><td class="cell_1">80</td><td class="cell_1">127.0.0.2</td><td class="cell_2">127.0.0.2</td></tr>
    <tr><td class="cell_1">127.0.0.3</td><td class="cell_1">80</td><td class="cell_1">127.0.0.3</td><td class="cell_2">127.0.0.3</td></tr>
    <tr><td class="cell_1">127.0.0.4</td><td class="cell_1">80</td><td class="cell_1">127.0.0.4</td><td class="cell_2">127.0.0.4</td></tr>

but can't seem to get it. I am doing

    $node->findvalue('tr') or next;
    $text =~ m/^The following Internet Connections were established/ or next;
    my @array = $node->findvalues('td/tr');

How do I interpret what Dumper \@nodes is telling me so that I can write a proper node statement? Also, what is wrong with the node path I specified? Thanks!

Replies are listed 'Best First'.
Re: question about lookaheads and threatexpert/html parsing
by afoken (Chancellor) on Mar 23, 2016 at 21:08 UTC

    Don't even think of using RegExps. It won't work reliably.

    (Sure, if you generate the HTML, you can write in a way that can be "parsed" by RegExps. But then, you would simply generate data in a format that does not need a complex parser.)

    CPAN has several HTML parsers. One that is not that obvious is XML::LibXML. Its main purpose is parsing and generating XML, but it can also parse (and to some extent, generate) HTML. It supports XPath, which makes tasks like "find all LI elements inside UL elements" easy. From there, extracting the text from the LI elements is trivial.
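    As a rough illustration of the XPath idea above, here is a minimal sketch using XML::LibXML on the host-name sample from this thread. The second address, the variable names, and the exact XPath expression are assumptions made for the example, not something taken from the real report.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup modelled on the snippet posted in this thread
# (the second address is made up for illustration).
my $html = <<'HTML';
<ul><li>The following Host Names were requested from a host database:</li>
<ul>
<li>192.5.5.241</li>
<li>192.5.5.242</li>
</ul></ul>
HTML

# recover => 2 lets libxml2 repair sloppy HTML without printing warnings.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# Find the heading LI, then take the LI children of the first UL after it.
my $xpath = '//li[starts-with(normalize-space(.), "The following Host Name")]'
          . '/following::ul[1]/li';

my @hosts = map { $_->textContent } $doc->findnodes($xpath);

print "$_\n" for @hosts;
```

    Anchoring on the heading text and then stepping to the next UL avoids depending on how the parser nests the inner list.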

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: question about lookaheads and threatexpert/html parsing
by GrandFather (Saint) on Mar 23, 2016 at 20:02 UTC

    See the link Markup in the Monastery given at the bottom of the node editing page you used to enter your question text, and you will see that <code>...</code> tags can be used to wrap code and other stuff you don't want interpreted as HTML text.

    How about showing us 10 or so "lines" of real data and the code you have tried so far to solve the problem? See I know what I mean. Why don't you? for tips about how you should present that.

    My first take is that you should be using something like HTML::TreeBuilder to wrangle the input data.
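
    A minimal sketch of what that could look like with plain HTML::TreeBuilder, assuming input shaped like the host-name snippet posted in this thread (the second address and all variable names are made up for the example):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modelled on the snippet posted in this thread.
my $html = <<'HTML';
<ul><li>The following Host Names were requested from a host database:</li>
<ul>
<li>192.5.5.241</li>
<li>192.5.5.242</li>
</ul></ul>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Locate the LI that carries the heading text.
my ($heading) = $tree->look_down(
    _tag => 'li',
    sub { $_[0]->as_trimmed_text =~ /^The following Host Name/ },
);

my @hosts;
if ($heading) {
    # look_down returns matches in document order, so the heading LI
    # comes first and the remaining LIs hold the addresses.
    my ( undef, @items ) = $heading->parent->look_down( _tag => 'li' );
    @hosts = map { $_->as_trimmed_text } @items;
}

print "$_\n" for @hosts;

$tree->delete;    # release the tree when done
```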

    Premature optimization is the root of all job security
      Actually, that _was_ the real data....

          <ul><li>The following Host Name was requested from a host database:</li>
          <ul>
          <li>192.5.5.241</li>
          </ul></ul>

      I want everything between the first <ul> after the host line down to the closing </ul></ul> pair. There is an unknown number of li elements between these two ul tags. The precursor to that data chunk is the bit that mentions "Host Name".

      view-source:http://www.threatexpert.com/report.aspx?md5=ab41b1e2db77cebd9e2779110ee3915d

      The above is a sample of the raw html file to be parsed.
Re: question about lookaheads and threatexpert/html parsing
by tangent (Parson) on Mar 23, 2016 at 23:49 UTC
    This is how you might do it with HTML::TreeBuilder::XPath:
        use Data::Dumper;
        use HTML::TreeBuilder::XPath;

        my $html = q|
        <ul><li>The following Host Names were requested from a host database:</li>
        <ul>
        <li>192.5.5.241</li>
        <li>192.5.5.242</li>
        </ul></ul>
        |;

        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse($html);
        $tree->eof;

        my @wanted;
        my @nodes = $tree->findnodes('//ul');
        for my $node ( @nodes ) {
            my $text = $node->findvalue('li') or next;
            $text =~ m/^The following Host Name/ or next;
            @wanted = $node->findvalues('ul/li');
            last;
        }
        print Dumper \@wanted;

    Output:

        $VAR1 = [
                  '192.5.5.241',
                  '192.5.5.242'
                ];
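
    For the "Internet Connections" table in the original question, the same loop needs two changes, sketched below under some assumptions. First, in the posted markup the table comes after the <ul> rather than inside it, so a path relative to the <ul> node cannot reach it by going downward. Second, 'td/tr' is inverted, because td cells are children of tr rows. The sketch steps forward with the XPath following axis; the closing </table> tag, the shortened sample, and the variable names are additions for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened sample based on the markup in the question; the closing
# </table></p> tags are an assumption added to make it self-contained.
my $html = <<'HTML';
<ul><li>The following Internet Connections were established:</li></ul>
<p><table class="tbl" cellpadding="5" cellspacing="0">
<tr><td class="cell_1_h">Server Name</td><td class="cell_1_h">Server Port</td><td class="cell_1_h">Connect as User</td><td class="cell_2_h">Connection Password</td></tr>
<tr><td class="cell_1">127.0.0.2</td><td class="cell_1">80</td><td class="cell_1">127.0.0.2</td><td class="cell_2">127.0.0.2</td></tr>
</table></p>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my @rows;
for my $node ( $tree->findnodes('//ul') ) {
    my $text = $node->findvalue('li') or next;
    $text =~ m/^The following Internet Connections were established/ or next;

    # The table follows the <ul>, so look forward in the document
    # instead of searching below the <ul> node.
    my ($table) = $node->findnodes('following::table[1]') or last;

    # One array reference per row, holding the text of each cell.
    for my $tr ( $table->findnodes('.//tr') ) {
        push @rows, [ $tr->findvalues('td') ];
    }
    last;
}

# The first row is the header; the rest are data rows.
my ( $header, @data ) = @rows;
print join( ' | ', @$_ ), "\n" for @data;
```

    When inspecting nodes, printing $node->as_HTML is usually easier to read than Dumper output, which shows the internal structure of the element objects rather than the markup they represent.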
Re: question about lookaheads and threatexpert/html parsing
by Anonymous Monk on Mar 23, 2016 at 21:50 UTC
      Thanks! I will look at this tomorrow when I get back to work!
Re: question about lookaheads and threatexpert/html parsing
by ejc1 (Novice) on Mar 23, 2016 at 19:23 UTC
    Ok, it totally ate the formatting.

        <ul><li>The following Host Names were requested from a host database:</li>
        <ul>
        <li>192.5.5.241</li>
        . . .
        </ul></ul>
