
Re: Using HTTP::LinkExtor to get URL and description info

by crazyinsomniac (Prior)
on Aug 08, 2002 at 04:39 UTC

in reply to Using HTTP::LinkExtor to get URL and description info

You have to know your tools. HTML::LinkExtor was designed to extract only the links, not the text between the opening and closing tags (the link text, cdata, or whatever you want to call it).


use strict;
use Data::Dumper;
use HTML::LinkExtor;

my $base = '';
my $stringy = q{
<tr><td><a HREF="/">How does this code work (w</a></td>
    <td>by <a HREF="/">John M. Dlugosz</a></td></tr>
<tr><td><a HREF="/">Tk and X events</a></td>
    <td>by <a HREF="/">Anonymous Monk</a></td></tr>
<tr><td><a HREF="/">warnings::warnif etc. wise usage?</a></td>
    <td>by <a HREF="/">John M. Dlugosz</a></td></tr>
<tr><td><a HREF="/">52-bit numbers as floating point</a></td>
    <td>by <a HREF="/">John M. Dlugosz</a></td></tr>
};

# No callback: the links are collected and returned by links()
my $p = HTML::LinkExtor->new(undef, $base);
$p->parse($stringy);
print Dumper $p->links;

# With a callback: called once per link-bearing tag,
# with the tag name and its link attributes
$p = HTML::LinkExtor->new( sub { print Dumper($_) for @_; }, $base );
$p->parse($stringy);

And now for the nudge, HTML::TokeParser tutorial
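
To make the nudge concrete, here is a minimal sketch of pulling both the URL and the link text with HTML::TokeParser. The sample markup is borrowed from the snippet above, and the @links array is just an illustrative name, so treat this as one way to do it rather than the canonical answer.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup borrowed from the snippet above.
my $html = q{
<tr><td><a HREF="/">Tk and X events</a></td>
    <td>by <a HREF="/">Anonymous Monk</a></td></tr>
};

# Passing a scalar reference makes TokeParser parse the string itself
# instead of treating it as a file name.
my $p = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @links;    # illustrative: list of [href, link text] pairs
while (my $tag = $p->get_tag('a')) {
    # For a start tag, get_tag returns [$tagname, \%attr, \@attrseq, $text];
    # attribute names are lower-cased by the parser.
    my $href = $tag->[1]{href};
    next unless defined $href;              # skip <a name="..."> anchors
    my $text = $p->get_trimmed_text('/a');  # text up to the closing </a>
    push @links, [$href, $text];
}

print "$_->[0]\t$_->[1]\n" for @links;
```

get_trimmed_text also collapses runs of whitespace, which is handy when the link text is wrapped across several lines in the page source.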

update: surprise, surprise, I've solved this one before (crazyinsomniac) Re: Getting the Linking Text from a page

Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

Re: Re: Using HTTP::LinkExtor to get URL and description info
by Popcorn Dave (Abbot) on Aug 08, 2002 at 05:58 UTC
    Thanks for that!

    I'll be looking at that tomorrow for certain, but I do have one question. My program takes headlines off of newspaper sites. At the moment I'm using LWP::Simple with get(URL), dumping the result into an array, reading through to a certain pre-determined point, and then using a regex to get the info I want.

    Is HTML::TokeParser going to allow me to do that type of thing or will I have to write new "rules" to determine what is a headline and what is just a link on the page?

    Thanks again!

    Some people fall from grace. I prefer a running start...

      I would suggest the CPAN module HTML::Parser. It's pretty straightforward:
      use HTML::Parser;

      my $in_a = 0;    # true while we are inside <a>...</a>

      my $p = HTML::Parser->new(
          start_h   => [\&start,   "tagname"],
          end_h     => [\&end,     "tagname"],
          default_h => [\&default, "text"],   # also fires for comments etc.
      );

      # $some_html holds the page source
      $p->parse($some_html);    # or: $p->parse_file(\*SOME_FH);
      $p->eof;                  # flush any buffered text

      sub start { my ($tagname) = @_; $in_a = 1 if $tagname eq 'a'; }
      sub end   { my ($tagname) = @_; $in_a = 0 if $tagname eq 'a'; }
      sub default {
          my ($text) = @_;
          # do something with $text if $in_a
      }
      HTH. Off the top of my head. Check the HTML::Parser PoD for absolute correctness.
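
      On the "read through to a pre-determined point" part of the question: a rough sketch of the same idea with HTML::TokeParser, skipping ahead to a landmark tag and collecting links only until that section closes. The page layout here (a table with id="headlines") and the story URLs are invented for illustration; a real newspaper page will need its own landmark.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical page layout: headlines live inside <table id="headlines">.
my $html = q{
<a href="/about">About us</a>
<table id="headlines">
<tr><td><a href="/story1">First headline</a></td></tr>
<tr><td><a href="/story2">Second headline</a></td></tr>
</table>
<a href="/contact">Contact</a>
};

my $p = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

# Skip ahead to the landmark, ignoring everything before it.
my $found = 0;
while (my $tag = $p->get_tag('table')) {
    if (($tag->[1]{id} || '') eq 'headlines') {
        $found = 1;
        last;
    }
}

my @headlines;    # list of [href, headline text] pairs
if ($found) {
    # Collect links until the table closes; get_tag accepts several
    # tag names and returns whichever one comes first.
    while (my $tag = $p->get_tag('a', '/table')) {
        last if $tag->[0] eq '/table';
        my $href = $tag->[1]{href};
        next unless defined $href;
        push @headlines, [$href, $p->get_trimmed_text('/a')];
    }
}

print "$_->[1] => $_->[0]\n" for @headlines;
```

      The rule for what counts as a headline still has to come from you, but it becomes a rule about tags and attributes instead of a regex over raw lines.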
