Re^2: More robust link finding than HTML::LinkExtor/HTML::Parser?

by Allasso (Monk)
on May 08, 2011 at 11:33 UTC


in reply to Re: More robust link finding than HTML::LinkExtor/HTML::Parser?
in thread More robust link finding than HTML::LinkExtor/HTML::Parser?

Thank you for the links.

I want a script that works independently of a browser, so I don't think WWW::Mechanize::Firefox will work for me, unless you see a way I could use it to come up with code that runs without Firefox. If so, please let me know.

The second link looks more promising, now I just have to try to figure out what Mozilla is doing here :-)

I believe that HTML::LinkExtor will work fine for robustly extracting the links in the HTML :-); now I just need to find a way to extract them from CSS and JS.

Re^3: More robust link finding than HTML::LinkExtor/HTML::Parser?
by Anonymous Monk on May 08, 2011 at 12:17 UTC
    The second link looks more promising, now I just have to try to figure out what Mozilla is doing here :-)

    The second link is for use with WWW::Mechanize::Firefox.

    You need some kind of browser, something to interpret the JavaScript; there is no way around that.

    The other candidate is WWW::Scripter, a WWW::Mechanize subclass, but it's an alpha version, and my simple test didn't yield anything useful :)

    My other thought was to go straight for the supporting module CSS::DOM, but that didn't work out. Same goes for CSS/CSS::SAC/CSS::Tiny.

    I figure this ought to be robust enough for CSS:

    ## http://cpansearch.perl.org/src/NEVESENIN/CSS-Packer-1.000001/lib/CSS/Packer.pm
    our $DICTIONARY = {
        'STRING1' => qr~"(?>(?:(?>[^"\\]+)|\\.|\\"|\\\s)*)"~,
        'STRING2' => qr~'(?>(?:(?>[^'\\]+)|\\.|\\'|\\\s)*)'~
    };
    our $URL    = 'url\(\s*(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|[^\'"\s]+?)\s*\)';
    our $IMPORT = '\@import\s+(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|' . $URL . ')([^;]*);';
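    A minimal sketch of how those patterns might be applied to pull links out of a stylesheet. The css_links helper, the comment-stripping step and the sample CSS are assumptions made for illustration; none of them come from CSS::Packer. It does not resolve relative URLs or unescape backslash sequences inside quoted strings.

```perl
use strict;
use warnings;

# Patterns as posted above (taken from CSS::Packer 1.000001).
our $DICTIONARY = {
    'STRING1' => qr~"(?>(?:(?>[^"\\]+)|\\.|\\"|\\\s)*)"~,
    'STRING2' => qr~'(?>(?:(?>[^'\\]+)|\\.|\\'|\\\s)*)'~
};
our $URL    = 'url\(\s*(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|[^\'"\s]+?)\s*\)';
our $IMPORT = '\@import\s+(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|' . $URL . ')([^;]*);';

# Hypothetical helper (not part of CSS::Packer): return every link
# referenced by url(...) or @import in a string of CSS, in document order.
sub css_links {
    my ($css) = @_;
    my @links;

    # Assumption: drop /* ... */ comments first so commented-out rules are skipped.
    $css =~ s{/\*.*?\*/}{}gs;

    # Capture groups in /$IMPORT|$URL/:
    #   $1 = @import target (quoted string, or the whole url(...) form)
    #   $2 = inner value when the @import target is url(...)
    #   $3 = media list after the @import target (unused here)
    #   $4 = inner value of a standalone url(...)
    while ( $css =~ /$IMPORT|$URL/gi ) {
        my $raw = $4 // $2 // $1;
        $raw =~ s/^(['"])(.*)\1$/$2/s;    # strip matching surrounding quotes
        push @links, $raw;
    }
    return @links;
}

my $css = <<'CSS';
@import "base.css";
@import url(print.css) print;
body { background: url( 'img/bg.png' ) no-repeat; }
/* a { background: url(old.png) } */
.logo { background-image: url("img/logo.png"); }
CSS

print "$_\n" for css_links($css);
```

    With the sample input this should list base.css, print.css, img/bg.png and img/logo.png, skipping the commented-out old.png. Links that JavaScript builds at runtime are out of reach for this kind of pattern matching, which is why the browser-based route still matters for JS.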
