
Re: More robust link finding than HTML::LinkExtor/HTML::Parser?

by Anonymous Monk
on May 08, 2011 at 03:18 UTC


in reply to More robust link finding than HTML::LinkExtor/HTML::Parser?

HTML::LinkExtor / HTML::Parser are robust. They do a different job, but they are robust. Implying they aren't robust is poor form.

See WWW::Mechanize::Firefox and http://mxr.mozilla.org/firefox/source/browser/base/content/pageinfo/pageInfo.js

Re^2: More robust link finding than HTML::LinkExtor/HTML::Parser?
by Allasso (Monk) on May 08, 2011 at 10:47 UTC
    HTML::LinkExtor / HTML::Parser are robust. They do a different job, but they are robust. Implying they aren't robust is poor form.

    Yes, I agree. I was not mindful of the wording of my question.
Re^2: More robust link finding than HTML::LinkExtor/HTML::Parser?
by Allasso (Monk) on May 08, 2011 at 11:33 UTC
    Thank you for the links.

    I wish to have a script that works independently of a browser. So I don't think WWW::Mechanize::Firefox will work for me, unless you were seeing a way that I could utilize this to come up with code for a script that works independently of Firefox. If so, please let me know.

    The second link looks more promising, now I just have to try to figure out what Mozilla is doing here :-)

    I believe that HTML::LinkExtor will work fine for extracting the links in the HTML robustly :-); I just need now to find a way to extract them from CSS and JS.
      The second link looks more promising, now I just have to try to figure out what Mozilla is doing here :-)

      The second link is for use with WWW::Mechanize::Firefox.

      You need some kind of browser, something to interpret the JavaScript; there is no way around that.

      The other candidate is WWW::Scripter, a WWW::Mechanize subclass, but it's an alpha version, and my simple test didn't yield anything useful :)

      My other thought was to go straight for the supporting module CSS::DOM, but that didn't work out. Same goes for CSS/CSS::SAC/CSS::Tiny.

      I figure this ought to be robust enough for CSS:

      ## http://cpansearch.perl.org/src/NEVESENIN/CSS-Packer-1.000001/lib/CSS/Packer.pm
      our $DICTIONARY = {
          'STRING1' => qr~"(?>(?:(?>[^"\\]+)|\\.|\\"|\\\s)*)"~,
          'STRING2' => qr~'(?>(?:(?>[^'\\]+)|\\.|\\'|\\\s)*)'~
      };
      our $URL    = 'url\(\s*(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|[^\'"\s]+?)\s*\)';
      our $IMPORT = '\@import\s+(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|' . $URL . ')([^;]*);';
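For what it's worth, here is a minimal sketch of how those patterns might be applied to pull link targets out of a stylesheet. The helpers css_links and _unquote are hypothetical names made up for illustration and are not part of CSS::Packer. The combined pattern and the /i flag are my own assumptions too, and none of this has been tried against real-world CSS.

```perl
use strict;
use warnings;

# Patterns as quoted above from CSS::Packer 1.000001.
our $DICTIONARY = {
    'STRING1' => qr~"(?>(?:(?>[^"\\]+)|\\.|\\"|\\\s)*)"~,
    'STRING2' => qr~'(?>(?:(?>[^'\\]+)|\\.|\\'|\\\s)*)'~
};
our $URL = 'url\(\s*(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|[^\'"\s]+?)\s*\)';

# Assumption: one combined pattern so matches come back in document order.
# Group 1 catches the bare-string form (@import "x.css"); group 2 catches
# every url(...), including @import url(...).
my $STRING  = $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2};
my $LINK_RE = qr/\@import\s+($STRING)|$URL/i;

# Hypothetical helper: strip one pair of matching surrounding quotes.
sub _unquote {
    my ($s) = @_;
    $s =~ s/\A(["'])(.*)\1\z/$2/s;
    return $s;
}

# Hypothetical helper: return every link target found in a CSS string.
sub css_links {
    my ($css) = @_;
    my @links;
    while ( $css =~ /$LINK_RE/g ) {
        push @links, _unquote( defined $1 ? $1 : $2 );
    }
    return @links;
}

my $css = <<'CSS';
@import "base.css";
@import url(print.css) print;
body { background: url( 'img/bg.png' ) no-repeat; }
.logo { background-image: url(http://example.com/logo.gif); }
CSS

print "$_\n" for css_links($css);
```

On the sample stylesheet this should list base.css, print.css, img/bg.png and http://example.com/logo.gif, in that order. It does not skip CSS comments, so commented-out rules get reported as well, and the results are raw strings that still need resolving against the stylesheet's own URL.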
