PerlMonks  

Re^2: Crawling Relative Links from Webpages

by listanand (Sexton)
on May 08, 2010 at 13:41 UTC


in reply to Re: Crawling Relative Links from Webpages
in thread Crawling Relative Links from Webpages

OK so maybe I am missing something here, because I am just unable to understand what's being said :(

$mech above uses a hard-coded link, which would of course work for this page. What about pages from other domains (say "xyz.com")?

How do I make the method generalizable?

Re^3: Crawling Relative Links from Webpages
by Corion (Patriarch) on May 08, 2010 at 14:34 UTC

    There is only one hard-coded address in the code:

    my $mech = WWW::Mechanize->new();
    $mech->get("http://dspace.mit.edu/handle/1721.1/53720");

    If you want to make that variable, maybe you want to pass the starting link from the command line? It will then be available via @ARGV:

    my $mech = WWW::Mechanize->new();
    warn "Fetching $ARGV[0]\n";
    $mech->get($ARGV[0]);

    Call it as

    perl -w listanand.pl http://google.com
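
A minimal sketch of the command-line approach described above, assuming the script is given one starting URL as its first argument. The helper name start_url and the usage message are hypothetical and not part of the original crawler; the WWW::Mechanize calls used are new, get, links, and the url_abs method of the link objects.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the starting URL out of an argument list.
# Returns undef when no usable argument was supplied.
sub start_url {
    my @args = @_;
    return unless @args && defined $args[0] && length $args[0];
    return $args[0];
}

my $url = start_url(@ARGV);
if (defined $url) {
    # Loaded only when there is something to fetch.
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new();
    warn "Fetching $url\n";
    $mech->get($url);

    # url_abs reports each link as an absolute URL, resolved against
    # the page it was found on, so relative links work for any domain.
    print $_->url_abs, "\n" for $mech->links;
}
else {
    warn "Usage: perl $0 <start-url>\n";
}
```

Checking the argument before fetching avoids an "uninitialized value" warning from $ARGV[0] when the script is run without a URL.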
      Ah yes of course. What was I even saying. I get it now.

      Thank you very much everyone. This has solved my problem!

      Although I still get a warning "Use of uninitialized value in string eq at crawler.pl line <line where I check for pdf mime type>". Makes me wonder...

      Andy

        I still get a warning "Use of uninitialized value in string eq at crawler.pl

        This line-

        no warnings "uninitialized";

        -isn't for show. :) A path that is a directory, like /, will not have a MIME type, and various other paths will fail to be found too. In those cases the value being compared is undefined, which is what triggers the warning.
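
If the undefined case should be handled explicitly rather than silenced, one option is to test for definedness before comparing. This is a minimal sketch, assuming the MIME type arrives as a plain string or undef; the helper name is_pdf is hypothetical, and the line in crawler.pl that actually produces the type is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when a MIME type is present and is PDF.
# Checking defined() first avoids "Use of uninitialized value in string eq".
sub is_pdf {
    my ($mime_type) = @_;
    return 0 unless defined $mime_type;
    return $mime_type eq 'application/pdf' ? 1 : 0;
}

# A directory-like path such as "/" may yield no MIME type at all (undef).
for my $type ('application/pdf', 'text/html', undef) {
    my $label = defined $type ? $type : '(no type)';
    print "$label: ", (is_pdf($type) ? "pdf" : "skip"), "\n";
}
```

Compared with no warnings "uninitialized", this keeps the warning category active for the rest of the script, so other undefined values are still reported.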
