Crawling Relative Links from Webpages
by listanand (Sexton) on May 08, 2010 at 00:14 UTC ( [id://838975] )
listanand has asked for the wisdom of the Perl Monks concerning the following question:
Hello perlmonks,

I am trying to use WWW::Mechanize to build a crawler that can walk a (large) set of webpages and pull out all the PDFs each page hosts. I am running into trouble with relative links: some pages use only "relative" links for certain types of files, and my crawler does not resolve them correctly. I retrieve the base URL with $mech->base() and prepend it to the HREF of the PDF, but that does not seem to work either.

I am writing the crawler for internal crawls, but here is one public example page that illustrates the problem: http://dspace.mit.edu/handle/1721.1/53720

So the question is: how do I adapt my crawler to pull PDFs from pages like this? A minimal sketch of what I am attempting is below. Any suggestions will be gratefully appreciated.
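For concreteness, here is a simplified sketch of the Mechanize side of things (trimmed down from my real crawler; the URL is just the example page above). My understanding is that the link objects returned by find_all_links() carry the page's base, so url_abs() should resolve relative HREFs to absolute URIs:

    use strict;
    use warnings;
    use WWW::Mechanize;

    # Example page from the question; any page with relative
    # PDF links behaves the same way.
    my $start = 'http://dspace.mit.edu/handle/1721.1/53720';

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($start);

    # find_all_links() returns WWW::Mechanize::Link objects;
    # url_abs() resolves each HREF against the page's base URL,
    # so relative links come back as absolute URIs.
    for my $link ( $mech->find_all_links( url_regex => qr/\.pdf$/i ) ) {
        print $link->url_abs, "\n";
        # $mech->mirror( $link->url_abs, $local_file );  # to fetch
    }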
Thank you.
Andy

PS: By the way, I also tried HTML::LinkExtor, and it does not work either: it does not produce the "right" URL for the PDF. It again just appends the "base" URL to the relative URL, as I did manually with Mechanize above. Roughly what I tried is sketched below.
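This is a minimal version of the HTML::LinkExtor variant (again simplified from the real thing). Passing the response's base URL as the second constructor argument is supposed to make the extracted links come back as absolute URIs:

    use strict;
    use warnings;
    use HTML::LinkExtor;
    use LWP::UserAgent;

    my $url = 'http://dspace.mit.edu/handle/1721.1/53720';
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($url);
    die $res->status_line unless $res->is_success;

    my @pdfs;
    # With a base URL as the second argument, HTML::LinkExtor
    # hands the callback absolute URI objects rather than the
    # raw relative HREFs from the markup.
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @pdfs, $attr{href}
                if $tag eq 'a' && defined $attr{href} && $attr{href} =~ /\.pdf$/i;
        },
        $res->base,    # base from the HTTP response
    );
    $extor->parse( $res->decoded_content );

    print "$_\n" for @pdfs;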