Re: LWP::UserAgent Bad and Forbidden requests

by 1arryb (Acolyte)
on Dec 15, 2011 at 19:27 UTC


in reply to LWP::UserAgent Bad and Forbidden requests

Hi taioba,

You have run afoul of the Robots Exclusion Protocol. Many websites prefer that real humans with real eyeballs visit their site. Some feel strongly enough to ban software "robots" such as LWP::UserAgent. Sciencedirect.com is one of these. If you look at the robots.txt file for sciencedirect.com, you'll see they only let the big boys (Google, et al.) spider their site. All others (including you) can go suck rocks. There is no (legit) solution to this problem except to call the webmasters and convince them that it is in their interest to allow your program to crawl their site. Good luck with that. Alternatively, see if the site has an RSS data feed or API that provides the data you need. APIs especially are less subject to interdiction by webmasters, since they are designed for program-to-program integration.
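If you want to check for yourself what a site permits, a rough sketch along these lines, using WWW::RobotRules from libwww-perl, will fetch and evaluate a robots.txt. The robot name and target URL below are just placeholders, not anything specific to sciencedirect.com:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    # Name your robot; the rules in robots.txt are matched against this agent name.
    my $rules = WWW::RobotRules->new('MyBot/1.0');

    # Fetch and parse the site's robots.txt.
    my $robots_url = 'http://www.sciencedirect.com/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Ask whether a particular URL is allowed for this agent.
    my $target = 'http://www.sciencedirect.com/science/some/page';   # placeholder URL
    print $rules->allowed($target)
        ? "robots.txt allows $target\n"
        : "robots.txt disallows $target\n";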

Cheers,

Larry


Replies are listed 'Best First'.
Re^2: LWP::UserAgent Bad and Forbidden requests
by Corion (Patriarch) on Dec 15, 2011 at 19:30 UTC

      Hi Corion,

      True, but... all LWP::RobotUA gets you is (a) client-side processing of robot rules (i.e., once the user agent has downloaded robots.txt for a site, it will abort a request for a banned URL before making it), and (b) an optional, configurable delay between requests so your program can be a good "netizen" and avoid hammering websites too hard. None of this prevents the web server from evaluating your user agent identification string and processing its robot rules to accept or reject your request.
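      For reference, a minimal sketch of that behaviour; the agent name, contact address, and URL below are just placeholders:

          use strict;
          use warnings;
          use LWP::RobotUA;

          # Identify the robot and give a contact address.
          my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');

          # Be polite: wait at least 1 minute between requests to the same server.
          $ua->delay(1);    # note: delay is in minutes, not seconds

          # RobotUA fetches and honours robots.txt before each request;
          # a disallowed URL comes back as a 403 generated on the client side.
          my $res = $ua->get('http://www.sciencedirect.com/');
          print $res->status_line, "\n";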

      Cheers,

      Larry

        A webserver in general does not care about robots.txt and does not enforce any of the rules in it. User agent rejection needs to be configured separately for the webserver.
Re^2: LWP::UserAgent Bad and Forbidden requests
by taioba (Acolyte) on Dec 17, 2011 at 16:37 UTC

    Thanks Larry and everybody else for your help. ScienceDirect indeed has an API and RSS feeds, so I guess I'll have to re-hack my code and make sure I go down the legit path. Wish y'all Happy Holidays at the monastery!
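    In case it helps anyone going down the same path, here is a rough sketch of pulling items out of an RSS feed with LWP::UserAgent and XML::RSS. The feed URL is a placeholder; substitute the real one from ScienceDirect:

        use strict;
        use warnings;
        use LWP::UserAgent;
        use XML::RSS;

        # Placeholder feed URL -- use the site's actual RSS feed address.
        my $feed_url = 'http://example.com/feed.rss';

        my $ua  = LWP::UserAgent->new;
        my $res = $ua->get($feed_url);
        die "Could not fetch feed: ", $res->status_line, "\n" unless $res->is_success;

        my $rss = XML::RSS->new;
        $rss->parse($res->decoded_content);

        # Print title and link for each item in the feed.
        for my $item (@{ $rss->{items} }) {
            print "$item->{title}\n  $item->{link}\n";
        }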
