Re: LWP::UserAgent Bad and Forbidden requests

by 1arryb (Acolyte)
on Dec 15, 2011 at 19:27 UTC


in reply to LWP::UserAgent Bad and Forbidden requests

Hi taioba,

You have run afoul of the Robots Exclusion Protocol. Many websites prefer that real humans with real eyeballs visit their site. Some feel strongly enough to ban software "robots" such as LWP::UserAgent. Sciencedirect.com is one of these. If you look at the robots.txt file for sciencedirect.com, you'll see they only let the big boys (Google, et al.) spider their site. All others (including you) can go suck rocks. There is no (legit) solution to this problem except to call the webmasters and convince them that it is in their interest to allow your program to crawl their site. Good luck with that. Alternatively, see if the site has an RSS data feed or API that provides the data you need. APIs especially are less subject to interdiction by webmasters, since they are designed for program-to-program integration.
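If you want to check for yourself what a site permits, a rough sketch along these lines, using WWW::RobotRules from libwww-perl, will fetch and evaluate a robots.txt. The robot name and target URL below are just placeholders, not anything specific to sciencedirect.com:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    # Name your robot; the rules in robots.txt are matched against this agent name.
    my $rules = WWW::RobotRules->new('MyBot/1.0');

    # Fetch and parse the site's robots.txt.
    my $robots_url = 'http://www.sciencedirect.com/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Ask whether a particular URL is allowed for this agent.
    my $target = 'http://www.sciencedirect.com/science/some/page';   # placeholder URL
    print $rules->allowed($target)
        ? "robots.txt allows $target\n"
        : "robots.txt disallows $target\n";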

Cheers,

Larry


Replies are listed 'Best First'.
Re^2: LWP::UserAgent Bad and Forbidden requests
by Corion (Patriarch) on Dec 15, 2011 at 19:30 UTC

      Hi Corion,

      True, but... all LWP::RobotUA gets you is (a) client-side processing of robot rules (i.e., once the user agent has downloaded robots.txt for a site, it will abort a request for a banned URL before making it), and (b) an optional, configurable delay between requests so your program can be a good "netizen" and avoid hammering websites too hard. None of this prevents the web server from evaluating your user agent identification string and processing its robot rules to accept or reject your request.
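      For reference, a minimal sketch of that behaviour; the agent name, contact address, and URL below are just placeholders:

          use strict;
          use warnings;
          use LWP::RobotUA;

          # Identify the robot and give a contact address.
          my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');

          # Be polite: wait at least 1 minute between requests to the same server.
          $ua->delay(1);    # note: delay is in minutes, not seconds

          # RobotUA fetches and honours robots.txt before each request;
          # a disallowed URL comes back as a 403 generated on the client side.
          my $res = $ua->get('http://www.sciencedirect.com/');
          print $res->status_line, "\n";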

      Cheers,

      Larry

        A webserver in general does not care about robots.txt and does not enforce any of the rules in it. User agent rejection needs to be configured separately for the webserver.
Re^2: LWP::UserAgent Bad and Forbidden requests
by taioba (Acolyte) on Dec 17, 2011 at 16:37 UTC

    Thanks Larry and everybody else for your help. ScienceDirect indeed has an API and RSS feeds, so I guess I'll have to re-hack my code and make sure I go down the legit path. Wish y'all Happy Holidays at the monastery!
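    In case it helps anyone going down the same path, here is a rough sketch of pulling items out of an RSS feed with LWP::UserAgent and XML::RSS. The feed URL is a placeholder; substitute the real one from ScienceDirect:

        use strict;
        use warnings;
        use LWP::UserAgent;
        use XML::RSS;

        # Placeholder feed URL -- use the site's actual RSS feed address.
        my $feed_url = 'http://example.com/feed.rss';

        my $ua  = LWP::UserAgent->new;
        my $res = $ua->get($feed_url);
        die "Could not fetch feed: ", $res->status_line, "\n" unless $res->is_success;

        my $rss = XML::RSS->new;
        $rss->parse($res->decoded_content);

        # Print title and link for each item in the feed.
        for my $item (@{ $rss->{items} }) {
            print "$item->{title}\n  $item->{link}\n";
        }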
