Re: Passing timeout params through WWW::RobotRules

by trwww (Priest)
on Apr 17, 2010 at 16:12 UTC


in reply to Passing timeout params through WWW::RobotRules

EDIT: almut demonstrates above that there is a UA subclass that respects robots.txt files. That is probably what you're looking for.
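
(That subclass is presumably LWP::RobotUA, which ships with libwww-perl. A minimal sketch of using it with a timeout; the bot name, email address, and URL below are placeholders:)

    use LWP::RobotUA;

    # LWP::RobotUA is an LWP::UserAgent subclass that fetches and obeys
    # robots.txt for you before each request.
    my $ua = LWP::RobotUA->new('my-bot/1.0', 'me@example.com');
    $ua->timeout(10);   # seconds; inherited from LWP::UserAgent
    $ua->delay(1);      # minutes to wait between requests to one host

    my $response = $ua->get('http://some.place/page.html');
    print $response->status_line, "\n";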

> I've always had a hard time figuring out how you pass values to a module, when you're using a module that depends on others that you need to tinker with.

> An example is now when I'm playing around with WWW::RobotRules.

The latter is not an example of the former. WWW::RobotRules does not depend on another module at all; the synopsis just shows one way a person could fetch the source of a robots.txt file to feed to WWW::RobotRules. To be clear: there is no dependency there, only an example of one possible way to do it.
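
For instance, the robots.txt text can come from anywhere, with no fetching module involved at all. A minimal sketch, using a string literal and made-up rules and URLs:

    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MOMspider/1.0');

    # No HTTP client here: the content is just a string we already have.
    my $robots_txt = "User-agent: *\nDisallow: /private/\n";

    $rules->parse('http://some.place/robots.txt', $robots_txt);
    print $rules->allowed('http://some.place/private/x.html')
        ? "allowed\n" : "blocked\n";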

> This module uses LWP::Simple for its requests (I think)

I am unable to determine what leads you to think this, as it is not the case at all. The synopsis makes clear that LWP::Simple is only used to fetch data to feed to WWW::RobotRules, and that WWW::RobotRules provides no interface to fetch a network-based resource for you. Your example is also incomplete: you have not shown how you define $robots_txt.
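
(For comparison, this is roughly how the synopsis gets $robots_txt; LWP::Simple's get returns the body as a plain string, or undef on failure:)

    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MOMspider/1.0');

    my $robots_url = 'http://some.place/robots.txt';
    my $robots_txt = get($robots_url);   # string, or undef on error

    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;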

> Just to make it clear, I know how to read the docs for LWP::Simple ;) ... it's just a matter of how do I access this layer, since RobotRules has precedence.

You've confused yourself about what your problem actually is, because you think there are relationships in your code where there are none. Your real problem is that the synopsis of WWW::RobotRules uses LWP::Simple to fetch the robots.txt file, and you don't know how to rewrite that part to enable the LWP features you need. Here is the code you are looking for:

    use LWP::UserAgent;
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('MOMspider/1.0');

    # Use a full LWP::UserAgent (rather than LWP::Simple) so that a
    # timeout can be set on the request.
    my $ua = LWP::UserAgent->new;
    $ua->timeout(10);

    my $robots_url = 'http://some.place/robots.txt';
    my $response   = $ua->get($robots_url);

    if ($response->is_success) {
        my $robots_txt = $response->decoded_content;
        $rules->parse($robots_url, $robots_txt);

        # $url is the page you actually want to fetch, defined elsewhere
        if ( $rules->allowed($url) ) {
            ...
        }
    }
    else {
        die "can't fetch $robots_url: " . $response->status_line;
    }
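
The point is that the timeout belongs to the LWP::UserAgent object doing the fetching; WWW::RobotRules never touches the network, it only parses the string you hand it.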

Replies are listed 'Best First'.
Re^2: Passing timeout params through WWW::RobotRules
by perlpreben (Beadle) on Apr 17, 2010 at 18:03 UTC
    Ahh, I truly misunderstood then. But that makes everything very clear. I can use Mechanize then (since it's the one I'm using in the other parts). Thank you so much for making this clear for me :)
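
    (For what it's worth, WWW::Mechanize is itself an LWP::UserAgent subclass, so the same timeout setting carries over. A minimal sketch; the URL is a placeholder:)

        use WWW::Mechanize;

        # Constructor options Mechanize doesn't recognize, like timeout,
        # are passed through to LWP::UserAgent.
        my $mech = WWW::Mechanize->new( timeout => 10 );

        # Note: by default Mechanize dies on a failed fetch (autocheck => 1).
        my $response   = $mech->get('http://some.place/robots.txt');
        my $robots_txt = $response->decoded_content;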
