PerlMonks
WWW:Mechanize bug?

by fraizerangus (Sexton)
on Oct 12, 2011 at 21:40 UTC ( id://931099 )

fraizerangus has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm working with WWW::Mechanize, which seems to be the right medicine for what I need, but I've already hit a snag. I'm only interested in following the 'motion.cgi' links and extracting them as text documents, but the regex I've used only finds the first two links. Does anybody have any ideas about what's going on?

    #!/usr/bin/perl
    use strict;
    use WWW::Mechanize;
    use Storable;

    my $mech_cgi = WWW::Mechanize->new;
    $mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );

    my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi?/ );

    for ( my $i = 0; $i < @cgi_links; $i++ ) {
        print "following link: ", $cgi_links[$i]->url, "\n";
        $mech_cgi->follow_link( url => $cgi_links[$i]->url )
            or die "Error following link ", $cgi_links[$i]->url;
    }
best wishes

Dan

Replies are listed 'Best First'.
Re: WWW:Mechanize bug?
by jethro (Monsignor) on Oct 12, 2011 at 23:10 UTC

    One small bug is that the '?' in your regex is special, meaning 0 or 1 occurrences of the previous character, in this case 'i'. You probably want '\?' instead. The same goes for '.', which matches any character and should be '\.'. But that can't be your problem, because as written the regex is more general than the correct one would be.
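    To see the difference concretely, here is a small offline sketch; the sample strings are made up for illustration, not taken from the real site:

    ```perl
    # The unescaped pattern: '.' matches any character, '?' makes the 'i' optional.
    my $loose  = qr/motion.cgi?/;
    # The escaped pattern: literal '.' and literal '?'.
    my $strict = qr/motion\.cgi\?/;

    for my $s ( 'motionXcgZZZ', 'motion.cgi?ID=abc' ) {
        printf "%-20s loose=%-8s strict=%s\n",
            $s,
            ( $s =~ $loose  ? 'match' : 'no match' ),
            ( $s =~ $strict ? 'match' : 'no match' );
    }
    ```

    The loose pattern happily matches 'motionXcgZZZ' (the '.' swallowed the 'X' and the optional 'i' was dropped), while the strict pattern only matches real motion.cgi query URLs.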

    You could remove the 'url_regex' parameter to test whether you get all the links when there are no restrictions at all. Then use a url_regex of qr/mot/ and slowly add to the regex until your links are no longer found.
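    That narrowing-down process can be sketched offline against a few sample URL strings (the URLs here are hypothetical, not taken from the real site):

    ```perl
    # Start broad, then tighten the pattern step by step and watch
    # where the count of matched links drops.
    my @urls = (
        'http://example.org/cgi-bin/motion.cgi?ID=abc',
        'http://example.org/cgi-bin/motion.cgi?ID=def',
        'http://example.org/cgi-bin/browse.cgi?group=x',
    );

    for my $pat ( qr/cgi/, qr/motion/, qr/motion\.cgi\?ID=/ ) {
        my @hits = grep { /$pat/ } @urls;
        print "$pat -> ", scalar(@hits), " link(s)\n";
    }
    ```

    The first pattern that loses links you expected to keep is the one that is too strict.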

Re: WWW:Mechanize bug?
by Anonymous Monk on Oct 13, 2011 at 03:27 UTC

    the regex I've used only finds the first 2 links? Anybody any ideas on what's going on?

    You're confused about how a browser works:

    you GET first/url

    you collect the list of links from first/url

    following the first link takes you to second/url

    second/url no longer contains the links from first/url, so the next follow_link has nothing to follow

    Either rewind the browser (back), or use get, not follow_link.

      Monks

      Thanks so much for the help! I did get it working; however, it only fetches the first 7 links before this error message appears:

      Internal Server Error at newp line 14

      Using the following code:

      #!/usr/bin/perl
      use strict;
      use WWW::Mechanize;
      use Storable;

      my $mech_cgi = WWW::Mechanize->new;
      $mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );

      my @cgi_links = $mech_cgi->find_all_links( url_regex => qr/motion.cgi/ );

      for ( my $i = 0; $i < @cgi_links; $i++ ) {
          print "following link: ", $cgi_links[$i]->url, "\n";
          $mech_cgi->follow_link( url => $cgi_links[$i]->url )
              or die "Error following link ", $cgi_links[$i]->url;
          $mech_cgi->back;
      }

      Is this a fault with their server or with my script?

      many thanks and best wishes

      Dan

        is this a fault with their server or my script?

        Can't say; that error message isn't very informative.

        Try

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech_cgi = WWW::Mechanize->new( autocheck => 1 );
        $mech_cgi->show_progress(1);
        $mech_cgi->get( 'http://www.molmovdb.org/cgi-bin/browse.cgi' );

        my @Motion = $mech_cgi->find_all_links( url_regex => qr/motion.cgi/ );
        @Motion = map { $_->url_abs() } @Motion;

        for my $link ( @Motion ) {
            eval {
                $mech_cgi->get( $link );
                1;
            } or warn $@, "\n", $mech_cgi->res->as_string, "\n", '#' x 33, "\n\n";
            $mech_cgi->back;
        }
        __END__
        And you'll get something more informative:

        ** GET http://www.molmovdb.org/cgi-bin/browse.cgi ==> 202 OK
        ...
        ** GET http://..../4040404 ==> 404 Not Found
        Error GETing http://..../4040404: Not Found at somefile.pl line 12
        HTTP/1.1 404 Not Found
        Connection: close
        Date: Thu, 13 Oct 2011 23:01:51 GMT
        ...
        Content-Length: 3942
        Content-Type: text/html
        Client-Date: Thu, 13 Oct 2011 23:05:18 GMT
        ...
        Title: blah blah blah
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
        <html>
        ....

Node Type: perlquestion [id://931099]
Approved by Corion