PerlMonks  

Re: stuck with WWW::Mechanize drop down list

by spazm (Monk)
on Jun 02, 2012 at 00:29 UTC


in reply to stuck with WWW::Mechanize drop down list

The dropdown selector uses JavaScript to reload the page. It's dorky:
<select id="_size" name="size"
        onchange="var s=sURL + '&size=' + this.value; document.location.href=s">
  <option value="all">All</option>
  <option value="50">50</option>
  <option value="100">100</option>
</select>
We can simulate this by appending "&size=all" to the URL. We'll do this by setting an extra field entry:
$browser->field( 'size', 'all' );
Example:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie qw/ open close /;
use 5.012;
use WWW::Mechanize;

# create WWW::Mechanize object
# autocheck => 1 checks each request to ensure it was successful
my $browser = WWW::Mechanize->new( autocheck => 1 );

# retrieve page
$browser->get('http://www.ncbi.nlm.nih.gov/Traces/wgs/');

# select form to fill based on mech-dump output
$browser->form_number(1);

# fill field 'term' with name of species
$browser->field( 'term', 'Escherichia' );
$browser->field( 'size', 'all' );

# submit the form; submit() ignores arguments, so no button name is needed
$browser->submit();

my $url = $browser->uri;
print "url: $url\n";

# launch browser to test url
#system( 'firefox', $url );

print $browser->content();
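If you'd rather build the URL yourself than go through the form, here is a rough sketch that mirrors what the onchange handler does (sURL + '&size=' + value). The helper name url_with_size and the example.com URL are made up for illustration; the real value of the page's sURL variable isn't shown in this thread, so treat the base URL as an assumption.

```perl
use strict;
use warnings;

# Mirror the page's onchange handler: sURL + '&size=' + this.value.
# $base_url stands in for the JavaScript variable sURL (hypothetical here).
# The size values used on the page (all, 50, 100) need no URL escaping.
sub url_with_size {
    my ( $base_url, $size ) = @_;

    # Use '?' if the URL has no query string yet, '&' otherwise.
    my $sep = $base_url =~ /\?/ ? '&' : '?';
    return $base_url . $sep . 'size=' . $size;
}

my $url = url_with_size( 'http://www.example.com/list?term=Escherichia', 'all' );
print "$url\n";
```

You could then hand the result to $browser->get($url) instead of submitting the form.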

Replies are listed 'Best First'.
Re^2: stuck with WWW::Mechanize drop down list
by spazm (Monk) on Jun 02, 2012 at 01:02 UTC
    Now that you have the full list, you'd like to follow the "Download as TAB delimited list" link. In your browser, following the link saves a file. In Mechanize, the response is just more content.

    If you want to be clever, you can get the filename from LWP's HTTP::Response and use it as the name of the file to write.

    $browser->follow_link( text_regex => qr/Download as TAB/i );
    print $browser->content();    # prints TAB delimited file to STDOUT

    $browser->follow_link( text_regex => qr/Download as TAB/i );
    if ( my $filename = $browser->res->filename ) {
        die "file already exists [$filename]" if -e $filename;
        print STDERR "Saving downloaded file to [$filename]\n";
        open my $fh, ">", $filename;
        print $fh $browser->content;
        close $fh;
    }

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie qw/ open close /;
    use 5.012;
    use WWW::Mechanize;

    # create WWW::Mechanize object
    # autocheck => 1 checks each request to ensure it was successful
    my $browser = WWW::Mechanize->new( autocheck => 1 );

    # retrieve page
    $browser->get('http://www.ncbi.nlm.nih.gov/Traces/wgs/');

    # select form to fill based on mech-dump output
    $browser->form_number(1);

    # fill field 'term' with name of species
    $browser->field( 'term', 'Escherichia' );
    $browser->field( 'size', 'all' );

    # submit the form; submit() ignores arguments, so no button name is needed
    $browser->submit();

    my $url = $browser->uri;
    print "url: $url\n";

    $browser->follow_link( text_regex => qr/Download as TAB/i );
    #print $browser->content();    # prints TAB delimited file to STDOUT

    # save the download under the filename suggested by the response
    if ( my $filename = $browser->res->filename ) {
        die "file already exists [$filename]" if -e $filename;
        print STDERR "Saving downloaded file to [$filename]\n";
        open my $fh, ">", $filename;
        print $fh $browser->content;
        close $fh;
    }
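One thing the script above glosses over: if the response suggests no filename, nothing gets saved. Below is a small offline sketch of one way to handle that, using a hand-built HTTP::Response in place of $browser->res. The helper local_filename, the default name download.tab, and the header value wgs_list.tab are all invented for the example, not taken from the NCBI site.

```perl
use strict;
use warnings;
use HTTP::Response;
use File::Basename qw(basename);

# Pick a local name for a response: use the server's suggestion when there
# is one (with any directory part stripped), else a caller-supplied default.
sub local_filename {
    my ( $res, $default ) = @_;
    my $name = $res->filename;
    return $default unless defined $name && length $name;
    return basename($name);
}

# Hand-built response standing in for $browser->res (hypothetical header).
my $with_name = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Disposition' => 'attachment; filename="wgs_list.tab"' ],
    "prefix\torganism\n",
);

# A response with no Content-Disposition, Content-Location, or request URL.
my $without_name = HTTP::Response->new( 200, 'OK', [], '' );

print local_filename( $with_name,    'download.tab' ), "\n";
print local_filename( $without_name, 'download.tab' ), "\n";
```

In a live script you would pass $browser->res and then write $browser->content to the returned name. WWW::Mechanize also has a save_content method that may save you the open/print/close.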

      Spazm, thanks much, especially for the explanations!

      Question. If mech-dump doesn't output content of drop down lists, do I always need to look at the page source and, if so, then add the selection as a 'field' entry?

        I was just about to suggest mech-dump, good that you are already using it!

        Mechanize will only return form controls that sit inside <form></form> tags.

        The "All" dropdown is not within a set of form tags; it directly triggers JavaScript to reload the page. In cases like this you just have to figure out what the script is doing and duplicate it, possibly just by inspecting the request URL submitted by the browser.
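To check whether a control is really inside a form, you can parse the page with HTML::Form (the module Mechanize uses for forms) and see what turns up. A minimal sketch follows; the HTML is a cut-down stand-in modeled on the snippet quoted earlier in the thread, not the real NCBI page, and the form action and button name are invented.

```perl
use strict;
use warnings;
use HTML::Form;

# Cut-down, hypothetical page: one form, plus a select outside of it.
my $html = <<'HTML';
<html><body>
<form action="/search" method="get">
  <input type="text" name="term">
  <input type="submit" name="go" value="Apply">
</form>
<select id="_size" name="size"
        onchange="var s=sURL + '&size=' + this.value; document.location.href=s">
  <option value="all">All</option>
  <option value="50">50</option>
</select>
</body></html>
HTML

# In list context, parse() returns one HTML::Form object per <form> element.
my @forms = HTML::Form->parse( $html, 'http://www.example.com/' );

# List the named inputs each form knows about.
for my $form (@forms) {
    my @names = grep { defined } map { $_->name } $form->inputs;
    print "form inputs: @names\n";
}

# Count the forms that contain a field called 'size'.
my $has_size = grep { defined $_->find_input('size') } @forms;
print $has_size
    ? "size is a form field\n"
    : "size is outside every form\n";
```

If the field is missing from every form, that's your cue to read the page source (or watch the browser's requests) and rebuild the URL by hand.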

        This is an area where scraping pages becomes tedious and tricky.
