PerlMonks

Screen scraper

by MKevin (Novice)
on Jan 10, 2007 at 08:39 UTC

MKevin has asked for the wisdom of the Perl Monks concerning the following question:

I need to create a screen scraper to get data off of a website. Currently I have:
my $index = shift;
my $mech  = WWW::Mechanize->new();
$mech->get( "http://vortex.plymouth.edu/uacalplt-u.html" );
my @links = $mech->links;
for my $link ( $mech->links ) {
    # do whatever
}
I don't know what to do, script-wise, in the "do whatever" part. I am a beginner, and yes, I have picked a tough thing to start on, but I need help. What I would like is this: given an $index like KMIA (the identifier), plus Sounding data (text), 2005 (year), Aug (mo), 24 (day), 0z (hr), parcel, and 640x480 (size), the script should go to the resulting link, take the text of that page, and save it into a file named $index_24_0z.dat. For those who are observant, KMIA is Miami, and 8/24/2005 is the day before Hurricane Katrina. I need this scraper for my research on Hurricane Katrina.
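Here is roughly what I imagine the script should do. I am only guessing the form field names (id, yy, mm, dd, hh, pt, size, pl) from the link the form produces, so treat this as a sketch, not working code:

use strict;
use warnings;
use WWW::Mechanize;

my $index = shift or die "Usage: $0 STATION_ID\n";   # e.g. KMIA

my $mech = WWW::Mechanize->new();
$mech->get('http://vortex.plymouth.edu/uacalplt-u.html');

# guessed field names, taken from the URL the form generates
$mech->submit_form(
    form_number => 1,
    fields      => {
        id   => $index,
        pl   => 'none',
        yy   => '05',
        mm   => '08',
        dd   => '24',
        hh   => '00',      # 0z
        pt   => 'parcel',
        size => '640x480',
    },
);

# save the returned text under the name described above, e.g. KMIA_24_0z.dat
open my $out, '>', "${index}_24_0z.dat" or die "Can't write output file: $!";
print {$out} $mech->content;
close $out;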

Replies are listed 'Best First'.
Re: Screen scraper
by shonorio (Hermit) on Jan 10, 2007 at 10:11 UTC
    MKevin,

    You can use HTTP::Recorder to help you create a Mechanize script, as described in the Web Testing with HTTP::Recorder article.

    The code below does what you want. Don't forget to print the output to a file, which I didn't do.

    use WWW::Mechanize;

    my $agent = WWW::Mechanize->new();
    $agent->get('http://vortex.plymouth.edu/uacalplt-u.html');

    # fill in the first form on the page
    $agent->form_number(1);
    $agent->field('pt',   'parcel');
    $agent->field('mm',   '08');
    $agent->field('dd',   '24');
    $agent->field('pl',   'none');
    $agent->field('size', '640x480');
    $agent->field('yy',   '05');
    $agent->field('id',   'KMIA');
    $agent->field('hh',   '00');
    $agent->click();

    print $agent->content;
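    To actually write that output to the file name MKevin asked for (e.g. KMIA_24_0z.dat), one small addition, assuming the content method returns the fetched page, would be something like:

    # save the fetched sounding text instead of printing it to the screen
    open my $out, '>', 'KMIA_24_0z.dat' or die "Can't open output file: $!";
    print {$out} $agent->content;
    close $out;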

    Solli Moreira Honorio
    Sao Paulo - Brazil

Re: Screen scraper
by bart (Canon) on Jan 10, 2007 at 12:07 UTC
    In this particular case, you don't need to work this way. I entered your data and ended up with this URL:
    http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480
    This produces something that looks like a plain text file, both in the browser and in the page source, except that it starts with two extra lines of HTML tags, ending in a "PRE" tag.

    Just download this file, for example with LWP, and you're ready to start parsing.

    use LWP::Simple;

    getstore(
        'http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480',
        'preKatrinaData.txt'
    );
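    If those two leading lines of HTML get in the way of parsing, one possible way to strip everything up to and including the opening PRE tag before saving (just a sketch, assuming that tag appears only once, at the top) is:

    use strict;
    use warnings;
    use LWP::Simple;

    my $url = 'http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi'
            . '?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480';

    my $content = get($url);
    die "Couldn't fetch $url" unless defined $content;

    # drop the leading HTML, up to and including the opening <PRE> tag
    $content =~ s/\A.*?<PRE[^>]*>\s*//si;

    open my $out, '>', 'preKatrinaData.txt' or die "Can't write: $!";
    print {$out} $content;
    close $out;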
screen scraper help
by MKevin (Novice) on Jan 10, 2007 at 05:50 UTC
    OK, below is a copy of the code I have so far... I need to know if I am on the right track, and I need to know one last thing: how to extract the data from the site after opening its links and save it as a .dat file on Linux.
    ------------------------------------------------------------
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $index = shift;

    #
    ## assuring that the site still exists
    #
    my $base    = "http://vortex.plymouth.edu/uacalplt-u.html";
    my $content = get($base);
    die "Couldn't get it!" unless defined $content;
    print "Found:\n$content\n";

    #
    ## fetching radiosonde data from web
    #
    my @urls;
    my @hr = ('00', '12');
    foreach my $hr (@hr) {
        my ($url) = $content =~ m{
            http://vortex\.plymouth\.edu/cgi-bin/gen_uacalplt-u\.cgi\?id=${index}&pl=none&yy=05&mm=08&dd=24&hh=${hr}&pt=parcel&size=640x480
        }smx;
        push @urls, $url if defined $url;
    }
    print "URLs found: @urls\n";
    ------------------------------------------------------------
    As you can tell from the code, http://vortex.plymouth.edu/uacalplt-u.html is the base site. From there you type in the data:

        KMIA (index for radiosonde data for Miami)
        Sounding data (text)  (scroll down)
        2005 (year)
        Aug (mo)
        24 (day)
        0z (hr)
        parcel
        640x480 (size)

    to get the following link:

        http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480

    My logic was to open up the base site first and confirm that it still exists, hence the print. Then, using the foreach (to do the 00z and 12z hours) and the command-line index ($index = shift;), open up this link. Now I would like to save all the data above "Sounding variables and indices" into a text file titled "$index_2005_237_$hr.dat". My question is how do I do that, as sketched below. I would greatly appreciate your help. Please email me back as soon as possible.
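    This is the overall flow I have in mind, though I am not sure the details are right (I am assuming the direct CGI link above works for both hours and that the phrase "Sounding variables and indices" appears verbatim in the returned text):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;

    my $index = shift or die "Usage: $0 STATION_ID\n";   # e.g. KMIA

    for my $hr ('00', '12') {
        my $url = "http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi"
                . "?id=$index&pl=none&yy=05&mm=08&dd=24&hh=$hr&pt=parcel&size=640x480";

        my $content = get($url);
        unless ( defined $content ) {
            warn "Couldn't fetch $url\n";
            next;
        }

        # keep only the data above the "Sounding variables and indices" line
        ($content) = split /Sounding variables and indices/, $content, 2;

        # e.g. KMIA_2005_237_00z.dat
        my $file = "${index}_2005_237_${hr}z.dat";
        open my $out, '>', $file or die "Can't write $file: $!";
        print {$out} $content;
        close $out;
    }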
      You are doing what WWW::Mechanize was designed for. It fetches pages and parses them, so you can easily do something like:
      my $mech = WWW::Mechanize->new();
      $mech->get( "http://somesite.com" );
      my @links = $mech->links;
      for my $link ( $mech->links ) {
          # do whatever
      }
      WWW::Mechanize will be your friend. Trust me.

      xoxo,
      Andy


      Read the documentation for the module you're already using.

      use LWP::Simple 'getstore';

      getstore( $url, $filename );

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊
