PerlMonks

Screen scraper

by MKevin (Novice)
on Jan 10, 2007 at 08:39 UTC

MKevin has asked for the wisdom of the Perl Monks concerning the following question:

I need to create a screen scraper to get data off of a website. Currently I have:
my $index = shift;
my $mech  = WWW::Mechanize->new();
$mech->get( "http://vortex.plymouth.edu/uacalplt-u.html" );
my @links = $mech->links;
for my $link ( $mech->links ) {
    # do whatever
}
I don't know what to do, script-wise, in the "do whatever" part. I am a beginner, and yes, I have picked a tough thing to start on, but I need help. What I would like is this: given an $index like KMIA (the identifier), plus Sounding data (text), 2005 (year), Aug (mo), 24 (day), 0z (hr), parcel, and 640x480 (size), the script should go to the resulting link, take the text of that page, and save it into a file named $index_24_0z.dat. For those who are observant, KMIA is Miami, and 8/24/2005 is the day before Hurricane Katrina. I need this scraper for my research on Hurricane Katrina.
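Here is roughly what I imagine the script should do. I am only guessing the form field names (id, yy, mm, dd, hh, pt, size, pl) from the link the form produces, so treat this as a sketch, not working code:

use strict;
use warnings;
use WWW::Mechanize;

my $index = shift or die "Usage: $0 STATION_ID\n";   # e.g. KMIA

my $mech = WWW::Mechanize->new();
$mech->get('http://vortex.plymouth.edu/uacalplt-u.html');

# guessed field names, taken from the URL the form generates
$mech->submit_form(
    form_number => 1,
    fields      => {
        id   => $index,
        pl   => 'none',
        yy   => '05',
        mm   => '08',
        dd   => '24',
        hh   => '00',      # 0z
        pt   => 'parcel',
        size => '640x480',
    },
);

# save the returned text under the name described above, e.g. KMIA_24_0z.dat
open my $out, '>', "${index}_24_0z.dat" or die "Can't write output file: $!";
print {$out} $mech->content;
close $out;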

Replies are listed 'Best First'.
Re: Screen scraper
by shonorio (Hermit) on Jan 10, 2007 at 10:11 UTC
    MKevin,

    You can use HTTP::Recorder to help you create a Mechanize script, as described in the Web Testing with HTTP::Recorder article.

    The code below does what you want. Don't forget to print the output to a file, which I didn't do.

    use WWW::Mechanize;

    my $agent = WWW::Mechanize->new();
    $agent->get('http://vortex.plymouth.edu/uacalplt-u.html');

    # fill in the first form on the page
    $agent->form_number(1);
    $agent->field('pt',   'parcel');
    $agent->field('mm',   '08');
    $agent->field('dd',   '24');
    $agent->field('pl',   'none');
    $agent->field('size', '640x480');
    $agent->field('yy',   '05');
    $agent->field('id',   'KMIA');
    $agent->field('hh',   '00');
    $agent->click();

    print $agent->content;
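    To actually write that output to the file name MKevin asked for (e.g. KMIA_24_0z.dat), one small addition, assuming the content method returns the fetched page, would be something like:

    # save the fetched sounding text instead of printing it to the screen
    open my $out, '>', 'KMIA_24_0z.dat' or die "Can't open output file: $!";
    print {$out} $agent->content;
    close $out;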

    Solli Moreira Honorio
    Sao Paulo - Brazil

Re: Screen scraper
by bart (Canon) on Jan 10, 2007 at 12:07 UTC
    In this particular case, you don't need to work this way. I entered your data and ended up with this URL:
    http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480
    This produces something that looks like a plain text file, both in the browser and in the page source, except that it starts with two extra lines of HTML tags, ending in a "PRE" tag.

    Just download this file, for example with LWP, and you're ready to start parsing.

    use LWP::Simple;

    getstore(
        'http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480',
        'preKatrinaData.txt'
    );
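    If those two leading lines of HTML get in the way of parsing, one possible way to strip everything up to and including the opening PRE tag before saving (just a sketch, assuming that tag appears only once, at the top) is:

    use strict;
    use warnings;
    use LWP::Simple;

    my $url = 'http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi'
            . '?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480';

    my $content = get($url);
    die "Couldn't fetch $url" unless defined $content;

    # drop the leading HTML, up to and including the opening <PRE> tag
    $content =~ s/\A.*?<PRE[^>]*>\s*//si;

    open my $out, '>', 'preKatrinaData.txt' or die "Can't write: $!";
    print {$out} $content;
    close $out;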
screen scraper help
by MKevin (Novice) on Jan 10, 2007 at 05:50 UTC
    OK, below is a copy of the code I have so far... I need to know if I am on the right track, and I need to know one last thing: how to extract the data from the site after opening its links and save it as a .dat file on Linux.
    ------------------------------------------------------------
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $index = shift;

    #
    ## assuring that the site still exists
    #
    my $base    = "http://vortex.plymouth.edu/uacalplt-u.html";
    my $content = get($base);
    die "Couldn't get it!" unless defined $content;
    print "Found:\n$content\n";

    #
    ## fetching radiosonde data from web
    #
    my @urls;
    my @hr = ('00', '12');
    foreach my $hr (@hr) {
        my ($url) = $content =~ m{
            http://vortex\.plymouth\.edu/cgi-bin/gen_uacalplt-u\.cgi\?id=${index}&pl=none&yy=05&mm=08&dd=24&hh=${hr}&pt=parcel&size=640x480
        }smx;
        push @urls, $url if defined $url;
    }
    print "URLs found: @urls\n";
    ------------------------------------------------------------
    As you can tell from the code, http://vortex.plymouth.edu/uacalplt-u.html is the base site. From there you type in the data:

        KMIA (index for radiosonde data for Miami)
        Sounding data (text)  (scroll down)
        2005 (year)
        Aug (mo)
        24 (day)
        0z (hr)
        parcel
        640x480 (size)

    to get the following link:

        http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi?id=KMIA&pl=none&yy=05&mm=08&dd=24&hh=00&pt=parcel&size=640x480

    My logic was to open up the base site first and confirm that it still exists, hence the print. Then, using the foreach (to do the 00z and 12z hours) and the command-line index ($index = shift;), open up this link. Now I would like to save all the data above "Sounding variables and indices" into a text file titled "$index_2005_237_$hr.dat". My question is how do I do that, as sketched below. I would greatly appreciate your help. Please email me back as soon as possible.
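    This is the overall flow I have in mind, though I am not sure the details are right (I am assuming the direct CGI link above works for both hours and that the phrase "Sounding variables and indices" appears verbatim in the returned text):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;

    my $index = shift or die "Usage: $0 STATION_ID\n";   # e.g. KMIA

    for my $hr ('00', '12') {
        my $url = "http://vortex.plymouth.edu/cgi-bin/gen_uacalplt-u.cgi"
                . "?id=$index&pl=none&yy=05&mm=08&dd=24&hh=$hr&pt=parcel&size=640x480";

        my $content = get($url);
        unless ( defined $content ) {
            warn "Couldn't fetch $url\n";
            next;
        }

        # keep only the data above the "Sounding variables and indices" line
        ($content) = split /Sounding variables and indices/, $content, 2;

        # e.g. KMIA_2005_237_00z.dat
        my $file = "${index}_2005_237_${hr}z.dat";
        open my $out, '>', $file or die "Can't write $file: $!";
        print {$out} $content;
        close $out;
    }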
      You are doing what WWW::Mechanize was designed for. It fetches pages and parses them, so you can easily do something like:
      my $mech = WWW::Mechanize->new();
      $mech->get( "http://somesite.com" );
      my @links = $mech->links;
      for my $link ( $mech->links ) {
          # do whatever
      }
      WWW::Mechanize will be your friend. Trust me.

      xoxo,
      Andy


      Read the documentation for the module you're already using.

      use LWP::Simple 'getstore';

      getstore( $url, $filename );

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊
