PerlMonks  

Sucking Data off a Web Page

by awohld (Hermit)
on Oct 10, 2004 at 05:19 UTC ( [id://397956]=perlquestion )

awohld has asked for the wisdom of the Perl Monks concerning the following question:

There's a site, http://www.speedtrap.org/speedtraps/ste.asp?state=AK&city=all, that I want to get the speed trap info from. The data is in the form of an HTML table:

<table border="1" width="100%" cellpadding="1" cellspacing="1"><br>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">Jurisdiction (city, county, etc.):</font></td> <td width="66%" bgcolor="#FFFFFF"><font size="5" face="Arial"><b><font size="3">Anchorage, Alaska</font></b></font></td> </tr>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">Speed Trap Location:</font></td> <td width="66%" bgcolor="#FFFFFF"><font face="Arial" size="-1">Minnesota Parkway</font></td> </tr>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">Nearest Reference Point:</font></td> <td width="66%" bgcolor="#FFFFFF"><font face="Arial" size="-1">12th Avenue</font></td> </tr>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">GPS Coordinates:</font></td> <td width="66%" bgcolor="#FFFFFF"> n/a </td> </tr>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">Time of Day:</font></td> <td width="66%" bgcolor="#FFFFFF"><font face="Arial" size="-1">Any time of day</font></td> </tr>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">Level of Enforcement:</font></td> <td width="66%" bgcolor="#FFFFFF"><font face="Arial" size="-1">Some</font></td> </tr>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">Type of Enforcement:</font></td> <td width="66%" bgcolor="#FFFFFF"><font face="Arial" size="-1">Radar</font></td> </tr>
  <tr> <td width="34%" bgcolor="#CCFFFF"><font face="Arial" size="2">Date:</font></td> <td width="66%" bgcolor="#FFFFFF"><font face="Arial" size="-1">4/2004</font></td> </tr>
  <tr bgcolor="#FFFFFF"> <td colspan="2">Before entering downtown on Minnesota, an officer will be waiting on 11th ave. when you come around the corner into a 25mph speedlimit from a 45mph.This is the place I received my only ticket in five years and it was for 21mph over.
  </td> </tr>
  <tr> <td colspan="2" bgcolor="#E6ECF0"><a href="comments.asp?state=AK&city=all&st=20945"><font color="#2F2CFF">Click here to view the 1 comment about this speed trap.</font></a></td> </tr>
  <tr> <td colspan="2" bgcolor="#E6ECF0"> <a href="comment.asp?state=AK&city=all&st=20945"><font color="#2F2CFF">Agree/Disagree? Add your comment.</font></a> </td> </tr>
</table>

There are many records on many different pages that I want to archive. I want to automate a script to put all of the records on a page into a comma-delimited text file.

I don't know where to start. Can someone give me some direction on this?

Thanks
Adam

20041010 Janitored by Corion: Put Table HTML into CODE tags

janitored by ybiC: Balanced <readmore> tags around codeblock

Replies are listed 'Best First'.
Re: Sucking Data off a Web Page
by Corion (Patriarch) on Oct 10, 2004 at 08:55 UTC

    mojotoad wrote a really great scraping module, HTML::TableExtract, which easily scrapes an HTML table into an array of arrays. You can then convert that to a CSV file, or stuff it into DBI directly. For example, the following code tries to extract all rows from "the one table" on the page:

    use HTML::TableExtract;

    # $html holds the source of the fetched page
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    foreach my $row ($te->rows) {
        print join(',', @$row), "\n";
    }

    The only problem with your table is that it is organized in rows rather than in columns, so you will have to flip the table.

    Update: I realized that it was mojotoad, not jeffa who wrote HTML::TableExtract.
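
    The flip could look roughly like the sketch below. It assumes each record arrives as a list of [label, value] rows, which is the shape the table in the question suggests, and it keeps only the value cells. The record_to_csv helper and the sample rows are made up for illustration. Rows without a second cell (the colspan rows holding the free-text description and the comment links) are simply skipped here.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one record's label/value rows into a CSV line.
# Rows without a defined second cell (e.g. colspan rows) are skipped.
sub record_to_csv {
    my @rows = @_;
    my @values;
    for my $row (@rows) {
        my ($label, $value) = @$row;
        next unless defined $value;
        $value =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        $value =~ s/"/""/g;          # CSV convention: double embedded quotes
        push @values, qq{"$value"};
    }
    return join ',', @values;
}

# Sample rows modeled on the table in the question (illustrative only).
my @rows = (
    [ 'Jurisdiction (city, county, etc.):', 'Anchorage, Alaska' ],
    [ 'Speed Trap Location:',               'Minnesota Parkway' ],
    [ 'Nearest Reference Point:',           '12th Avenue'       ],
    [ 'GPS Coordinates:',                   ' n/a '             ],
    [ 'Free-text description spanning both columns' ],
);

print record_to_csv(@rows), "\n";
```

    Quoting every field keeps values such as "Anchorage, Alaska" from being split at the embedded comma.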

      I'm just wondering whether that module handles colspan and rowspan cells well... I'm not saying it's a bad module if it cannot; rather, I think it is a pretty good one if it can.




      "2b"||!"2b";$$_="the question"
      Besides that, my code is untested unless stated otherwise.
      One more: please review the article about regular expressions (do's and don'ts) I'm working on.

        Just to resolve doubts, HTML::TableExtract does handle colspan/rowspan correctly. Quoting the POD:

        Furthermore, TableExtract will automatically compensate for cell span issues so that columns are really the same columns as you would visually see in a browser.
Re: Sucking Data off a Web Page
by TheEnigma (Pilgrim) on Oct 10, 2004 at 05:32 UTC
    LWP::UserAgent, LWP::Simple and WWW::Mechanize are three good modules on CPAN that can get you started scraping data. Read the docs on them and see which one works best for your application. CPAN has many more besides those, too, in case those don't do what you want.

    TheEnigma

Re: Sucking Data off a Web Page
by tachyon (Chancellor) on Oct 10, 2004 at 06:14 UTC

    Note that this may be against their terms of service and that you may cripple their server if you are not careful to limit your request rate. For good luck here are 7 lines to get you started.

    use LWP::Simple;
    my $data = get( "http://www.speedtrap.org/speedtraps/ste.asp?city=all&state=AK" );
    for my $chunk ( split /<table border="1" width="100%"/, $data ) {
        next unless $chunk =~ m/Jurisdiction/;
        my @data = $chunk =~ m!<td width="66%" bgcolor="#FFFFFF">\s*(.*?)\s*</td>!gs;
        my $csv = join ',', map{ s!</?[^>]+>!!g; s!"!\\"!g; qq!"$_"! } @data;
        print "$csv\n";
    }
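
    To act on the request-rate warning when walking many pages, a loop along these lines pauses between fetches. It is only a sketch: the fetch_politely name, the one-second delay, the second state code and the stub fetcher are all assumptions made for illustration. For real use the stub would be replaced with LWP::Simple's get.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch->($url) for each URL, sleeping
# $delay seconds between requests (but not before the first one).
sub fetch_politely {
    my ($fetch, $delay, @urls) = @_;
    my %pages;
    my $first = 1;
    for my $url (@urls) {
        sleep $delay unless $first;
        $first = 0;
        $pages{$url} = $fetch->($url);
    }
    return \%pages;
}

# Stub fetcher so the sketch runs without touching the network;
# swap in \&LWP::Simple::get (after "use LWP::Simple;") for real use.
my $stub = sub { my ($url) = @_; return "<html>page for $url</html>" };

my $base  = 'http://www.speedtrap.org/speedtraps/ste.asp?city=all&state=';
my $pages = fetch_politely($stub, 1, map { $base . $_ } qw(AK AL));
print scalar(keys %$pages), " pages fetched\n";
```

    Passing the fetcher in as a code reference keeps the pacing logic separate from the HTTP module, so the same loop works with whichever of the LWP modules ends up being used.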

    cheers

    tachyon

Re: Sucking Data off a Web Page
by Fletch (Bishop) on Oct 10, 2004 at 23:11 UTC
Re: Sucking Data off a Web Page
by DrHyde (Prior) on Oct 11, 2004 at 08:36 UTC

Node Type: perlquestion [id://397956]
Approved by Zaxo
Front-paged by NetWallah