PerlMonks  

Lotto table extraction

by LordSoahc (Initiate)
on Jan 24, 2012 at 19:48 UTC

LordSoahc has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a script to extract winning lotto numbers from an HTML web page. However, when it came time to extract them, I noticed that the page has no headers (e.g. date, numbers, etc.). I need assistance with writing this code, please! Here is my code so far.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Parser ();
use HTML::TableExtract;

my $data = get('http://www.flalottery.com/exptkt/c3.htm');

How do I extract the data from the above website if there are no clearly marked headers? I know the usual syntax should look like this:

my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
$te->parse($html_string);

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}

So how do I write the code to extract the data from this page when I have no idea whether it has headers for the numbers?

Replies are listed 'Best First'.
Re: Lotto table extraction
by mojotoad (Monsignor) on Jan 24, 2012 at 21:08 UTC
    Those are some pretty nasty tables (50 or so of them). All sorts of empty rows and cells embedded throughout. In cases like these, it's better to extract all tables and filter based on inspecting particular cells. For example:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TableExtract;

    my $data = get('http://www.flalottery.com/exptkt/c3.htm');

    my $te = HTML::TableExtract->new;
    $te->parse($data);

    for my $t ($te->tables) {
        my $rc = -1;
        my ($d, $c) = $t->coords;
        for my $r ($t->rows) {
            ++$rc;
            @$r = map  { s/^[^a-z0-9]//i; $_ }
                  grep { /[a-z0-9]/i }
                  grep { defined $_ } @$r;
            next unless @$r && $r->[0] =~ m/^\d+\/\d+\/\d+$/;
            print "row $d:$c:$rc: ", join(':', @$r), "\n";
        }
    }
    The grep/grep/map part eliminates empty cells and gets rid of the &nbsp; entities that precede the M/E indicators. The 'next' statement afterwards eliminates empty rows and non-dated rows. This is a shotgun approach. You could easily filter each row using specific column indexes, for example.
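As a rough illustration of the column-index idea, here is a minimal sketch. The column positions (`$DATE_COL`, `@NUMBER_COLS`), the helper name `extract_draw`, and the sample rows are all hypothetical; the real page's layout would have to be inspected first (for instance by printing each raw row) before picking indexes. The rows stand in for the array refs that `$t->rows` returns, so the snippet runs without fetching the page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical column positions; verify them against the real page.
my $DATE_COL    = 0;
my @NUMBER_COLS = (1, 2, 3);

# Return [date, n1, n2, n3] for a dated row, or nothing for any other row.
sub extract_draw {
    my ($row) = @_;
    my $date = $row->[$DATE_COL];
    return unless defined $date && $date =~ m{^\s*(\d+/\d+/\d+)\s*$};
    my @out = ($1);
    for my $i (@NUMBER_COLS) {
        my $cell = $row->[$i];
        # Skip the whole row if an expected number cell is missing.
        return unless defined $cell && $cell =~ /(\d+)/;
        push @out, $1;
    }
    return \@out;
}

# Made-up sample rows in the shape that $t->rows would return.
my @rows = (
    [ undef, '', undef ],
    [ '01/23/12', ' 4', ' 7', ' 1' ],
    [ 'Header', 'x', 'y', 'z' ],
);

for my $r (@rows) {
    my $draw = extract_draw($r) or next;
    print join(':', @$draw), "\n";
}
```

In a real script the loop over `@rows` would be replaced by the `for my $r ($t->rows)` loop above. Addressing cells by index keeps empty cells in place rather than collapsing them, so the positions stay stable from row to row.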
Re: Lotto table extraction
by Anonymous Monk on Jan 24, 2012 at 20:27 UTC
