Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

HTML::TableExtract - ugly - is there better way?

by rtwolfe (Initiate)
on Apr 09, 2017 at 01:58 UTC ( [id://1187504]=perlquestion: print w/replies, xml ) Need Help??

rtwolfe has asked for the wisdom of the Perl Monks concerning the following question:

Am a beginner so be 'gentle'. Have built a program to pull data from a NASDAQ page for certain stocks. HTML::TableExtract looked like a slick way to go. Nasdaq will tweak this page occasionally and thought by manipulating headers and the count, I could easily keep the script working. And yes, this current script does accomplish my goal. But thought this was a 'teachable moment' to learn more about Perl.

I put this line in a windows batch file to dump the page data to a file for processing.

perl -e "use LWP::Simple; getprint('http://www.nasdaq.com/extended-tra +ding/premarket-mostactive.aspx')" >> nasdaq-stocks.txt
My biggest issue is that when I pull the HTML file data through HTML::TableExtract, there is a lot of clean up work to get the tab delimited format wanted. But to get there, I fell back to writing to files and parsing / substituting rows/lines until I got in the final format.

Here's what the final (good) output looks like.

AMRI $13.53 $14.75 9.02% 6,984 AUPH $6.59 $7.07 7.28% 632,035 ATEC $2.19 $2.30 5.02% 3,880 SBLK $12.13 $12.71 4.78% 10,123 OCLR $8.95 $9.29 3.80% 147,875 FRSH $5.79 $6 3.63% 6,100 KTOS $7.88 $8.16 3.55% 6,901 INCY $135.5 $139.75 3.14% 6,734 TVIX $35.4 $36.45 2.97% 234,847 OSUR $12.3 $12.65 2.85% 4,500

Here's my code - I tried to enter comments to explain what I was thinking. Am hoping there is a cleaner way to use HTML::TableExtract to get real close to the final tab delimtied file.

Assume pulling apart the characters between the open price and change percent, is pretty tricky but can't the rest of the fields get dropped directly to a tab delimited file without the extraneous junk?

use strict; use warnings; use HTML::TableExtract; #Get HTML file and set up headers for HTML::TableExtract my $doc = 'nasdaq-stocks.txt'; my $headers = ['Symbol', 'Last Sale*', 'Change Net / %', 'Share Volume +']; #table 4 is advances. Need to do again for 5 decliners my $table_extract = HTML::TableExtract->new(count => 4, headers => $he +aders); #parse the nasdaq-stocks.txt file and print to outup-temp.txt file #?? found this code. #Is the code below taking HTML loaded in string $table and #breaking into rows to print to a file??? $table_extract->parse_file($doc); my ($table) = $table_extract->tables; open (UPFILE, '>outup-temp.txt'); for my $row ($table->rows) { print UPFILE @$row, "\n"; } close(UPFILE); #tried to add the Substitutes below to the loop above #but failed miserably #.. am taking outup-temp.txt #and load the array @lines for removing junk in the loop below my $filename = 'outup-temp.txt' ; open my $fh , '<' , $filename or die "Cannot read '$filename': $!\n" ; my @lines = <$fh> ; close $fh ; # process the array @lines and remove some of the junk for ( @lines ) { s/^\s+// ; # No need for global substitution s/[\x0A\x0D]{3,}/\t/g; # 3 CR LF become a tab #double tab-change to one tab - never got this to work?? # s/[\x09]{2,}/\t/g; s/\$//g; # Substitute all dollar signs with nothing s/\x20/\t/g; # space becomes a tab # Change chars between open and change pct to tab s/\xC2\xA0\xE2\x96\xB2\xC2\xA0/\t/; } #write cleaned lines to outup-temp.txt open $fh , '>' , $filename or die "Cannot write '$filename': $!\n" ; print $fh @lines ; close $fh ; # now that we have some tab delimiters, use split to break out the # fields and calculate the closing price, then write to file my $stock; my $filler1; my $openpr; my $change; my $pct; my $vol; my $filler2; my $closepr; open (FILE, 'outup-temp.txt'); open STDOUT, '>', "outup.txt"; while (<FILE>) { chomp; ($stock,$filler1,$openpr,$change,$pct,$vol,$filler2) = split("\t") +; #calculate closing price from prior day for advancers $closepr = $openpr-$change; #add back $ signs - print tab delimited fields to file print "$stock\t\$$closepr\t\$$openpr\t$pct\t$vol\n"; } close(FILE); close (STDOUT);

Thanks in advance. Have googled many things from this website that helped to get my kludg-ie code working

Replies are listed 'Best First'.
Re: HTML::TableExtract - ugly - is there better way?
by NetWallah (Canon) on Apr 09, 2017 at 07:27 UTC
    Try this:
    use strict; use warnings; use HTML::TableExtract; #Get HTML file and set up headers for HTML::TableExtract my $doc = 'nasdaq-stocks.txt'; my $html = do{ local $/=undef; open my $f,"<", $doc or die $!;<$f>}; my $headers = ['Symbol', 'Last Sale*', 'Change Net / %', 'Share Volume +']; #table 4 is advances. Need to do again for 5 decliners my $table_extract = HTML::TableExtract->new(count => 4, headers => $he +aders); $table_extract->parse($html); print join (" \t",@$headers),"\n"; for my $r ($table_extract->rows()){ my @cols = map {/([\w\.]+)\W+([\w\.\%]*)/} @$r; print join ("\t",@cols), "\n"; }
    It would take a little work to put the "$" back in front of the "Last Sale*" amount, but this should get you started.

            ...it is unhealthy to remain near things that are in the process of blowing up.     man page for WARP, by Larry Wall

      Thanks NetWallah. Had not heard about Map before. Still a little confusing. Not sure what @$r 'is'. Assume is parsed version of @cols. Need to decode your regex bit by bit. But, thanks so much for fast response.
        This site loves to explain code .. so keep those questions coming.

        @$r is the same as @{ $r } .
        $r is an array-reference ... adding the @{ } around it converts that into an array that can be iterated.

        'map' will "transform" each element of the array, returning a modified array.

        The "transformation" is in the form of a regular-expression - in this case, it extracts "word" type characters (\w), decimals (.) and so on. See perlre.

        The result of this map is stored in the array @cols, which is later printed.

        Hope this helps. If still unclear .. experiment, and come back with more specific questions.

                ...it is unhealthy to remain near things that are in the process of blowing up.     man page for WARP, by Larry Wall

Re: HTML::TableExtract - ugly - is there better way?
by poj (Abbot) on Apr 09, 2017 at 14:52 UTC

    If you know the characters you want, just eliminate everything else.

    #!perl use strict; use warnings; use HTML::TableExtract; use LWP::UserAgent (); my $url = 'http://www.nasdaq.com/extended-trading/premarket-mostactive +.aspx'; my $headers = ['Symbol', 'Last Sale*', 'Change Net / %', 'Share Volume +']; my $ua = LWP::UserAgent->new; $ua->timeout(10); $ua->env_proxy; my $response = $ua->get($url); if ( !$response->is_success) { die $response->status_line; } my $htm = $response->decoded_content; # table4 my $table_extract = HTML::TableExtract->new( count => 4, headers => $headers); my $tbl = $table_extract->parse($htm); my $data = cleanup($tbl); report('Advances',$data); # table5 $table_extract = HTML::TableExtract->new( count => 5, headers => $headers); $tbl = $table_extract->parse($htm); $data = cleanup($tbl); report('Decliners',$data); sub cleanup { my $table = shift; my @data = (); for my $row ($table->rows) { my @clean = map{ s/[^A-Z0-9%,+-\.]/ /g; # allowable s/^ +| +$//g; # trim spaces $_ } @$row; push @data,\@clean; } return \@data; } sub report { my ($title,$data) = @_; print "$title\n"; for (@$data){ my ($stock,$openpr,$tmp,$vol) = @$_; my ($change,$pct) = split / +/,$tmp; my $closepr = $openpr - $change; print join "\t",($stock,'$'.$closepr,'$'.$openpr,$pct,$vol); print "\n"; } }
    poj
      POJ - WOW! Impressed by how quick one gets a response. Thank you for adding code to pull the webpage in the script. Also, need to learn to be 'smarter' about parsing. Never occurred to me to just take out the special characters and replace with space. Using the 'pipe' to drop leading and trailing spaces is cool too. Haven't run into Map before but read a little about it today. Need to understand what @$row is/does. Using subroutines was helpful too. This beginner is very appreciative.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1187504]
Approved by Athanasius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (4)
As of 2024-04-23 20:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found