PerlMonks  
Memory Hog Code - Where's the Pileup?

by awohld (Hermit)
on Nov 03, 2005 at 03:05 UTC ( [id://505204] )

awohld has asked for the wisdom of the Perl Monks concerning the following question:

I wrote this code that goes to 4 different web pages for each of 40 different cities (160 pages in all), parses the HTML tables, stores them in CSV files, and then injects them into a DB. For some reason I can't figure out, the memory usage keeps racking up, climbing past 500 MB as time goes on.

I guess I'm not deallocating memory somewhere, and I'm guessing it has something to do with:

    open ( my $infile, "/www/cgi-bin/udr/data/$market.$form.csv" ) or die "$market.$form.csv: $!";

I don't think it is a problem, since the lexical filehandle goes out of scope on each successive loop iteration. Am I wrong? Do I need to manually close it? If it is indeed the problem, how would you do that here?

The total row count across all these tables is over 330,000, so it's a lot of data.

I also know it's bad, but I had to comment out use strict; I couldn't figure out how to get it working with Tie::File. I get the error: Bareword "Tie::File" not allowed while "strict subs" in use at ./script.pl

Can anyone see where I'm piling up the memory?

Here's my code:
    #!/usr/bin/perl -w
    use WWW::Mechanize;
    use HTML::TableExtract;
    use DBI;
    use Text::CSV_XS;
    use Tie::File;
    use Date::Calc qw( Today Day_of_Week Add_Delta_Days);
    use HTML::TokeParser::Simple;

    # Commented out strict since errors out with Tie::File
    #use strict;

    # Start: Get the names of Cities from a HTML dropdown box on this page.
    # Store the names of the cities in the @markets array.
    my $basePage = 'http://192.168.0.1/';
    my $mech = WWW::Mechanize->new();
    $mech->get("$basePage");
    my $html = $mech->content();

    my $tp = HTML::TokeParser::Simple->new(\$html)
        or die "Couldn't parse string: $!";

    my ($start, @markets);
    while (my $t = $tp->get_token) {
        $start++, next if $t->is_start_tag('select');
        next unless $start;
        last if $t->is_end_tag('/select');
        push @markets, $t->get_attr('value') if $t->is_start_tag('option');
    }
    # END: Get the names of Cities from a HTML dropdown box on this page.

    my ($year,$month,$day) = Today();
    $month = sprintf("%02d", $month);
    $day   = sprintf("%02d", $day);

    # Form names to submit to HTML form
    my @forms = qw( AP SECT CPU FRAME );

    # DB Connection Info
    my $database  = "db";
    my $db_server = "localhost";
    my $user      = "user";
    my $password  = "pass";

    # Connect to database
    my $dbh = DBI->connect("DBI:mysql:$database:$db_server",$user,$password);

    # Start: Each city has a AP, SECT, CPU, and FRAME page.
    # Download each page and store it as a CSV file in a sub directory.
    foreach my $form (@forms) {
      foreach my $market (@markets) {
        $mech->get("http://192.168.0.1/cgi-bin/getmarket?market=$market");
        $mech->submit_form( fields => { table => "$form" } );
        die unless ($mech->success);
        $mech->submit_form( button => 'action');
        die unless ($mech->success);

        my $html = $mech->content();
        my $te = HTML::TableExtract->new;
        $te->parse($html);

        open(OUT,'>',"/www/cgi-bin/udr/data/$market.$form.csv")
            || die("Cannot Open File");

        # Start: Modify Header row by adding MARKET as the first column.
        # Take the HTML form and make it a CSV file.
        my $rowNumber = 0;
        foreach my $row ($te->rows) {
            if ( $rowNumber == 0 ) {
                print OUT "MARKET," . join(',', @$row), "\n";
                $rowNumber++;
            }
            print OUT "$market," . join(',', @$row), "\n";
        }
        close OUT;
        # END: Modify Header row by adding MARKET as the first column.

        # Start: Open the CSV file, if there are duplicate header rows,
        # make them not duplicates
        my $csv = Text::CSV_XS->new();
        open ( my $infile, "/www/cgi-bin/udr/data/$market.$form.csv" )
            or die "$market.$form.csv: $!";
        my $hdr = $csv->getline( $infile );

        my %seen    = ();
        my @newlist = ();
        foreach my $item (@$hdr) {
            if (!$seen{$item}) {
                $seen{$item} = 1;
                push(@newlist, $item);
            }
            else {
                push(@newlist, "$item" . "_Duplicate_$seen{$item}");
                $seen{$item}++;
            }
        }
        # END: Open the CSV file, if there are duplicate header rows,
        # make them not duplicates

        # Start: DROP the old table from the DB, create a new one,
        # and inject the CSV file.
        $dbh->do("DROP TABLE $market\_$form");
        my $SQL = "CREATE TABLE $market\_$form ("
            . join( " varchar(255),", @newlist ) . " varchar(255))";
        $dbh->do($SQL) or die "Die";

        # Delete the header row for injection into DB
        tie my @lines, Tie::File, "/www/cgi-bin/udr/data/$market.$form.csv"
            or die "can't update $market.$form.csv: $!";
        shift(@lines);
        shift(@lines);
        untie @lines;

        $SQL = "LOAD DATA LOCAL INFILE '/www/cgi-bin/udr/data/$market.$form.csv'
            INTO TABLE `$market\_$form` FIELDS TERMINATED BY ','
            LINES TERMINATED BY '\n' ";
        $dbh->do($SQL) or die "Die";
        # END: DROP the old table from the DB, create a new one,
        # and inject the CSV file.
      }
    }

Replies are listed 'Best First'.
Re: Memory Hog Code - Where's the Pileup?
by perrin (Chancellor) on Nov 03, 2005 at 05:30 UTC
    From the WWW::Mechanize FAQ:

    Mech is a big memory pig! I'm running out of RAM!

    Mech keeps a history of every page, and the state it was in. It actually keeps a clone of the full Mech object at every step along the way.

    You can limit this stack size with the stack_depth parm in the new() constructor.
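    For example, limiting the history in the constructor might look like this (a sketch based on the Mech docs, not tested against the OP's script):

        use WWW::Mechanize;

        # stack_depth => 0 tells Mech to keep no page history at all,
        # so memory use stays roughly flat no matter how many of the
        # 160 pages are fetched.
        my $mech = WWW::Mechanize->new( stack_depth => 0 );

    A small positive stack_depth keeps back()/reload() working for the last few pages while still capping growth.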

      perrin,
      Andy recently made a change to WWW::Mechanize for speed. The change deferred construction of certain objects of a given page until they were asked for. This could have a minor positive memory impact as well.

      Cheers - L~R

        Maybe, but this code already copies the entire page about 4 times in memory ($mech, $html, $te, $te->rows), so any large page is going to take a big hunk of RAM regardless.
      Ahh yes, I now remember having to flush out WWW::Mech manually; there's a method for that. I'll try it and get back to you guys!
Re: Memory Hog Code - Where's the Pileup?
by dragonchild (Archbishop) on Nov 03, 2005 at 04:03 UTC
    Depending on the Perl version, tie can be a memory leak. Since you're not actually using the benefits of Tie::File, why not just strip the headers when you create the file about 20 lines up? Or, am I missing something ...
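    For instance, the OP's CSV-writing loop could simply never print the header rows and remember them for the CREATE TABLE instead (a sketch; assumes the same $te, $market, and OUT filehandle as the original, and a hypothetical @header variable):

        # Write only data rows; the header never reaches the CSV file,
        # so there is nothing to strip out with Tie::File afterwards.
        my @header;
        my $rowNumber = 0;
        foreach my $row ($te->rows) {
            if ( $rowNumber++ == 0 ) {
                # keep the header in memory for building the SQL later,
                # but don't print it
                @header = @$row;
                next;
            }
            print OUT "$market," . join(',', @$row), "\n";
        }
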

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      I'm using Perl v5.8.6; is that one of the versions where tie leaks memory?

      I use the headers after the file is created to create my SQL and look for duplicate headers. Maybe I should rearrange my code so I don't have to use Tie::File.
Re: Memory Hog Code - Where's the Pileup?
by chas (Priest) on Nov 03, 2005 at 03:23 UTC
    I'm not sure about the memory problem, but wouldn't quoting Tie::File in the line
    tie my @lines, Tie::File, "/www/cgi-bin/udr/data/$market.$form.csv"
    solve the bareword problem?
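    That is, quoting the class name is what "strict subs" wants (a minimal sketch with a placeholder path standing in for the OP's $market.$form file):

        use strict;
        use Tie::File;

        my $file = "/www/cgi-bin/udr/data/example.csv";  # placeholder path

        # Quoting 'Tie::File' satisfies "strict subs"; the unquoted
        # class name is what triggers the bareword error.
        tie my @lines, 'Tie::File', $file
            or die "can't update $file: $!";
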
    chas
Re: Memory Hog Code - Where's the Pileup?
by vek (Prior) on Nov 03, 2005 at 05:16 UTC

    I don't think line 97:

    open ( my $infile, "/www/cgi-bin/udr/data/$market.$form.csv" ) or die "$market.$form.csv: $!";
    is your problem because that line is only opening a file and creating a filehandle. No actual data from the file has been read at that point.

    I don't see anything instantly wrong with your code that would cause the memory problems you describe. dragonchild does raise a good point regarding your usage of Tie::File as perhaps being overkill for what you're doing. Try performing the header code yourself and see if that improves memory usage. You'd at least know if you'd been bitten by the tie memory leakage problem dragonchild describes.

    -- vek --

Node Type: perlquestion [id://505204]
Approved by Zaxo