PerlMonks  
Memory Hog Code - Where's the Pileup?

by awohld (Hermit)
on Nov 03, 2005 at 03:05 UTC ( [id://505204] )

awohld has asked for the wisdom of the Perl Monks concerning the following question:

I wrote this code that goes to 4 different web pages for each of 40 different cities (160 pages in all), parses the HTML tables, stores them in CSV files, and then injects them into a DB. For some reason I can't figure out, the memory usage keeps racking up, climbing past 500 MB as time goes on.

I guess I'm not deallocating memory somewhere, and I'm guessing it has something to do with:

    open ( my $infile, "/www/cgi-bin/udr/data/$market.$form.csv" ) or die "$market.$form.csv: $!";

I don't think it is a problem, since the lexical filehandle goes out of scope on each successive loop iteration. Am I wrong? Do I need to manually close it? If it is indeed the problem, how would you do that here?

The total row count across all these tables is over 330,000, so it's a lot of data.

I also know it's bad, but I had to comment out use strict; I couldn't figure out how to get it working with Tie::File. I get the error: Bareword "Tie::File" not allowed while "strict subs" in use at ./script.pl

Can anyone see where I'm piling up the memory?

Here's my code:
    #!/usr/bin/perl -w
    use WWW::Mechanize;
    use HTML::TableExtract;
    use DBI;
    use Text::CSV_XS;
    use Tie::File;
    use Date::Calc qw( Today Day_of_Week Add_Delta_Days);
    use HTML::TokeParser::Simple;

    # Commented out strict since errors out with Tie::File
    #use strict;

    # Start: Get the names of Cities from a HTML dropdown box on this page.
    # Store the names of the cities in the @markets array.
    my $basePage = 'http://192.168.0.1/';
    my $mech = WWW::Mechanize->new();
    $mech->get("$basePage");
    my $html = $mech->content();

    my $tp = HTML::TokeParser::Simple->new(\$html)
        or die "Couldn't parse string: $!";

    my ($start, @markets);
    while (my $t = $tp->get_token) {
        $start++, next if $t->is_start_tag('select');
        next unless $start;
        last if $t->is_end_tag('/select');
        push @markets, $t->get_attr('value') if $t->is_start_tag('option');
    }
    # END: Get the names of Cities from a HTML dropdown box on this page.

    my ($year,$month,$day) = Today();
    $month = sprintf("%02d", $month);
    $day   = sprintf("%02d", $day);

    # Form names to submit to HTML form
    my @forms = qw( AP SECT CPU FRAME );

    # DB Connection Info
    my $database  = "db";
    my $db_server = "localhost";
    my $user      = "user";
    my $password  = "pass";

    # Connect to database
    my $dbh = DBI->connect("DBI:mysql:$database:$db_server",$user,$password);

    # Start: Each city has a AP, SECT, CPU, and FRAME page.
    # Download each page and store it as a CSV file in a sub directory.
    foreach my $form (@forms) {
      foreach my $market (@markets) {
        $mech->get("http://192.168.0.1/cgi-bin/getmarket?market=$market");
        $mech->submit_form( fields => { table => "$form" } );
        die unless ($mech->success);
        $mech->submit_form( button => 'action');
        die unless ($mech->success);

        my $html = $mech->content();
        my $te = HTML::TableExtract->new;
        $te->parse($html);

        open(OUT,'>',"/www/cgi-bin/udr/data/$market.$form.csv")
            || die("Cannot Open File");

        # Start: Modify Header row by adding MARKET as the first column.
        # Take the HTML form and make it a CSV file.
        my $rowNumber = 0;
        foreach my $row ($te->rows) {
            if ( $rowNumber == 0 ) {
                print OUT "MARKET," . join(',', @$row), "\n";
                $rowNumber++;
            }
            print OUT "$market," . join(',', @$row), "\n";
        }
        close OUT;
        # END: Modify Header row by adding MARKET as the first column.

        # Start: Open the CSV file, if there are duplicate header rows,
        # make them not duplicates
        my $csv = Text::CSV_XS->new();
        open ( my $infile, "/www/cgi-bin/udr/data/$market.$form.csv" )
            or die "$market.$form.csv: $!";
        my $hdr = $csv->getline( $infile );

        my %seen    = ();
        my @newlist = ();
        foreach my $item (@$hdr) {
            if (!$seen{$item}) {
                $seen{$item} = 1;
                push(@newlist, $item);
            }
            else {
                push(@newlist, "$item" . "_Duplicate_$seen{$item}");
                $seen{$item}++;
            }
        }
        # END: Open the CSV file, if there are duplicate header rows,
        # make them not duplicates

        # Start: DROP the old table from the DB, create a new one,
        # and inject the CSV file.
        $dbh->do("DROP TABLE $market\_$form");
        my $SQL = "CREATE TABLE $market\_$form ("
            . join( " varchar(255),", @newlist ) . " varchar(255))";
        $dbh->do($SQL) or die "Die";

        # Delete the header row for injection into DB
        tie my @lines, Tie::File, "/www/cgi-bin/udr/data/$market.$form.csv"
            or die "can't update $market.$form.csv: $!";
        shift(@lines);
        shift(@lines);
        untie @lines;

        $SQL = "LOAD DATA LOCAL INFILE '/www/cgi-bin/udr/data/$market.$form.csv'
            INTO TABLE `$market\_$form` FIELDS TERMINATED BY ','
            LINES TERMINATED BY '\n' ";
        $dbh->do($SQL) or die "Die";
        # END: DROP the old table from the DB, create a new one,
        # and inject the CSV file.
      }
    }

Replies are listed 'Best First'.
Re: Memory Hog Code - Where's the Pileup?
by perrin (Chancellor) on Nov 03, 2005 at 05:30 UTC
    From the WWW::Mechanize FAQ:

    Mech is a big memory pig! I'm running out of RAM!

    Mech keeps a history of every page, and the state it was in. It actually keeps a clone of the full Mech object at every step along the way.

    You can limit this stack size with the stack_depth parm in the new() constructor.
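    For example, limiting the history in the constructor might look like this (a sketch based on the Mech docs, not tested against the OP's script):

        use WWW::Mechanize;

        # stack_depth => 0 tells Mech to keep no page history at all,
        # so memory use stays roughly flat no matter how many of the
        # 160 pages are fetched.
        my $mech = WWW::Mechanize->new( stack_depth => 0 );

    A small positive stack_depth keeps back()/reload() working for the last few pages while still capping growth.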

      perrin,
      Andy recently made a change to WWW::Mechanize for speed. The change deferred construction of certain objects of a given page until they were asked for. This could have a minor positive memory impact as well.

      Cheers - L~R

        Maybe, but this code already copies the entire page about 4 times in memory ($mech, $html, $te, $te->rows), so any large page is going to take a big hunk of RAM regardless.
      Ahh yes, I now remember having to flush out WWW::Mech manually; there's a method for that. I'll try it and get back to you guys!
Re: Memory Hog Code - Where's the Pileup?
by dragonchild (Archbishop) on Nov 03, 2005 at 04:03 UTC
    Depending on the Perl version, tie can be a memory leak. Since you're not actually using the benefits of Tie::File, why not just strip the headers when you create the file about 20 lines up? Or, am I missing something ...
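    For instance, the OP's CSV-writing loop could simply never print the header rows and remember them for the CREATE TABLE instead (a sketch; assumes the same $te, $market, and OUT filehandle as the original, and a hypothetical @header variable):

        # Write only data rows; the header never reaches the CSV file,
        # so there is nothing to strip out with Tie::File afterwards.
        my @header;
        my $rowNumber = 0;
        foreach my $row ($te->rows) {
            if ( $rowNumber++ == 0 ) {
                # keep the header in memory for building the SQL later,
                # but don't print it
                @header = @$row;
                next;
            }
            print OUT "$market," . join(',', @$row), "\n";
        }
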

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      I'm using Perl v5.8.6; is that one of the versions where tie leaks memory?

      I use the headers after the file is created to create my SQL and look for duplicate headers. Maybe I should rearrange my code so I don't have to use Tie::File.
Re: Memory Hog Code - Where's the Pileup?
by chas (Priest) on Nov 03, 2005 at 03:23 UTC
    I'm not sure about the memory problem, but wouldn't quoting Tie::File in the line
    tie my @lines, Tie::File, "/www/cgi-bin/udr/data/$market.$form.csv"
    solve the bareword problem?
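    That is, quoting the class name is what "strict subs" wants (a minimal sketch with a placeholder path standing in for the OP's $market.$form file):

        use strict;
        use Tie::File;

        my $file = "/www/cgi-bin/udr/data/example.csv";  # placeholder path

        # Quoting 'Tie::File' satisfies "strict subs"; the unquoted
        # class name is what triggers the bareword error.
        tie my @lines, 'Tie::File', $file
            or die "can't update $file: $!";
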
    chas
Re: Memory Hog Code - Where's the Pileup?
by vek (Prior) on Nov 03, 2005 at 05:16 UTC

    I don't think line 97:

    open ( my $infile, "/www/cgi-bin/udr/data/$market.$form.csv" ) or die "$market.$form.csv: $!";
    is your problem because that line is only opening a file and creating a filehandle. No actual data from the file has been read at that point.

    I don't see anything instantly wrong with your code that would cause the memory problems you describe. dragonchild does raise a good point regarding your usage of Tie::File as perhaps being overkill for what you're doing. Try performing the header code yourself and see if that improves memory usage. You'd at least know if you'd been bitten by the tie memory leakage problem dragonchild describes.

    -- vek --

Node Type: perlquestion [id://505204]
Approved by Zaxo