PerlMonks
Massive Memory Leak

by martin_ldn (Initiate)
on Dec 07, 2009 at 15:58 UTC [id://811527]

martin_ldn has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I am new to Perl, so I hope this isn't something really simple! I feel like I've been banging my head against a brick wall for hours now and still haven't figured it out. I have written a script to parse thousands of HTML files in a directory; each file contains a table, and I import the table entries into a MySQL database. Functionally it works fine. However, there is some sort of memory problem, and a big one: according to top, roughly every 1000 MySQL queries issued consume another percent of my memory (I have 2GB of RAM). I have used the debugger and tried inserting Dump statements but haven't found the cause yet. Can someone help? The code is below.

Thanks in advance :-)

Martin

#!/usr/bin/perl
use HTML::TableContentParser;
use HTML::Parse;
use HTML::FormatText;
use DBI;
use strict;
use warnings;

# Connect to database and create parser object
my $db = DBI->connect("DBI:mysql:newsbms", "newsbms", "newsbms",
                      { RaiseError => 1, PrintError => 0 });

# Loop twice
my $loopround = 1;
while ($loopround <= 2) {
    # Choose the table name
    my $tablename = "modified";
    if ($loopround == 2) {
        $tablename = "deleted";
    }
    print "\nProcessing the '$tablename' entries...\n\n";

    # Create counters to show the number of files and queries processed
    my $counter = 0;
    my $query_counter = 0;

    # Open the directory
    my $dirname = "/home/martin/monitoring/newsBMS/$tablename/";
    opendir(DIR, $dirname) || die("Could not open $dirname");

    # Loop through all files in the directory
    while (defined(my $filename = readdir(DIR))) {
        # Skip special "files": '.' and '..'
        next if $filename =~ /^\.\.?$/;
        $counter++;

        # Open and read the html file into a single string
        open(HTMLFILE, $dirname.$filename) || die("Could not open $filename");
        binmode HTMLFILE;
        my $html = join("", <HTMLFILE>);
        close(HTMLFILE);

        # Parse the html tables
        my $tcp = HTML::TableContentParser->new;
        my $tables = $tcp->parse($html);

        # Remove the html tags from the cells
        for my $t (@$tables) {
            for my $r (@{ $t->{rows} }) {
                for my $c (@{ $r->{cells} }) {
                    my $stripper = HTML::FormatText->new;
                    $c->{data} = $stripper->format(parse_html($c->{data}));
                    $c->{data} =~ s/'/-/g;
                    $c->{data} =~ s/[:\\:]/-/g;
                }
            }
        }

        # Issue the MySQL queries
        for my $t (@$tables) {
            for my $r (@{ $t->{rows} }) {
                my $query = "INSERT INTO";
                if ($loopround == 1) {
                    $query = $query . " modified (id, name, title, duration,";
                    $query = $query . "library, modified, user, rev) VALUES (";
                }
                if ($loopround == 2) {
                    $query = $query . " deleted (name, title, duration,";
                    $query = $query . "deleted, library) VALUES (";
                }
                for my $c (@{ $r->{cells} }) {
                    chop($c->{data});    # remove the \n
                    $query = $query . "'" . $c->{data} . "',";
                }
                chop($query);    # Remove the last comma added
                $query = $query . ") ON DUPLICATE KEY UPDATE duplicates=duplicates+1";
                #print "Query = $query \n\n";
                my $execute = $db->prepare($query);
                $execute->execute();
                $query_counter++;
                if ($query_counter % 1000 == 0) {
                    print "Issued $query_counter MySQL queries.\n";
                }
            }
        }
    }

    # Close the directory
    closedir(DIR);
    print "\nDone the '$tablename' table.\nProcessed $counter files and issued $query_counter MySQL queries.\n";
    $loopround++;
}

# Disconnect from the database
$db->disconnect();
print "\nProgram Finished.\n";

Replies are listed 'Best First'.
Re: Massive Memory Leak
by afoken (Chancellor) on Dec 07, 2009 at 16:46 UTC

    I don't see a finish() call for the database statement handle. That could be a problem, even though the handle SHOULD be closed automatically when it goes out of scope.

    And you have a real problem, because you try really hard to avoid DBI placeholders. You are building a gaping security hole called SQL injection, and you sabotage any caching that MySQL may attempt. Build and prepare() the statement once, then call it repeatedly, passing all variables to execute(), and finally finish() it.
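    A minimal sketch of that pattern, using the column names from the OP's 'modified' table and assuming a connected handle in $db (not tested against the OP's schema):

    ```perl
    # Prepare ONCE, outside the file loop; the ?s are placeholders
    my $sth = $db->prepare(
        "INSERT INTO modified (id, name, title, duration,
                               library, modified, user, rev)
         VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
    );

    # ...then execute repeatedly inside the loop; DBI handles all
    # quoting and escaping of the values itself
    for my $r (@{ $t->{rows} }) {
        my @values = map { $_->{data} } @{ $r->{cells} };
        $sth->execute(@values);
    }

    $sth->finish();    # release the statement handle when done
    ```
    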

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Massive Memory Leak
by mikelieman (Friar) on Dec 07, 2009 at 16:42 UTC
    It looks like you're preparing a new query on every iteration.
    Perhaps preparing it once up front and using placeholders will help?
Re: Massive Memory Leak
by graff (Chancellor) on Dec 07, 2009 at 20:02 UTC
    There are a few odd things about the OP code:
    • Using HTML::Parse appears to be deprecated; the perldoc man page starts with:
      Disclaimer: This module is provided only for backwards compatibility with earlier versions of this library. New code should not use this module, and should really use the HTML::Parser and HTML::Treebuilder modules directly, instead.

    • This line in your first nested for loop seems to invoke a subroutine called "parse_html", which I would expect to turn up as "undefined":
      $c->{data} = $stripper->format(parse_html($c->{data}))

    Apart from that, I wouldn't know whether the memory leak is due to the "unfinished" query statement objects, as suggested by others above, or whether it's due to stranded (non-garbage-collected) objects in the HTML parsing/formatting modules.

    One way to tease that apart would be to divide the process into two distinct steps (two processes): step/process 1 is to parse the html data and output tab-delimited flat tables that can then be inserted into your database by any of several easy methods. If that process succeeds, you can conclude that it was the database interaction that caused the leak.

    In any case, a better way to load large quantities into mysql tables is via the "mysqlimport" tool that comes with mysql -- it's incredibly fast compared to using Perl/DBI for inserts, and it's the best/easiest way to load a table from a tab-delimited flat-text file. (Rather, Perl/DBI is incredibly slow relative to mysqlimport.)
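    For example (database, table, and file names here are just illustrative): after writing each table's rows to a tab-delimited file named after the target table, the load could look like:

    ```shell
    # mysqlimport derives the target table from the file's basename,
    # so modified.txt is loaded into the `modified` table
    mysqlimport --local --user=newsbms --password \
        --fields-terminated-by='\t' newsbms modified.txt
    ```
    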

    Another idea: since you are looping over two main directories, you might try just doing one directory per run (giving the desired path name on the command line). If you really want to do both in one run (after solving the memory leak issue), a better loop method would be:

    for my $path ( 'modified', 'deleted' ) { ... }
    instead of that clunky while-loop.
Re: Massive Memory Leak
by martin_ldn (Initiate) on Dec 08, 2009 at 10:51 UTC

    Hey all,

    Thanks for the responses :-) I have identified the problem! It was with HTML::Parse. Instead I have used HTML::Strip and this not only fixes the memory leak but is also massively faster.

    Thank you for pointing out the other bits too. As I say I am new to Perl so they are very helpful. I will look into the security problems but as the machine is completely isolated (no network connection) I'm not sure this is an immediate problem. I will bear it in mind though. And graff I like that loop more - looks more elegant!

    Cheers,
    Martin

      Not using placeholders is not only a security problem. When you use placeholders, you allow DBI, DBD::whatever, and the database to cache a parsed form of your query. This can speed up things dramatically, even with simple SQL statements.

      And you can get completely rid of any quoting problems for values you want to pass to the database. Use a placeholder and pass the actual value to execute(), no matter what it contains. You don't even have to know what quoting rules apply to your database.

      Background information:
      For most databases, the DBD can pass the SQL statement and the values separately to the database, so even the DBD does not need to know the quoting rules. The database can cache a precompiled version of the query and needs to parse it only once, no matter how often you use it. For those unlucky databases that do not support placeholders, the DBD provides all required quoting rules, and DBI and the DBD take care of injecting properly quoted values into the query. Even then, DBI and the DBD can cache a precompiled version of the query, so they are still more efficient in that worst case than your code. And because a lot of the DBI/DBD code is written in C / XS, it is usually much faster than anything you can code in Perl.

      Oh, and by the way: What happens if one of the values you want to insert contains a single quote? Right, your code dies, because you do not quote properly. If you still insist on quoting your values manually, at least use DBIs quote method to quote the values properly.
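      A sketch of that fallback, using the OP's $db handle and cell data (the surrounding query-building code is assumed from the OP's script):

      ```perl
      # quote() returns the value WITH surrounding quotes and with any
      # embedded quote characters escaped per the driver's rules
      my $safe = $db->quote($c->{data});
      $query = $query . $safe . ",";    # note: no manual '...' needed
      ```
      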

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        There are some fair points there. To get around the ' (or \) problem I simply replaced the character with a - beforehand, which was only a workaround. I have now looked at the DBI docs and modified my program to be much better in terms of architecture and elegance. I have attached it in case it is of help to someone else :-) Check it out!

#!/usr/bin/perl
use HTML::TableContentParser;
use HTML::Strip;
use DBI;
use strict;
use warnings;

# Connect to database and create parser object
my $db = DBI->connect("DBI:mysql:newsbms", "newsbms", "newsbms",
                      { RaiseError => 1, PrintError => 0 });

for my $path ( 'modified', 'deleted' ) {
    print "\nProcessing the '$path' entries...\n\n";

    # Create counters
    my $counter = 0;
    my $query_counter = 0;

    # Open the directory
    my $dirname = "/home/martinn/monitoring/newsBMS/$path/";
    opendir(DIR, $dirname) || die("Could not open $dirname");

    # Prepare the MySQL statement
    my $query = "INSERT INTO";
    if ($path eq 'modified') {
        $query = $query . " modified (id, name, title, duration, library, modified, user, rev) VALUES ( ?, ?, ?, ?, ?, ?, ?, ? )";
    }
    if ($path eq 'deleted') {
        $query = $query . " deleted (name, title, duration, deleted, library) VALUES ( ?, ?, ?, ?, ? )";
    }
    $query = $query . " ON DUPLICATE KEY UPDATE duplicates=duplicates+1";
    my $statement = $db->prepare($query);

    # Loop through all files in the directory
    while (defined(my $filename = readdir(DIR))) {
        # Skip special "files": '.' and '..'
        next if $filename =~ /^\.\.?$/;
        $counter++;

        # Open and read the html file into a single string
        open(HTMLFILE, $dirname.$filename) || die("Couldn't open $filename");
        binmode HTMLFILE;
        my $html = join("", <HTMLFILE>);
        close(HTMLFILE);

        # Parse the html table
        my $tcp = HTML::TableContentParser->new;
        my $tables = $tcp->parse($html);

        # Issue the MySQL queries
        for my $t (@$tables) {
            for my $r (@{ $t->{rows} }) {
                my @values;
                for my $c (@{ $r->{cells} }) {
                    # Remove the html tags from the cells
                    my $stripper = HTML::Strip->new();
                    $c->{data} = $stripper->parse($c->{data});
                    # Add cell to the end of the array
                    push(@values, $c->{data});
                }
                $statement->execute(@values);
                $query_counter++;
                # Basic activity monitor
                if ($query_counter % 5000 == 0) {
                    print "Issued $query_counter MySQL queries.\n";
                }
            }
        }
    }

    # Close the directory
    closedir(DIR);

    # Finish the MySQL statement
    $statement->finish();

    print "\nDone the '$path' table.\n";
    print "Processed $counter files and issued $query_counter MySQL queries.\n";
}

# Disconnect from the database
$db->disconnect();
print "\nProgram Finished.\n";
Re: Massive Memory Leak
by ikegami (Patriarch) on Dec 08, 2009 at 09:14 UTC
    I see HTML::TreeBuilder objects being created, but I don't see any call to its delete method.
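    In the OP's stripping loop that would mean something like this sketch (parse_html comes from HTML::Parse and returns an HTML::TreeBuilder-based tree; $stripper and $c are from the OP's code):

    ```perl
    my $tree = parse_html($c->{data});
    $c->{data} = $stripper->format($tree);
    $tree->delete;    # break the tree's internal circular references
    ```
    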
Re: Massive Memory Leak
by weismat (Friar) on Dec 08, 2009 at 08:05 UTC
    Perl's reference-counting garbage collector cannot handle circular references. $c->{data} = $stripper->format(parse_html($c->{data})); could be one place where things go wrong.
    Otherwise I would suggest you look at some modules like Devel::Leak, Devel::LeakTrace, Devel::Cycle, Object::Destroyer, Devel::Monitor to find the bad boys.
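    For instance, Devel::Cycle can report reference loops reachable from a suspect structure (a sketch, assuming the module is installed and $tables holds the parsed tables from the OP's code):

    ```perl
    use Devel::Cycle;

    # Prints a description of every circular reference
    # reachable from $tables, if any exist
    find_cycle($tables);
    ```
    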
Re: Massive Memory Leak
by weismat (Friar) on Dec 08, 2009 at 08:05 UTC
    double post - please delete.

Approved by Corion