martin_ldn has asked for the wisdom of the Perl Monks concerning the following question:
Hi there,
I am new to Perl, so I hope this isn't something really simple! I feel like I've been banging my head against a brick wall for hours and still haven't figured it out. I have written a script to parse thousands of HTML files in a directory. Each file contains a table, and I import the table entries into a MySQL database. Functionally it works fine. However, there is some sort of memory problem, and a big one: roughly every 1000 MySQL queries consume another percentage point of my memory according to top (I have 2GB RAM). I have used the debugger and tried inserting Dump statements but haven't tracked it down yet. Can someone help? The code is below.
Thanks in advance :-)
Martin
#!/usr/bin/perl
use HTML::TableContentParser;
use HTML::Parse;
use HTML::FormatText;
use DBI;
use strict;
use warnings;
# Connect to database and create parser object
my $db = DBI->connect ("DBI:mysql:newsbms","newsbms", "newsbms",
{ RaiseError => 1, PrintError => 0});
# Loop twice
my $loopround = 1;
while ($loopround <= 2)
{
# Choose the table name
my $tablename = "modified";
if ($loopround == 2)
{
$tablename = "deleted";
}
print "\nProcessing the '$tablename' entries...\n\n";
# Create counters to show the number of files and queries processed
my $counter = 0;
my $query_counter = 0;
# Open the directory
my $dirname = "/home/martin/monitoring/newsBMS/$tablename/";
opendir(DIR, $dirname) || die ("Could not open $dirname");
# Loop through all files in the directory
while (defined(my $filename = readdir(DIR)))
{
# Skip special "files": '.' and '..'
next if $filename =~ /^\.\.?$/;
$counter++;
# Open and read the html file into a single string
open(HTMLFILE, $dirname.$filename) || die ("Could not open $filename");
binmode HTMLFILE;
my $html = join("", <HTMLFILE>);
close(HTMLFILE);
# Parse the html tables
my $tcp = HTML::TableContentParser->new;
my $tables = $tcp->parse($html);
# Remove the html tags from the cells
for my $t (@$tables) {
for my $r (@{ $t->{rows} }) {
for my $c (@{ $r->{cells} })
{
my $stripper = HTML::FormatText->new;
$c->{data} = $stripper->format(parse_html($c->{data}));
$c->{data} =~ s/'/-/g;
$c->{data} =~ s/[:\\:]/-/g;
}
}
}
# Issue the MySQL queries
for my $t (@$tables)
{
for my $r (@{ $t->{rows} })
{
my $query = "INSERT INTO";
if ($loopround == 1)
{
$query = $query . " modified (id, name, title, duration,";
$query = $query . "library, modified, user, rev) VALUES (";
}
if ($loopround == 2)
{
$query = $query . " deleted (name, title, duration,";
$query = $query . "deleted, library) VALUES (";
}
for my $c (@{ $r->{cells} })
{
chop($c->{data}); # remove the \n
$query = $query . "'" . $c->{data} . "',";
}
chop($query); # Remove the last comma added
$query = $query . ") ON DUPLICATE KEY UPDATE duplicates=duplicates+1";
#print "Query = $query \n\n";
my $execute = $db->prepare($query);
$execute->execute();
$query_counter++;
if ($query_counter % 1000 == 0) {
print "Issued $query_counter MySQL queries.\n";
}
}
}
}
# Close the directory
closedir(DIR);
print "\nDone the '$tablename' table.\nProcessed $counter files and issued $query_counter MySQL queries.\n";
$loopround++;
}
# Disconnect from the database
$db->disconnect();
print "\nProgram Finished.\n";
Re: Massive Memory Leak
by afoken (Chancellor) on Dec 07, 2009 at 16:46 UTC
I don't see a finish() call for the database statement handle. That could be a problem, even if the handle SHOULD be closed automatically when it goes out of scope.
And you have a real problem, because you really try hard to avoid DBI placeholders. You are building a gaping security hole called SQL injection, and you sabotage any caching that MySQL may attempt. Build and prepare() the statement once, then call it repeatedly, passing all variables to execute(), and finally finish() it.
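A minimal sketch of that prepare-once/execute-many pattern, using the table and connection details from the OP's script (the @rows data here is a made-up stand-in for the per-file parsing loop):

```perl
use strict;
use warnings;
use DBI;

# Connection details copied from the OP's script.
my $db = DBI->connect("DBI:mysql:newsbms", "newsbms", "newsbms",
                      { RaiseError => 1, PrintError => 0 });

# Prepare once, outside the loop. The '?' placeholders let DBI,
# the driver, and MySQL cache the parsed statement, and all
# quoting is handled for us.
my $sth = $db->prepare(
    "INSERT INTO modified (id, name, title, duration,
                           library, modified, user, rev)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?)
     ON DUPLICATE KEY UPDATE duplicates = duplicates + 1"
);

# One arrayref of cell values per table row (illustrative data).
my @rows = ( [ 1, 'clip01', 'News', '00:30', 'libA', 'today', 'martin', 1 ] );
for my $row (@rows) {
    $sth->execute(@$row);   # values travel separately from the SQL text
}

$sth->finish();             # release the statement handle explicitly
$db->disconnect();
```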
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Massive Memory Leak
by mikelieman (Friar) on Dec 07, 2009 at 16:42 UTC
It looks like you're preparing a query on every iteration.
Perhaps preparing it once up front and using placeholders will help?
Re: Massive Memory Leak
by graff (Chancellor) on Dec 07, 2009 at 20:02 UTC
There are a few odd things about the OP code.
Apart from that, I wouldn't know whether the memory leak is due to the "unfinished" query statement objects, as suggested by others above, or whether it's due to stranded (non-garbage-collected) objects in the HTML parsing/formatting modules.
One way to tease that apart would be to divide the process into two distinct steps (two processes): step/process 1 is to parse the html data and output tab-delimited flat tables that can then be inserted into your database by any of several easy methods. If that process succeeds, you can conclude that it was the database interaction that caused the leak.
In any case, a better way to load large quantities of data into MySQL tables is the "mysqlimport" tool that comes with MySQL -- it's incredibly fast compared to using Perl/DBI for inserts, and it's the best/easiest way to load a table from a tab-delimited flat-text file. (Rather, Perl/DBI is incredibly slow relative to mysqlimport.)
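A sketch of step 1 of that two-process split: dump the parsed cells as tab-delimited text that mysqlimport can load. The row data and output filename here are illustrative (mysqlimport loads a file into the table named after the file):

```perl
use strict;
use warnings;

# Hypothetical rows, in the shape the OP's parser produces:
# one arrayref of cell values per table row.
my @rows = (
    [ 1, 'clip01', 'Morning News', '00:30', 'libA', '2009-12-07', 'martin', 3 ],
    [ 2, 'clip02', 'Evening News', '01:00', 'libB', '2009-12-07', 'martin', 1 ],
);

# Step 1: write a tab-delimited flat file, one table row per line.
open my $out, '>', 'modified.txt' or die "Could not open modified.txt: $!";
for my $row (@rows) {
    print {$out} join("\t", @$row), "\n";
}
close $out;

# Step 2 (from the shell, outside Perl):
#   mysqlimport --local newsbms modified.txt
# which loads modified.txt into the 'modified' table.
```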
Another idea: since you are looping over two main directories, you might try just doing one directory per run (giving the desired path name on the command line). If you really want to do both in one run (after solving the memory leak issue), a better loop method would be:
for my $path ( 'modified', 'deleted' ) {
...
}
instead of that clunky while-loop.
Re: Massive Memory Leak
by martin_ldn (Initiate) on Dec 08, 2009 at 10:51 UTC
Hey all,
Thanks for the responses :-) I have identified the problem! It was with HTML::Parse. Instead I have used HTML::Strip, which not only fixes the memory leak but is also massively faster.
Thank you for pointing out the other bits too. As I say, I am new to Perl, so they are very helpful. I will look into the security problems, but as the machine is completely isolated (no network connection) I'm not sure it's an immediate problem. I will bear it in mind though. And graff, I like that loop more -- it looks more elegant!
Cheers,
Martin
Not using placeholders is not only a security problem. When you use placeholders, you allow DBI, DBD::whatever, and the database to cache a parsed form of your query. This can speed things up dramatically, even with simple SQL statements.
And you can completely get rid of any quoting problems for values you want to pass to the database. Use a placeholder and pass the actual value to execute(), no matter what it contains. You don't even have to know what quoting rules apply to your database.
- Background information:
- For most databases, the DBD can pass the SQL statement and values separately to the database, so even the DBD does not need to know quoting rules. The database can cache a precompiled version of the query, and needs to parse the query only once, no matter how often you use it. For those unlucky databases that do not support placeholders, the DBD provides all required quoting rules, and DBI and DBD take care of injecting properly quoted values into the query. At that point, at least DBI and DBD can cache a precompiled version of the query, so DBI and DBD are still more efficient even in that worst case than your code. And because a lot of the DBI/DBD code is written in C / XS, it is usually much faster than anything you can code in Perl.
Oh, and by the way: what happens if one of the values you want to insert contains a single quote? Right, your code dies, because you do not quote properly. If you still insist on quoting your values manually, at least use DBI's quote() method to quote the values properly.
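A small sketch of the options (the table name and value are illustrative; quote() is DBI's escaping helper for the connected driver):

```perl
use strict;
use warnings;
use DBI;

my $db = DBI->connect("DBI:mysql:newsbms", "newsbms", "newsbms",
                      { RaiseError => 1 });

my $title = "O'Reilly's report";   # contains single quotes

# Fragile: naive interpolation produces broken SQL on embedded quotes:
# my $bad = "INSERT INTO modified (title) VALUES ('$title')";

# Better: let DBI escape the value according to the driver's rules.
my $safe = "INSERT INTO modified (title) VALUES (" . $db->quote($title) . ")";
$db->do($safe);

# Best: a placeholder -- no quoting rules to know at all.
$db->do("INSERT INTO modified (title) VALUES (?)", undef, $title);
```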
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
There are some fair points there. To get around the ' (or \) problem I simply replaced the character with a - beforehand, which was a workaround. I have now read the DBI docs and modified my program to be much better in terms of architecture and elegance. I have attached it in case it is of help to someone else :-) Check it out!
#!/usr/bin/perl
use HTML::TableContentParser;
use HTML::Strip;
use DBI;
use strict;
use warnings;
# Connect to database and create parser object
my $db = DBI->connect ("DBI:mysql:newsbms","newsbms", "newsbms",
{ RaiseError => 1, PrintError => 0});
for my $path( 'modified', 'deleted' )
{
print "\nProcessing the '$path' entries...\n\n";
# Create counters
my $counter = 0;
my $query_counter = 0;
# Open the directory
my $dirname = "/home/martinn/monitoring/newsBMS/$path/";
opendir(DIR, $dirname) || die ("Could not open $dirname");
# Prepare the MySQL statement
my $query = "INSERT INTO";
if ($path eq 'modified')
{
$query = $query . " modified (id, name, title, duration,
library, modified, user, rev) VALUES (
?, ?, ?, ?, ?, ?, ?, ? )";
}
if ($path eq 'deleted')
{
$query = $query . " deleted (name, title, duration,
deleted, library) VALUES (
?, ?, ?, ?, ? )";
}
$query = $query . " ON DUPLICATE KEY UPDATE duplicates=duplicates+1";
my $statement = $db->prepare($query);
# Loop through all files in the directory
while (defined(my $filename = readdir(DIR)))
{
# Skip special "files": '.' and '..'
next if $filename =~ /^\.\.?$/;
$counter++;
# Open and read the html file into a single string
open(HTMLFILE, $dirname.$filename) || die ("Couldn't open $filename");
binmode HTMLFILE;
my $html = join("", <HTMLFILE>);
close(HTMLFILE);
# Parse the html table
my $tcp = HTML::TableContentParser->new;
my $tables = $tcp->parse($html);
# Issue the MySQL queries
for my $t (@$tables)
{
for my $r (@{ $t->{rows} })
{
my @values;
for my $c (@{ $r->{cells} })
{
# Remove the html tags from the cells
my $stripper = HTML::Strip->new();
$c->{data} = $stripper->parse($c->{data});
# Add cell to the end of the array
push(@values, $c->{data});
}
$statement->execute(@values);
$query_counter++;
# Basic activity monitor
if ($query_counter % 5000 == 0)
{
print "Issued $query_counter MySQL queries.\n";
}
}
}
}
# Close the directory
closedir(DIR);
# Finish the MySQL statement
$statement->finish();
print "\nDone the '$path' table.\n";
print "Processed $counter files and issued $query_counter MySQL queries.\n";
}
# Disconnect from the database
$db->disconnect();
print "\nProgram Finished.\n";
Re: Massive Memory Leak
by ikegami (Patriarch) on Dec 08, 2009 at 09:14 UTC
I see HTML::TreeBuilder objects being created, but I don't see any call to its delete method.
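In the OP's code that would look something like this sketch: parse_html() builds an HTML::TreeBuilder-based tree, and delete() breaks its internal parent/child links so the tree can be freed (the input string here is a made-up example):

```perl
use strict;
use warnings;
use HTML::Parse;        # provides parse_html()
use HTML::FormatText;

my $formatter = HTML::FormatText->new;

my $tree = parse_html("<td>some <b>cell</b> data</td>");
my $text = $formatter->format($tree);

# HTML::Element trees hold circular parent/child references;
# delete() severs them so Perl's reference counting can free the tree.
$tree->delete;
```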
Re: Massive Memory Leak
by weismat (Friar) on Dec 08, 2009 at 08:05 UTC
Perl's garbage collector cannot handle circular references.
$c->{data} = $stripper->format(parse_html($c->{data})); could be one place where things go wrong.
Otherwise I would suggest you to look at some modules like Devel::Leak, Devel::LeakTrace, Devel::Cycle, Object::Destroyer, Devel::Monitor to find the bad boys.
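A sketch of hunting such leaks with Devel::Cycle (the data structure here is a deliberately leaky made-up example; find_cycle() reports every reference cycle it finds in the structure you pass it):

```perl
use strict;
use warnings;
use Devel::Cycle;

# Parent and child point at each other, so neither refcount
# ever drops to zero -- a classic leak under refcounting.
my $parent = { name => 'parent' };
my $child  = { name => 'child', parent => $parent };
$parent->{child} = $child;

find_cycle($parent);      # prints the parent -> child -> parent cycle

# Breaking the cycle (or weakening one link with Scalar::Util::weaken)
# lets both hashes be freed normally.
delete $parent->{child};
```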
Re: Massive Memory Leak
by weismat (Friar) on Dec 08, 2009 at 08:05 UTC
Double post - please delete.