Optimizing mysql and inserts

by gossamer (Sexton)
on Jul 11, 2012 at 16:51 UTC ( [id://981197] )

gossamer has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm a relative newbie with Perl. I've managed to create a script that reads from a few DB_File hashes and inserts their contents into a MySQL database.

Problem is, it's very slow. There are probably 650k records in the DB_File databases, and it takes many hours to finish. I can see a few fundamental ways to optimize this, but none that are going to really speed it up. I'm not even sure where the real bottlenecks are, because the processors are basically idle, the box has 8GB of RAM that's mostly unused, and the disks are also mostly idle. The box is otherwise not doing anything.

The script is designed to rsync the hashes from two remote mail hosts, which write the output from amavisd and SpamAssassin into the DB_File databases, in order to build a quarantine database.

Can I separate the script into two, so the DB_File databases are written to MySQL simultaneously? Will MySQL automatically handle the potential locking issues if I do that?

Can someone recommend or provide some ideas for optimizing the inserts into mysql with my script?

    #!/usr/bin/perl -w
    # use perl;

    use DB_File;
    use DBI;
    use File::Basename qw(basename);
    use strict;
    use vars qw($verbose);

    #
    # Prototypes
    #
    sub DBG($);

    my $me = basename($0);
    $me =~ s/\.pl$//;

    my $basedir   = '/var/www/myhost.com/';
    my $lockfile  = '/var/run/db-fetch-and-merge.lock';
    my $key       = $basedir . '/keys/mail-sync-key-2012';
    my $user      = 'amavis';
    my $t_timeout = 5 * 60;
    my $rc        = 0;
    my $cnt       = 0;
    my $steps     = 100;

    $verbose = shift || 1;

    my $db = {
        'user' => 'quarantine',
        'pass' => 'my%quar',
        'host' => 'localhost',
        'db'   => 'quarantine',
    };

    #
    # Retrieve the quarantine database files from the two mail hosts.
    #
    foreach my $host (qw(mail01 mail02)) {
        print "rsyncing: $host\n";
        my $cmd = sprintf("/usr/bin/rsync -avq -e 'ssh -c blowfish -l %s -i %s' --delete %s %s",
            $user,
            $key,
            $host . '.myhost.com:/var/spool/amavisd/qdb/*.db',
            $basedir . '/' . $host . '/'
        );
        if (system($cmd) != 0) {
            print STDERR "An error occurred while syncing $host: $!\n";
            $rc = 1;
            goto done;
        }
    }

    #
    # Connect to the MySQL database and empty out the 'entry' table.
    #
    my $dbh = DBI->connect("DBI:mysql:database=$db->{'db'};host=$db->{'host'}",
        $db->{'user'}, $db->{'pass'}, { 'PrintError' => 1, });

    $dbh->do('DELETE FROM entry');

    #
    # Walk each host's 256 hashed db files and insert every record
    # into the 'entry' table.
    #
    foreach my $host (qw(mail01 mail02)) {
        my $src = $basedir . '/' . $host;
        print "processing: $src\n";

        for (my $i = 0; $i < 256; $i++) {
            my $file = sprintf('%s/%02x.db', $src, $i);
            my %hash;

            eval {
                $SIG{'ALRM'} = sub { die "alarm\n" };
                alarm($t_timeout);

                unless (tie(%hash, 'DB_File', $file, O_RDONLY, 0600)) {
                    print STDERR "Can't open db '$file': $!\n";
                    $rc = 1;
                    next;
                }

                foreach my $filename (keys %hash) {
                    next unless $hash{$filename};

                    my @tmp   = split /\t/, $hash{$filename}, 7;
                    my $type  = $tmp[0];
                    my $dt    = $tmp[1];
                    my $size  = $tmp[2];
                    my $from  = $tmp[3];
                    my $to    = $tmp[4];
                    my $subj  = $tmp[5];
                    my $score = $tmp[6];

                    if ($type eq "virus") {
                        $score = 100;
                    }

                    my $datestamp = sprintf('%04d-%02d-%02d %02d:%02d:%02d',
                        ($dt =~ m/^([0-9]{4})([0-9]{2})([0-9]{2})T([0-9]{2})([0-9]{2})([0-9]{2}).*$/));

                    if ($host eq '')      { $host      = "NOTDEFINED"; }
                    if ($filename eq '')  { $filename  = "NOTDEFINED"; }
                    if ($datestamp eq '') { $datestamp = "NOTDEFINED"; }
                    if ($type eq '')      { $type      = "NOTDEFINED"; }
                    if ($size eq '')      { $size      = "NOTDEFINED"; }
                    if ($from eq '')      { $from      = "NOTDEFINED"; }
                    if ($to eq '')        { $to        = "NOTDEFINED"; }
                    if ($subj eq '')      { $subj      = "NOTDEFINED"; }
                    if ($score eq '')     { $score     = "NOTDEFINED"; }

                    printf("|%.15s|%.20s|%.20s|%.04s|%0.04s|%0.10s|%0.10s|%0.20s|%0.05s|\n",
                        $host, $filename, $datestamp, $type, $size, $from, $to, $subj, $score);

                    my $sql = sprintf('INSERT INTO entry VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s)',
                        $dbh->quote($host),
                        $dbh->quote($filename),
                        $dbh->quote($datestamp),
                        $dbh->quote($type),
                        $dbh->quote($size),
                        $dbh->quote($from),
                        $dbh->quote($to),
                        $dbh->quote($subj),
                        $dbh->quote($score)
                    );
                    $dbh->do($sql);

                    if (($cnt % $steps) == 0) {
                        DBG(sprintf("\e[8D%-8d\n", $cnt));
                    }
                    $cnt++;
                } # end inside foreach

                print "\n";
                untie %hash;
                alarm(0);
            };

            if ($@ eq "alarm\n") {
                print STDERR "Can't open db '$file' (timeout): $!\n";
                $rc = 1;
                next;
            }
        } # end for hash for each host
    } # end foreach

    #
    # Disconnect from the database, remove our lock file, and exit quietly.
    #
    done:
    $dbh->disconnect() if ($dbh);
    unlink $lockfile;
    exit 0;

    ############################################################################

    sub DBG($) {
        my $msg = shift;
        print $msg;
    }

Replies are listed 'Best First'.
Re: Optimizing mysql and inserts
by davido (Cardinal) on Jul 11, 2012 at 17:23 UTC

    You ought to start with Devel::NYTProf to verify that it really is your inserts that are chewing up so much time. You're probably right, but it's good to know for sure before investing time in optimizing. However, there is one thing that stands out for me:

    This snippet:

        my $sql = sprintf('INSERT INTO entry VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s)',
            $dbh->quote($host),
            $dbh->quote($filename),
            $dbh->quote($datestamp),
            $dbh->quote($type),
            $dbh->quote($size),
            $dbh->quote($from),
            $dbh->quote($to),
            $dbh->quote($subj),
            $dbh->quote($score)
        );

    ...could be more easily (and no less efficiently) written like this:

        my $sth = $dbh->prepare( 'INSERT INTO entry VALUES(?,?,?,?,?,?,?,?,?)' );
        $sth->execute( $host, $filename, $datestamp, $type, $size, $from,
                       $to, $subj, $score );

    But now for an efficiency improvement: Since this method separates the preparation of the SQL from the binding of the values, you can move the prepare() statement outside of the loop. Do the prepare(...) before you start looping, and then inside the loop just keep calling $sth->execute() as many times as needed.

    The reason this is more efficient is that your SQL statement won't have to be re-parsed by the server every time you do an insert; it is prepared once and reused over and over, with a fresh set of bind values each time.
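
    Concretely, a minimal sketch of that restructuring, reusing the variables and the nine-column entry layout from your existing INSERT:

        # Prepare once, before the host/file loops begin.
        my $sth = $dbh->prepare(
            'INSERT INTO entry VALUES(?,?,?,?,?,?,?,?,?)'
        );

        foreach my $filename (keys %hash) {
            # ... split the record into its fields as before ...

            # Re-execute the already-compiled statement with fresh
            # bind values on every iteration.
            $sth->execute( $host, $filename, $datestamp, $type, $size,
                           $from, $to, $subj, $score );
        }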

    There are other techniques too. For example, MySQL has a "LOAD DATA INFILE" statement that can pull your data into a table quickly from a file. This should be significantly faster, with one big caveat: if your data isn't already in a file, dumping it to one first has a cost of its own. That's your situation: you're not reading from a file, but from another database. It would probably not be worth spilling into an output file unless your source database also has a super-fast means of bulk dumping (something like "SELECT INTO OUTFILE"). Even then, you would have to benchmark to see whether the savings are enough to be worthwhile.
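
    As a rough sketch of that route, assuming the fields never contain tabs and that your connection enables mysql_local_infile=1 in its DSN (both are assumptions on my part; the spool path is made up):

        # Before the loops: open a spool file.
        open my $out, '>', '/tmp/entry.tsv'
            or die "open '/tmp/entry.tsv': $!";

        # Inside the record loop, instead of the INSERT:
        print {$out} join("\t", $host, $filename, $datestamp, $type,
                          $size, $from, $to, $subj, $score), "\n";

        # After the loops: bulk-load the whole file in one statement.
        close $out;
        $dbh->do(q{
            LOAD DATA LOCAL INFILE '/tmp/entry.tsv'
            INTO TABLE entry
            FIELDS TERMINATED BY '\t'
            LINES TERMINATED BY '\n'
        });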

    Another technique is to tell MySQL to delay updating its indices until after all of the inserts are done; a sketch follows below. But now we're getting into MySQL optimization, not Perl optimization. I recently read the O'Reilly book High Performance MySQL, and found it to be a nice resource. If you'll be doing more of this in the future, you might pick up a copy.
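
    For the index-deferral idea, a minimal sketch (this assumes the entry table uses the MyISAM engine, where DISABLE KEYS postpones non-unique index maintenance until re-enabled):

        $dbh->do('ALTER TABLE entry DISABLE KEYS');
        # ... run all of the inserts ...
        $dbh->do('ALTER TABLE entry ENABLE KEYS');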


    Dave

Re: Optimizing mysql and inserts
by ruzam (Curate) on Jul 11, 2012 at 18:24 UTC

    I regularly run scripts to rebuild MySQL tables from scratch and reload them with data records. A quick line count tells me about 1.2 million text records in various combinations of insert/delete/update. The text records are parsed and submitted as MySQL requests by Perl, one by one through a DBI connection, without the benefit of prepared statements. It takes roughly 10 minutes to process completely. This is running on my desktop box, nothing fancy. So unless you have a seriously underpowered computer (I doubt it), a network storage limitation, or database locking issues, I doubt that MySQL is your bottleneck.

    You need to find out where the code is spending its time. A few well-placed print statements should point you in the right direction fairly quickly.
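
    A minimal sketch of that, timing just the insert with Time::HiRes (the $ins_time accumulator is a name I've made up; $sql and the loop come from the original script):

        use Time::HiRes qw(time);

        my $ins_time = 0;

        # Inside the loop, around the suspect statement:
        my $t0 = time;
        $dbh->do($sql);
        $ins_time += time - $t0;

        # After the loop:
        printf STDERR "inserts: %.2f seconds total\n", $ins_time;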

      I've commented out the "$dbh->do($sql);" in my code, and it's substantially faster. That is definitely the bottleneck.

      ...few minutes later, after a few trial runs...

      I grabbed a my.cnf from a larger server and put it in place of the existing one, and it's orders of magnitude faster! Turns out the problem was in how MySQL was configured.

      I think it's much better now. I'm actually pretty proud of myself for this, and thanks for everyone's help as well!

      Thanks,
      Alex

        Not that it's Perl-related, but what kind of differences did you find in the my.cnf file?

Re: Optimizing mysql and inserts
by onelesd (Pilgrim) on Jul 11, 2012 at 17:21 UTC

    First you need to identify the bottleneck(s). Comment out all the lines with $dbh and see whether it speeds up or not.

    Also, TRUNCATE entry; will be faster than DELETE FROM entry;
