std dev calculations slow over time

by punkish (Priest)
on Oct 22, 2006 at 18:28 UTC [id://579883]

punkish has asked for the wisdom of the Perl Monks concerning the following question:

I am calculating the std dev of hydrologic values using a "rolling window" of 100 points. The entire data set has 1 million rows. I am using SQLite, ActiveState Perl 5.8.8, and Math::NumberCruncher. The problem is that I am experiencing very slow performance, and, annoyingly, the performance decreases over time, as the output below shows. Perhaps your eyes can point out my stupidity, if any, in the following real code --

sub calcStdDev {
    my ($dbh) = @_;

    #-------------------------------------------------------------------
    # Get the count of records in the hydro table
    my $sth_sel_hydro_count = $dbh->prepare(qq{
        SELECT Count(*) - 1 AS foo FROM hydro
    });
    $sth_sel_hydro_count->execute;
    my @res = $sth_sel_hydro_count->fetchrow_array;
    my $count_hydro = $res[0];
    #-------------------------------------------------------------------

    #-------------------------------------------------------------------
    ## Three statements
    # First, to select the hydro data 100 rows at a time
    my $sth_sel_hydro = $dbh->prepare(qq{
        SELECT hydro_id, discharge, precipitation
        FROM hydro
        ORDER BY date_time
        LIMIT 100 OFFSET ?
    });

    # Second, to insert std dev values
    my $sth_ins_stddev = $dbh->prepare(qq{
        INSERT INTO stddev (stddev_discharge, stddev_precipitation)
        VALUES (?, ?)
    });

    # Third, to update the hydro table with the stddev id just created
    my $sth_upd_hydro = $dbh->prepare(qq{
        UPDATE hydro
        SET stddev_id = ?
        WHERE hydro_id IN (?)
    });
    #-------------------------------------------------------------------

    my $mnc = Math::NumberCruncher->new();
    my $ta  = new Benchmark;

    for my $window (0 .. $count_hydro) {
        $sth_sel_hydro->execute($window);    ###

        # Arrays to hold values
        my @hydro_id;
        my @discharge;
        my @precipitation;

        while (my $row = $sth_sel_hydro->fetchrow_arrayref) {
            push(@hydro_id,      $row->[0]);
            push(@discharge,     $row->[1]);
            push(@precipitation, $row->[2]);
        }
        $sth_sel_hydro->finish;

        # Returns the Standard Deviation of @array, which is a
        # measurement of how diverse the data are
        my $sd_discharge = $mnc->StandardDeviation(\@discharge,     6);
        my $sd_precip    = $mnc->StandardDeviation(\@precipitation, 6);

        $sth_ins_stddev->execute($sd_discharge, $sd_precip);
        my $sd_id = $dbh->func('last_insert_rowid');

        # Create a comma-separated string suitable for SQL IN operator
        my $hydro_id_str = join(',', @hydro_id);    ###
        $sth_upd_hydro->execute($sd_id, $hydro_id_str);

        # Print out the progress every 10,000 rows, and commit to the db
        if (! ($window % 10000)) {
            my $tb = new Benchmark;
            print "$window [" . timestr(timediff($tb, $ta)) . "]\n";
            $ta = new Benchmark;
            $dbh->commit;
        }
    }

    $sth_ins_stddev->finish;
    $sth_upd_hydro->finish;
    $dbh->commit;
}

>perl std.pl
Started Sun Oct 22 12:38:58 2006...
0 [ 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)]
10000 [100 wallclock secs (81.09 usr + 4.18 sys = 85.26 CPU)]
20000 [159 wallclock secs (136.73 usr + 5.18 sys = 141.90 CPU)]
30000 [228 wallclock secs (193.98 usr + 5.62 sys = 199.60 CPU)]
40000 [261 wallclock secs (247.19 usr + 5.67 sys = 252.85 CPU)]
50000 [315 wallclock secs (301.86 usr + 5.75 sys = 307.61 CPU)]
60000 [377 wallclock secs (356.05 usr + 13.11 sys = 369.16 CPU)]
70000 [497 wallclock secs (414.08 usr + 59.97 sys = 474.04 CPU)]
80000 [610 wallclock secs (471.53 usr + 88.27 sys = 559.80 CPU)]

Update: I should mention that the same calculations (minus the db part) take about 10 minutes in Matlab. Granted, I have the database in the picture here, but the speed is still pathetic, and it is particularly vexing that the process slows down with every 10,000 windows.

Update2: per rhesa's eagle eye, I have corrected a couple of errata in the code; the corrected lines are marked with ###. I was going to post pseudo-code, but the above is now the actual, progressively slowing, real code.

--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: std dev calculations slow over time
by brian_d_foy (Abbot) on Oct 22, 2006 at 18:53 UTC

    I don't know if this is your main problem, but performance for SQLite is poor when doing multiple inserts (see the performance notes in DBD::SQLite). Wrap all of those inserts in a transaction and commit them all at once. I see you have some commit calls in there, but I didn't see where you set up a transaction.
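    A minimal sketch of that batching (illustrative only; it assumes the handle was connected with AutoCommit => 1, because begin_work croaks if AutoCommit is already off, and @stddev_pairs is a hypothetical list of precomputed values):

        # Group many inserts into a few large transactions instead of
        # paying the per-statement commit cost every time.
        my $n = 0;
        $dbh->begin_work;                  # open the first transaction
        for my $pair (@stddev_pairs) {
            $sth_ins_stddev->execute(@$pair);
            if (++$n % 10_000 == 0) {
                $dbh->commit;              # flush one batch to disk
                $dbh->begin_work;          # open the next one
            }
        }
        $dbh->commit;                      # commit the final partial batch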

    Also, if you think the database might be the culprit, you can profile it with DBI::Profile. I give some examples of that in my Profiling chapter in Mastering Perl. You might also try some of the other profiling techniques to identify other slow parts.
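    The quickest way into DBI::Profile is the DBI_PROFILE environment variable; no code changes are needed (a sketch, reusing the OP's script name):

        >set DBI_PROFILE=2        (on Win32; on unix: DBI_PROFILE=2 perl std.pl)
        >perl std.pl              (a per-statement timing summary prints at exit)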

    Good luck :)

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
      brian d foy doth spake

      >I see you have some commit calls in there, but
      >I didn't see where you set up a transaction.

      my $dbh = DBI->connect( "dbi:SQLite:$dsn", '', '',
          { RaiseError => 1, AutoCommit => 0 }
      ) or die "$DBI::errstr\n";

      I also used Devel::Profiler to do some investigating, but dprofpp tmon.out kept on croaking on me with

      >dprofpp tmon.out
      Modification of non-creatable array value attempted, subscript -1 at
      C:\Perl\bin/dprofpp.bat line 717, <fh> line 1250.

      so I gave up that route because I really didn't know wtf that was all about.

      --

      when small people start casting long shadows, it is time to go to bed
        Transactions (in any database supported by DBI) are handled with the $dbh->begin_work and $dbh->commit methods. See the "Transactions" section of the DBI documentation.
        my $dbh = DBI->connect( "dbi:SQLite:$dsn", '','',
        { RaiseError => 1, AutoCommit => 0 } )
        or die "$DBI::errstr\n";
        I found with Win32 ActivePerl that explicitly setting $dbh->{AutoCommit} = 0; here helps.
Re: std dev calculations slow over time
by grep (Monsignor) on Oct 22, 2006 at 18:53 UTC
    A couple of thoughts I had:
    Run it through Devel::DProf.
    But as a guess, my bet is that it would show DB access to be your biggest lag. I would suggest using a bigger DB like PostgreSQL or MySQL and then profiling your indices.
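    For reference, the usual Devel::DProf invocation looks like this (a sketch, reusing the OP's script name):

        >perl -d:DProf std.pl     (writes tmon.out in the current directory)
        >dprofpp tmon.out         (prints the most expensive subroutines)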

    I haven't used MySQL much in the last couple of years, but PostgreSQL offers the excellent explain tool.



    grep
    One dead unjugged rabbit fish later
Re: std dev calculations slow over time
by rhesa (Vicar) on Oct 22, 2006 at 19:01 UTC
    If I were you, I'd try to narrow down where the slowdown is actually occurring. Right now it isn't clear whether it is due to the database or to the calculations. My advice is to add more timing points first.

    My suspicion would fall on your arrays: they might be growing without you knowing it, which would cause the calculations to take ever longer.

    Some comments on your code: you seem to indicate it is pseudo code, so take this with a grain of salt.

    1. Your $sth_sel_hydro has a placeholder for the offset, but you never execute it here -- do you pass in a suitable value for the placeholder?
    2. The line where you construct the IN string mentions a $aref_hydro_id, but I don't see that declared or filled with data -- I assume you meant to use @hydro_id
Re: std dev calculations slow over time
by toma (Vicar) on Oct 22, 2006 at 22:17 UTC
    I'll guess that the progressive slowdown comes from the $sth_sel_hydro SQL, which has to do progressively more work to find the 100 rows that you want.

    I would select all the rows first and put the data into a text file, or RAM if you have enough of it.

    You mentioned the speed in Matlab being faster. The Matlab code may be smart enough to take advantage of the massive redundancy in your calculation. As you step through the array, your calculation operates on what is mostly the same list of numbers over and over; the only points that change are the first and last points in the window. So a clever routine detects this and does a much smaller calculation, in effect subtracting the oldest number from the running sum and adding the newest number to it. It stores the points in a circular buffer and avoids the work of recalculation. There are special forms of the statistical formulas for the average and standard deviation that allow a result to be incrementally updated. The formulas are in the Wikipedia article on Standard Deviation, in the section 'Rapid calculation methods'.
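    A minimal sketch of that incremental update (my illustration, not toma's code), using the running-sums form of the sample standard deviation:

        use strict;
        use warnings;

        # Rolling sample standard deviation over a fixed window, updated
        # incrementally: O(1) work per step instead of O(window) per step.
        # Caveat: the sum-of-squares form can lose precision when the data
        # have a large mean; Welford's method is the numerically safer variant.
        sub rolling_stddev {
            my ($values, $w) = @_;    # arrayref of numbers, window size
            my @sd;
            my ($sum, $sumsq) = (0, 0);

            for my $i (0 .. $#$values) {
                $sum   += $values->[$i];
                $sumsq += $values->[$i] ** 2;
                if ($i >= $w) {       # drop the value sliding out of the window
                    $sum   -= $values->[$i - $w];
                    $sumsq -= $values->[$i - $w] ** 2;
                }
                if ($i >= $w - 1) {   # window is full: emit a std dev
                    my $var = ($sumsq - $sum * $sum / $w) / ($w - 1);
                    push @sd, sqrt($var < 0 ? 0 : $var);  # clamp rounding error
                }
            }
            return \@sd;
        }

        my $sd = rolling_stddev([map { rand 100 } 1 .. 1_000], 100);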

    Also, the moving average is going to move very slowly from point to point. Usually, you don't need to know that many points in a moving average, and you don't really need to calculate them all. This is called decimation.

    It should work perfectly the first time! - toma
Re: std dev calculations slow over time
by BrowserUk (Patriarch) on Oct 23, 2006 at 06:15 UTC

    I wouldn't be at all surprised if, were you to remove the StdDev calculations from the loop, the time taken for fetching the data from the db in 999,901 groups of 100 values stayed pretty much the same. Just issuing 999,901 queries, even simple ones, will consume a good proportion of the times you are seeing.

    Also, I'm not sure how good SQLite is at caching queries and indexing data, but this query

    my $sth_sel_hydro = $dbh->prepare(qq{
        SELECT hydro_id, discharge, precipitation
        FROM hydro
        ORDER BY date_time
        LIMIT 100 OFFSET ?
    });

    could be re-sorting and subsetting your dataset each time, which would explain the steadily increasing time for each cycle: it would have to skip over the first 1, 2, 3, ... 999,900 records in the sorted data on successive iterations. (With apologies in advance to the authors of SQLite if I'm wrong -- but it would explain the timings.)

    Given that the total input dataset will occupy around 150 MB, that the calculated result set is a similar size, and that calculating the running StdDev of groups of 100 over 1e6 values takes around 6 minutes when done in Perl-land, the whole job fits comfortably in memory.

    And it scales pretty linearly, as one would expect.

    Fetching the data in one go, putting the results back in one go and then fixing up the relations between the data and the stdDevs should be much, much quicker.

    Fetching large volumes of data from a db in gazillions of iddy-biddy chunks is never going to win any races.
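    A minimal sketch of that one-pass shape (illustrative only; rolling_stddev stands in for an incremental routine like the one sketched under toma's reply, and $sth_ins_stddev is the OP's insert handle):

        # One query to pull everything, all the math in memory, and one
        # transaction to write the results back.
        my $rows = $dbh->selectall_arrayref(qq{
            SELECT hydro_id, discharge, precipitation
            FROM hydro
            ORDER BY date_time
        });

        my @discharge     = map { $_->[1] } @$rows;
        my @precipitation = map { $_->[2] } @$rows;

        my $sd_discharge = rolling_stddev(\@discharge,     100);
        my $sd_precip    = rolling_stddev(\@precipitation, 100);

        $dbh->begin_work;    # assumes AutoCommit => 1 at connect time
        for my $i (0 .. $#$sd_discharge) {
            $sth_ins_stddev->execute($sd_discharge->[$i], $sd_precip->[$i]);
            # ... fetch last_insert_rowid and fix up hydro.stddev_id here
        }
        $dbh->commit;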


    BTW, on the subject of that hydro table -> stddev table relationship: maybe I am misreading your code, but it looks to me as if you are:

    1. Calculating the stddevs for items 1 .. 100.
    2. Inserting those stddevs into the stddev table and retrieving the stddev_id.
    3. Updating items 1 .. 100 with the stddev_id.

    Then

    1. Calculating the stddevs for items 2 .. 101.
    2. Inserting those stddevs into the stddev table and retrieving the stddev_id.
    3. Updating items 2 .. 101 with the stddev_id.

    Which means that the stddev_id for each of items 100 .. 999900 will be set and reset up to 100 times, though each will obviously retain only the last value set?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: std dev calculations slow over time
by tphyahoo (Vicar) on Oct 22, 2006 at 19:43 UTC
Re: std dev calculations slow over time
by andreas1234567 (Vicar) on Oct 23, 2006 at 20:05 UTC
    Alternatively, if you choose to use a database, you can take advantage of MySQL's STD() or STDDEV() (or Oracle's STDDEV()) aggregate function directly.
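    For illustration, one window per query might look like this (my sketch, not andreas' posted code; it assumes hydro_id is a dense integer key in date_time order, and note that MySQL's STDDEV() is the population, not the sample, form):

        my $sth = $dbh->prepare(q{
            SELECT STDDEV(discharge), STDDEV(precipitation)
            FROM hydro
            WHERE hydro_id BETWEEN ? AND ?
        });
        for my $start (1 .. 999_901) {    # one full 100-row window per start
            my ($sd_d, $sd_p) =
                $dbh->selectrow_array($sth, undef, $start, $start + 99);
            # ... insert into stddev and update hydro as before
        }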

    However, this solution scales less well than BrowserUK's pure perl solution.

    Tested with mysql 4.1 on linux 2.6 and perl 5.8.5.

    --
    Andreas

      Nice++.

      How refreshing to see some working DB code posted and benchmarked rather than just alluded to.

      As for the scaling, the pure Perl approach only wins while the dataset fits in memory...beyond that, your DB approach wins hands down, without resorting to messy overlapping reads :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
