std dev calculations slow over time

by punkish (Priest)
on Oct 22, 2006 at 18:28 UTC [id://579883]

punkish has asked for the wisdom of the Perl Monks concerning the following question:

I am calculating the std dev of hydrologic values using a "rolling window" of 100 points. The entire data set has 1 million rows. I am using SQLite, ActiveState Perl 5.8.8, and Math::NumberCruncher. The problem is that I am experiencing very slow performance, and, annoyingly, the performance decreases over time, as the output below shows. Perhaps your eyes can point out my stupidity, if any, in the following real code --

sub calcStdDev {
    my ($dbh) = @_;

    #-------------------------------------------------------------------
    # Get the count of records in the hydro table
    my $sth_sel_hydro_count = $dbh->prepare(qq{
        SELECT Count(*) - 1 AS foo FROM hydro
    });
    $sth_sel_hydro_count->execute;
    my @res = $sth_sel_hydro_count->fetchrow_array;
    my $count_hydro = $res[0];
    #-------------------------------------------------------------------

    #-------------------------------------------------------------------
    ## Three statements
    # First, to select the hydro data 100 rows at a time
    my $sth_sel_hydro = $dbh->prepare(qq{
        SELECT hydro_id, discharge, precipitation
        FROM hydro
        ORDER BY date_time
        LIMIT 100 OFFSET ?
    });

    # Second, to insert std dev values
    my $sth_ins_stddev = $dbh->prepare(qq{
        INSERT INTO stddev (stddev_discharge, stddev_precipitation)
        VALUES (?, ?)
    });

    # Third, to update the hydro table with the stddev id just created
    my $sth_upd_hydro = $dbh->prepare(qq{
        UPDATE hydro
        SET stddev_id = ?
        WHERE hydro_id IN (?)
    });
    #-------------------------------------------------------------------

    my $mnc = Math::NumberCruncher->new();
    my $ta  = new Benchmark;

    for my $window (0 .. $count_hydro) {
        $sth_sel_hydro->execute($window);    ###

        # Arrays to hold values
        my @hydro_id;
        my @discharge;
        my @precipitation;

        while (my $row = $sth_sel_hydro->fetchrow_arrayref) {
            push(@hydro_id,      $row->[0]);
            push(@discharge,     $row->[1]);
            push(@precipitation, $row->[2]);
        }
        $sth_sel_hydro->finish;

        # Returns the Standard Deviation of @array, which is a
        # measurement of how diverse the data are
        my $sd_discharge = $mnc->StandardDeviation(\@discharge,     6);
        my $sd_precip    = $mnc->StandardDeviation(\@precipitation, 6);

        $sth_ins_stddev->execute($sd_discharge, $sd_precip);
        my $sd_id = $dbh->func('last_insert_rowid');

        # Create a comma-separated string suitable for SQL IN operator
        my $hydro_id_str = join(',', @hydro_id);    ###
        $sth_upd_hydro->execute($sd_id, $hydro_id_str);

        # Print out the progress every 10,000 rows, and commit to the db
        if (! ($window % 10000)) {
            my $tb = new Benchmark;
            print "$window [" . timestr(timediff($tb, $ta)) . "]\n";
            $ta = new Benchmark;
            $dbh->commit;
        }
    }

    $sth_ins_stddev->finish;
    $sth_upd_hydro->finish;
    $dbh->commit;
}

>perl std.pl
Started Sun Oct 22 12:38:58 2006...
0 [ 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)]
10000 [100 wallclock secs (81.09 usr + 4.18 sys = 85.26 CPU)]
20000 [159 wallclock secs (136.73 usr + 5.18 sys = 141.90 CPU)]
30000 [228 wallclock secs (193.98 usr + 5.62 sys = 199.60 CPU)]
40000 [261 wallclock secs (247.19 usr + 5.67 sys = 252.85 CPU)]
50000 [315 wallclock secs (301.86 usr + 5.75 sys = 307.61 CPU)]
60000 [377 wallclock secs (356.05 usr + 13.11 sys = 369.16 CPU)]
70000 [497 wallclock secs (414.08 usr + 59.97 sys = 474.04 CPU)]
80000 [610 wallclock secs (471.53 usr + 88.27 sys = 559.80 CPU)]

Update: I should mention that the same calculations (minus the db part) take about 10 minutes in Matlab. Granted, I have the database in the picture here, but the speed is still pathetic, and it is particularly vexing that the process slows down with every 10,000 windows.

Update2: per rhesa's eagle eye, I have corrected a couple of errata in the code; the corrected lines are marked with ###. I was going to post pseudo-code, but the above is now the actual, progressively slowing, real code.

--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: std dev calculations slow over time
by brian_d_foy (Abbot) on Oct 22, 2006 at 18:53 UTC

    I don't know if this is your main problem, but performance for SQLite is poor when doing multiple inserts (see the performance notes in DBD::SQLite). Wrap all of those inserts in a transaction and commit them all at once. I see you have some commit calls in there, but I didn't see where you set up a transaction.
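    A minimal sketch of that batching (illustrative only; it assumes the handle was connected with AutoCommit => 1, because begin_work croaks if AutoCommit is already off, and @stddev_pairs is a hypothetical list of precomputed values):

        # Group many inserts into a few large transactions instead of
        # paying the per-statement commit cost every time.
        my $n = 0;
        $dbh->begin_work;                  # open the first transaction
        for my $pair (@stddev_pairs) {
            $sth_ins_stddev->execute(@$pair);
            if (++$n % 10_000 == 0) {
                $dbh->commit;              # flush one batch to disk
                $dbh->begin_work;          # open the next one
            }
        }
        $dbh->commit;                      # commit the final partial batch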

    Also, if you think the database might be the culprit, you can profile it with DBI::Profile. I give some examples of that in my Profiling chapter in Mastering Perl. You might also try some of the other profiling techniques to identify other slow parts.
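    The quickest way into DBI::Profile is the DBI_PROFILE environment variable; no code changes are needed (a sketch, reusing the OP's script name):

        >set DBI_PROFILE=2        (on Win32; on unix: DBI_PROFILE=2 perl std.pl)
        >perl std.pl              (a per-statement timing summary prints at exit)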

    Good luck :)

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
      brian d foy doth spake

      >I see you have some commit calls in there, but
      >I didn't see where you set up a transaction.

      my $dbh = DBI->connect( "dbi:SQLite:$dsn", '', '',
          { RaiseError => 1, AutoCommit => 0 }
      ) or die "$DBI::errstr\n";

      I also used Devel::Profiler to do some investigating, but dprofpp tmon.out kept on croaking on me with

      >dprofpp tmon.out
      Modification of non-creatable array value attempted, subscript -1 at
      C:\Perl\bin/dprofpp.bat line 717, <fh> line 1250.

      so I gave up that route because I really didn't know wtf that was all about.

      --

      when small people start casting long shadows, it is time to go to bed
        Transactions (in any database supported by DBI) are handled with the $dbh->begin_work and $dbh->commit methods. See the "Transactions" section of the DBI documentation.
        my $dbh = DBI->connect( "dbi:SQLite:$dsn", '','',
        { RaiseError => 1, AutoCommit => 0 } )
        or die "$DBI::errstr\n";
        I found with Win32 ActivePerl that explicitly setting $dbh->{AutoCommit} = 0; here helps.
Re: std dev calculations slow over time
by grep (Monsignor) on Oct 22, 2006 at 18:53 UTC
    A couple of thoughts I had:
    Run it through Devel::DProf.
    But as a guess, my bet is that it would show DB access to be your biggest lag. I would suggest using a bigger DB like PostgreSQL or MySQL and then profiling your indices.
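    For reference, the usual Devel::DProf invocation looks like this (a sketch, reusing the OP's script name):

        >perl -d:DProf std.pl     (writes tmon.out in the current directory)
        >dprofpp tmon.out         (prints the most expensive subroutines)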

    I haven't used MySQL much in the last couple of years, but PostgreSQL offers the excellent explain tool.



    grep
    One dead unjugged rabbit fish later
Re: std dev calculations slow over time
by rhesa (Vicar) on Oct 22, 2006 at 19:01 UTC
    If I were you, I'd try to narrow down where the slowdown is actually occurring. Right now it isn't clear whether it is due to the database or to the calculations. My advice is to add more timing points first.

    My suspicion would fall on your arrays: they might be growing without you knowing it, which would cause the calculations to take ever longer.

    Some comments on your code: you seem to indicate it is pseudo code, so take this with a grain of salt.

    1. Your $sth_sel_hydro has a placeholder for the offset, but you never execute it here -- do you pass in a suitable value for the placeholder?
    2. The line where you construct the IN string mentions a $aref_hydro_id, but I don't see that declared or filled with data -- I assume you meant to use @hydro_id
Re: std dev calculations slow over time
by toma (Vicar) on Oct 22, 2006 at 22:17 UTC
    I'll guess that the progressive slowdown comes from the $sth_sel_hydro SQL, which has to do progressively more work to find the 100 rows that you want.

    I would select all the rows first and put the data into a text file, or RAM if you have enough of it.

    You mentioned the speed in Matlab being faster. The Matlab code may be smart enough to take advantage of the massive redundancy in your calculation. As you step through the array, your calculation operates on what is mostly the same list of numbers over and over; the only points that change are the first and last points in the window. So a clever routine detects this and does a much smaller calculation, in effect subtracting the oldest number from the running sum and adding the newest number to it. It stores the points in a circular buffer and avoids the work of recalculation. There are special forms of the statistical formulas for the average and standard deviation that allow a result to be incrementally updated. The formulas are in the Wikipedia article on Standard Deviation, in the section 'Rapid calculation methods'.
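    A minimal sketch of that incremental update (my illustration, not toma's code), using the running-sums form of the sample standard deviation:

        use strict;
        use warnings;

        # Rolling sample standard deviation over a fixed window, updated
        # incrementally: O(1) work per step instead of O(window) per step.
        # Caveat: the sum-of-squares form can lose precision when the data
        # have a large mean; Welford's method is the numerically safer variant.
        sub rolling_stddev {
            my ($values, $w) = @_;    # arrayref of numbers, window size
            my @sd;
            my ($sum, $sumsq) = (0, 0);

            for my $i (0 .. $#$values) {
                $sum   += $values->[$i];
                $sumsq += $values->[$i] ** 2;
                if ($i >= $w) {       # drop the value sliding out of the window
                    $sum   -= $values->[$i - $w];
                    $sumsq -= $values->[$i - $w] ** 2;
                }
                if ($i >= $w - 1) {   # window is full: emit a std dev
                    my $var = ($sumsq - $sum * $sum / $w) / ($w - 1);
                    push @sd, sqrt($var < 0 ? 0 : $var);  # clamp rounding error
                }
            }
            return \@sd;
        }

        my $sd = rolling_stddev([map { rand 100 } 1 .. 1_000], 100);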

    Also, the moving average is going to move very slowly from point to point. Usually, you don't need to know that many points in a moving average, and you don't really need to calculate them all. This is called decimation.

    It should work perfectly the first time! - toma
Re: std dev calculations slow over time
by BrowserUk (Patriarch) on Oct 23, 2006 at 06:15 UTC

    I wouldn't be at all surprised if, were you to remove the StdDev calculations from the loop, the time taken for fetching the data from the db in 999,901 groups of 100 values stayed pretty much the same. Just issuing 999,901 queries, even simple ones, will consume a good proportion of the times you are seeing.

    Also, I'm not sure how good SQLite is at caching queries and indexing data, but this query

    my $sth_sel_hydro = $dbh->prepare(qq{
        SELECT hydro_id, discharge, precipitation
        FROM hydro
        ORDER BY date_time
        LIMIT 100 OFFSET ?
    });

    could be re-sorting and subsetting your dataset each time, which would explain the steadily increasing time for each cycle: it would have to skip over the first 1, 2, 3, ... 999,900 records in the sorted data on successive iterations. (With apologies in advance to the authors of SQLite if I'm wrong -- but it would explain the timings.)

    Given that the total input dataset will occupy around 150 MB, that the calculated result set is a similar size, and that calculating the running StdDev of groups of 100 over 1e6 values takes around 6 minutes when done in Perl-land, the whole job fits comfortably in memory.

    And it scales pretty linearly, as one would expect.

    Fetching the data in one go, putting the results back in one go and then fixing up the relations between the data and the stdDevs should be much, much quicker.

    Fetching large volumes of data from a db in gazillions of iddy-biddy chunks is never going to win any races.
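    A minimal sketch of that one-pass shape (illustrative only; rolling_stddev stands in for an incremental routine like the one sketched under toma's reply, and $sth_ins_stddev is the OP's insert handle):

        # One query to pull everything, all the math in memory, and one
        # transaction to write the results back.
        my $rows = $dbh->selectall_arrayref(qq{
            SELECT hydro_id, discharge, precipitation
            FROM hydro
            ORDER BY date_time
        });

        my @discharge     = map { $_->[1] } @$rows;
        my @precipitation = map { $_->[2] } @$rows;

        my $sd_discharge = rolling_stddev(\@discharge,     100);
        my $sd_precip    = rolling_stddev(\@precipitation, 100);

        $dbh->begin_work;    # assumes AutoCommit => 1 at connect time
        for my $i (0 .. $#$sd_discharge) {
            $sth_ins_stddev->execute($sd_discharge->[$i], $sd_precip->[$i]);
            # ... fetch last_insert_rowid and fix up hydro.stddev_id here
        }
        $dbh->commit;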


    BTW, on the subject of that hydro table -> stddev table relationship: maybe I am misreading your code, but it looks to me as if you are:

    1. Calculating the stddevs for items 1 .. 100.
    2. Inserting those stddevs into the stddev table and retrieving the stddev_id.
    3. Updating items 1 .. 100 with the stddev_id.

    Then

    1. Calculating the stddevs for items 2 .. 101.
    2. Inserting those stddevs into the stddev table and retrieving the stddev_id.
    3. Updating items 2 .. 101 with the stddev_id.

    Which means that the stddev_id for each of items 100 .. 999900 will be set and reset up to 100 times, though each will obviously retain only the last value set?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: std dev calculations slow over time
by tphyahoo (Vicar) on Oct 22, 2006 at 19:43 UTC
Re: std dev calculations slow over time
by andreas1234567 (Vicar) on Oct 23, 2006 at 20:05 UTC
    Alternatively, if you choose to use a database, you can take advantage of MySQL's STD() or STDDEV() (or Oracle's STDDEV()) aggregate function directly.
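    For illustration, one window per query might look like this (my sketch, not andreas' posted code; it assumes hydro_id is a dense integer key in date_time order, and note that MySQL's STDDEV() is the population, not the sample, form):

        my $sth = $dbh->prepare(q{
            SELECT STDDEV(discharge), STDDEV(precipitation)
            FROM hydro
            WHERE hydro_id BETWEEN ? AND ?
        });
        for my $start (1 .. 999_901) {    # one full 100-row window per start
            my ($sd_d, $sd_p) =
                $dbh->selectrow_array($sth, undef, $start, $start + 99);
            # ... insert into stddev and update hydro as before
        }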

    However, this solution scales less well than BrowserUK's pure perl solution.

    Tested with mysql 4.1 on linux 2.6 and perl 5.8.5.

    --
    Andreas

      Nice++.

      How refreshing to see some working DB code posted and benchmarked rather than just alluded to.

      As for the scaling, the pure Perl approach only wins while the dataset fits in memory...beyond that, your DB approach wins hands down, without resorting to messy overlapping reads :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
