in reply to std dev calculations slow over time
I wouldn't be at all surprised if, were you to remove the StdDev calculations from the loop, the time taken for fetching the data from the db in 999,901 groups of 100 values stayed pretty much the same. Just issuing 999,901 queries, even simple ones, will consume a good proportion of the times you are seeing.
Also, I'm not sure how good SQLite is at caching queries and indexing data, but this query
my $sth_sel_hydro = $dbh->prepare(qq{ SELECT hydro_id, discharge, precipitation FROM hydro ORDER BY date_time LIMIT 100 OFFSET ? });
could be re-sorting and subsetting your dataset each time, which would explain the steadily increasing time for each cycle: it would have to skip over the first 1, 2, 3, ... 999,900 records in the sorted data on successive iterations. (With apologies in advance to the authors of SQLite if I'm wrong--but it would explain the timings.)
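If that is what's happening, an index on date_time would at least save SQLite a sort per query; better still is to drop the LIMIT/OFFSET altogether and pull everything in one ordered pass. A minimal sketch, assuming the same hydro table as your query above (the database filename is made up):

#! perl -slw
use strict;
use DBI;

# Filename is hypothetical; substitute your own database file.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=hydro.db', '', '',
    { RaiseError => 1 } );

# Optional: an index on the ORDER BY column saves SQLite a sort per query.
$dbh->do( q{ CREATE INDEX IF NOT EXISTS idx_hydro_date ON hydro ( date_time ) } );

# One ordered scan instead of 999,901 LIMIT/OFFSET queries:
# SQLite sorts (or walks the index on date_time) exactly once.
my $data = $dbh->selectall_arrayref( q{
    SELECT hydro_id, discharge, precipitation
    FROM   hydro
    ORDER  BY date_time
} );

# $data is now an AoA with the same layout as @data in the
# benchmark below: [ hydro_id, discharge, precipitation ].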
Given that the total input dataset will occupy around 150 MB, that the calculated result set will occupy a similar amount, and that calculating the running StdDev of groups of 100 over 1e6 items takes around 6 minutes when done entirely in Perl-land:
#! perl -slw
use strict;

use Benchmark::Timer;
use List::Util qw[ sum ];

# Column positions within each data record.
use constant {
    ID            => 0,
    DISCHARGE     => 1,
    PRECIPITATION => 2,
};

sub stdDev {
    my $ave = 0;
    $ave += $_ for @_;
    $ave /= @_;
    my $variance = sum( map +( $_ - $ave )**2, @_ ) / @_;
    return sqrt $variance;
}

our $G ||= 100;    # group (window) size; override with -G=nnn
our $N ||= 1e6;    # number of records;   override with -N=nnn

my $T = new Benchmark::Timer;

# Build N records of [ id, discharge, precipitation ] test data.
my @data = map { [ $_, rand 100, rand 10 ] } 1 .. $N;

my $label = "Moving Std Dev: $G of $N items";
$T->start( $label );

$G--;
my @stdDevs = map [], 1 .. $N - $G;

# For each window of 100 consecutive records, compute the std dev
# of both the discharge and the precipitation columns.
for my $item ( 0 .. $#data - $G ) {
    @{ $stdDevs[ $item ] } = (
        stdDev( map { $_->[ DISCHARGE     ] } @data[ $item .. $item + $G ] ),
        stdDev( map { $_->[ PRECIPITATION ] } @data[ $item .. $item + $G ] ),
    );
}

$T->stop( $label );
$T->report;

#printf "id:$_ StdDev(Discharge):$stdDevs[ $_ ][ 0 ] StdDev(Precipitation):$stdDevs[ $_ ][ 1 ]\n" for 0 .. $#stdDevs;

__END__
C:\test>579883 -N=1e3
1 trial of Moving Std Dev: 100 of 1e3 items (323.326ms total)

C:\test>579883 -N=1e4
1 trial of Moving Std Dev: 100 of 1e4 items (3.750s total)

C:\test>579883 -N=1e5
1 trial of Moving Std Dev: 100 of 1e5 items (36.063s total)

C:\test>579883 -N=1e6
1 trial of Moving Std Dev: 100 of 1e6 items (372.109s total)
And it scales pretty linearly as one would expect.
Fetching the data in one go, putting the results back in one go, and then fixing up the relations between the data and the stdDevs should be much, much quicker. Fetching large volumes of data from a db in gazillions of iddy-biddy chunks is never going to win any races.
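For the write-back half, something along these lines; this is only a sketch that continues from the fetch above, and the stddev table's column names are my guesses rather than anything from your post:

# Write every result back inside a single transaction, so SQLite
# journals and syncs once rather than once per row.
$dbh->begin_work;

# Guessed layout: one stddev row per window, keyed by the hydro_id
# that starts that window of 100.
my $sth_ins = $dbh->prepare( q{
    INSERT INTO stddev ( first_hydro_id, sd_discharge, sd_precipitation )
    VALUES ( ?, ?, ? )
} );

# $data is the AoA fetched earlier; @stdDevs as computed in the
# benchmark above.
for my $i ( 0 .. $#stdDevs ) {
    $sth_ins->execute( $data->[ $i ][ 0 ], @{ $stdDevs[ $i ] } );
}

$dbh->commit;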
BTW, on the subject of that hydro table -> stddev table relationship: maybe I am misreading your code, but it looks to me as if you are:
calculating the stddev for each group of 100 records and inserting it as a new row in the stddev table;

Then

updating the stddev_id of each of the 100 hydro records in that group to point at the new stddev row.
Which means that the stddev_ids for items 100 .. 999900 will each be set and reset 100 times? But each will obviously only retain the last value set.
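If so, turning the relationship around avoids all those redundant updates: give each stddev row the hydro_id that starts its window, and the hydro table never needs touching. A sketch of the idea (the schema is my invention, not taken from your code):

# One stddev row per window of 100, keyed by the hydro_id that
# starts the window; no stddev_id column in hydro at all.
$dbh->do( q{
    CREATE TABLE stddev (
        first_hydro_id   INTEGER PRIMARY KEY,
        sd_discharge     REAL,
        sd_precipitation REAL
    )
} );

# Finding the window that starts at a given hydro record is then a
# single keyed lookup:
my $sth_win = $dbh->prepare( q{
    SELECT sd_discharge, sd_precipitation
    FROM   stddev
    WHERE  first_hydro_id = ?
} );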