in reply to std dev calculations slow over time
I wouldn't be at all surprised if, were you to remove the StdDev calculations from the loop, the time taken for fetching the data from the db in 999,901 groups of 100 values stayed pretty much the same. Just issuing 999,901 queries, even simple ones, will consume a good proportion of the times you are seeing.
Also, I'm not sure how good SQLite is at caching queries and indexing data, but this query
my $sth_sel_hydro = $dbh->prepare(qq{ SELECT hydro_id, discharge, precipitation FROM hydro ORDER BY date_time LIMIT 100 OFFSET ? });
could be re-sorting and subsetting your dataset each time, which would explain the steadily increasing time for each cycle: it would have to skip over the first 1, 2, 3, ... 999,900 records in the sorted data on successive iterations. (With apologies in advance to the authors of SQLite if I'm wrong--but it would explain the timings.)
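If that is what's happening, an index on date_time would at least save SQLite a sort per query; better still is to drop the LIMIT/OFFSET altogether and pull everything in one ordered pass. A minimal sketch, assuming the same hydro table as your query above (the database filename is made up):

#! perl -slw
use strict;
use DBI;

# Filename is hypothetical; substitute your own database file.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=hydro.db', '', '',
    { RaiseError => 1 } );

# Optional: an index on the ORDER BY column saves SQLite a sort per query.
$dbh->do( q{ CREATE INDEX IF NOT EXISTS idx_hydro_date ON hydro ( date_time ) } );

# One ordered scan instead of 999,901 LIMIT/OFFSET queries:
# SQLite sorts (or walks the index on date_time) exactly once.
my $data = $dbh->selectall_arrayref( q{
    SELECT hydro_id, discharge, precipitation
    FROM   hydro
    ORDER  BY date_time
} );

# $data is now an AoA with the same layout as @data in the
# benchmark below: [ hydro_id, discharge, precipitation ].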
Given that the total input dataset will occupy around 150 MB, that the calculated result set will occupy a similar amount, and that calculating the running StdDev of groups of 100 over 1e6 items takes around 6 minutes when done entirely in Perl-land:
#! perl -slw
use strict;

use Benchmark::Timer;
use List::Util qw[ sum ];

# Column positions within each data record.
use constant {
    ID            => 0,
    DISCHARGE     => 1,
    PRECIPITATION => 2,
};

sub stdDev {
    my $ave = 0;
    $ave += $_ for @_;
    $ave /= @_;
    my $variance = sum( map +( $_ - $ave )**2, @_ ) / @_;
    return sqrt $variance;
}

our $G ||= 100;    # group (window) size; override with -G=nnn
our $N ||= 1e6;    # number of records;   override with -N=nnn

my $T = new Benchmark::Timer;

# Build N records of [ id, discharge, precipitation ] test data.
my @data = map { [ $_, rand 100, rand 10 ] } 1 .. $N;

my $label = "Moving Std Dev: $G of $N items";
$T->start( $label );

$G--;
my @stdDevs = map [], 1 .. $N - $G;

# For each window of 100 consecutive records, compute the std dev
# of both the discharge and the precipitation columns.
for my $item ( 0 .. $#data - $G ) {
    @{ $stdDevs[ $item ] } = (
        stdDev( map { $_->[ DISCHARGE     ] } @data[ $item .. $item + $G ] ),
        stdDev( map { $_->[ PRECIPITATION ] } @data[ $item .. $item + $G ] ),
    );
}

$T->stop( $label );
$T->report;

#printf "id:$_ StdDev(Discharge):$stdDevs[ $_ ][ 0 ] StdDev(Precipitation):$stdDevs[ $_ ][ 1 ]\n" for 0 .. $#stdDevs;

__END__
C:\test>579883 -N=1e3
1 trial of Moving Std Dev: 100 of 1e3 items (323.326ms total)

C:\test>579883 -N=1e4
1 trial of Moving Std Dev: 100 of 1e4 items (3.750s total)

C:\test>579883 -N=1e5
1 trial of Moving Std Dev: 100 of 1e5 items (36.063s total)

C:\test>579883 -N=1e6
1 trial of Moving Std Dev: 100 of 1e6 items (372.109s total)
And it scales pretty linearly as one would expect.
Fetching the data in one go, putting the results back in one go, and then fixing up the relations between the data and the stdDevs should be much, much quicker. Fetching large volumes of data from a db in gazillions of iddy-biddy chunks is never going to win any races.
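For the write-back half, something along these lines; this is only a sketch that continues from the fetch above, and the stddev table's column names are my guesses rather than anything from your post:

# Write every result back inside a single transaction, so SQLite
# journals and syncs once rather than once per row.
$dbh->begin_work;

# Guessed layout: one stddev row per window, keyed by the hydro_id
# that starts that window of 100.
my $sth_ins = $dbh->prepare( q{
    INSERT INTO stddev ( first_hydro_id, sd_discharge, sd_precipitation )
    VALUES ( ?, ?, ? )
} );

# $data is the AoA fetched earlier; @stdDevs as computed in the
# benchmark above.
for my $i ( 0 .. $#stdDevs ) {
    $sth_ins->execute( $data->[ $i ][ 0 ], @{ $stdDevs[ $i ] } );
}

$dbh->commit;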
BTW, on the subject of that hydro table -> stddev table relationship: maybe I am misreading your code, but it looks to me as if you are:
calculating the stddev for each group of 100 records and inserting it as a new row in the stddev table;

Then

updating the stddev_id of each of the 100 hydro records in that group to point at the new stddev row.
Which means that the stddev_ids for items 100 .. 999900 will each be set and reset 100 times? But each will obviously only retain the last value set.
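If so, turning the relationship around avoids all those redundant updates: give each stddev row the hydro_id that starts its window, and the hydro table never needs touching. A sketch of the idea (the schema is my invention, not taken from your code):

# One stddev row per window of 100, keyed by the hydro_id that
# starts the window; no stddev_id column in hydro at all.
$dbh->do( q{
    CREATE TABLE stddev (
        first_hydro_id   INTEGER PRIMARY KEY,
        sd_discharge     REAL,
        sd_precipitation REAL
    )
} );

# Finding the window that starts at a given hydro record is then a
# single keyed lookup:
my $sth_win = $dbh->prepare( q{
    SELECT sd_discharge, sd_precipitation
    FROM   stddev
    WHERE  first_hydro_id = ?
} );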