in reply to Re^6: speeding up row by row lookup in a large db
in thread speeding up row by row lookup in a large db

Update: Because, in the next iteration of the project, the number of cells is going to jump to about 1.9 million, and the weather data set itself might also grow (currently we have 20 years of data, hence 20*365 = 7300 rows). Hence all the focus on reducing the time to retrieve all the data for any given cell.

As a quick test, I increased the weather datasets to 50 years (50*365 rows x 9 values) and the number of cells to 3 million. I also packed the rows (Corion's idea!) to save a little space and time. The result is that you'd need 1.4 GB of RAM to hold the data; 25 seconds to load it all; and just under 19 seconds to process all 3 million:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $N ||= 1e3;

my $start1 = time;

# Map cell number -> met dataset number.
my @cells;
open CELLS, '<', 'cells.dat' or die $!;
m[(\d+)\s+(\d+)] and $cells[ $1 ] = $2 while <CELLS>;
close CELLS;

# Slurp the 400 met files; each row becomes a string of packed 32-bit
# unsigned ints. Element 0 of @met is a dummy so that met numbers
# 1 .. 400 index @met directly.
my @met = [];
for my $met ( 1 .. 400 ) {
    open IN, '<', sprintf "met%04d.dat", $met or die "dat $met : $!";
    local $/;
    my @data;
    $#data = 7300;
    @data = map{ pack 'V*', split ', ', $_ } map{ split "\n" } <IN>;
    close IN;
    push @met, \@data;
}

printf 'All data loaded in %.2f seconds', time() - $start1;
<>;    # pause so the load time can be read

# Time $N random lookups: a random row and column for each cell.
my $start2 = time;
for my $cell ( 1 .. $N ) {
    my $row = int rand 7300;
    my $col = int rand 9;
    my $rowData = $met[ $cells[ $cell ] ][ $row ];
    my @values = unpack 'V*', $rowData;
    my $value = $values[ $col ];
}
printf "Accessed $N met datasets at a rate of %.2f\n", $N / ( time - $start2 );

__END__
c:\test\752472>752472 -N=3e6
All data loaded in 24.97 seconds
Accessed 3e6 met datasets at a rate of 161281.65
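
For reference, the row packing is plain pack/unpack with the 'V*' template (unsigned 32-bit little-endian integers), so it assumes every weather value fits in an unsigned 32-bit integer. A minimal, self-contained sketch with made-up numbers:

use strict;
use warnings;

# One line of a met file: 9 comma-separated values (made-up numbers here).
my $line = '101, 57, 0, 3, 812, 64, 7, 220, 15';

# Pack the row into 9 x 4 = 36 bytes of binary instead of the text form.
my $packed = pack 'V*', split ', ', $line;

# Either unpack the whole row ...
my @values = unpack 'V*', $packed;

# ... or pull a single column out directly with substr, avoiding the
# cost of unpacking the other eight values.
my $col   = 4;
my $value = unpack 'V', substr $packed, $col * 4, 4;    # 812

print "column $col = $value, row = @values\n";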

That still leaves plenty of headroom on your 32 GB machine, even if you stick with a 32-bit process, and it is orders of magnitude faster than any DB solution will achieve.

If you need to go faster still, you could store and load the datasets as 2D piddles (which would also save a little memory, though that doesn't seem to be a problem as it stands) and do your math using PDL, which would certainly benefit performance if you are doing anything like the typical manipulations applied to weather data.
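
To give a feel for the PDL route, here is a minimal sketch; the data and names are made up, and it assumes one met dataset held as a 7300-row x 9-column piddle:

use strict;
use warnings;
use PDL;

# Hypothetical stand-in for one met dataset: 7300 rows x 9 columns.
# pdl() on an array of array-refs gives a 2-D piddle with dims (9, 7300):
# dim 0 runs along the 9 columns, dim 1 along the 7300 rows.
my @rows = map { [ map { $_ + rand } 1 .. 9 ] } 1 .. 7300;
my $met  = pdl( \@rows );

# Single lookup, equivalent to the packed-row access above.
my ( $row, $col ) = ( 1234, 5 );
my $value = $met->at( $col, $row );

# Vectorised math over a whole column -- e.g. the mean of column 5
# across all 7300 rows -- is where PDL starts to pay off.
my $column = $met->slice("($col),:");    # 1-D piddle of 7300 values
printf "value = %.3f, column mean = %.3f\n", $value, $column->avg;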


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^8: speeding up row by row lookup in a large db
by etj (Priest) on Jun 03, 2022 at 22:10 UTC
    There is a PDL module PDL::IO::DBI to make this straightforward. To make it scale Really Big, you'd probably need to read the data into something like CSV, then transform it into packed data, then memory-map that (on disk) using PDL::IO::FastRaw. Pull requests to make that more automatic would be most welcome!
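    For the memory-mapping step, a rough sketch using PDL::IO::FastRaw (file name hypothetical, zeroes() standing in for real data) could look like this:

        use strict;
        use warnings;
        use PDL;
        use PDL::IO::FastRaw;

        # One-off conversion: write a piddle to disk in PDL's raw format.
        # writefraw() creates met0001.pdl plus a tiny met0001.pdl.hdr header.
        my $met = zeroes( 9, 7300 );          # stand-in for real weather data
        writefraw( $met, 'met0001.pdl' );

        # Later runs: memory-map the raw file instead of reading it all in.
        # Pages are only faulted in as they are touched, so start-up is nearly
        # instant and the OS can share the mapping between processes.
        my $mapped = mapfraw( 'met0001.pdl', { ReadOnly => 1 } );
        print $mapped->at( 5, 1234 ), "\n";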