in reply to Re^6: speeding up row by row lookup in a large db
in thread speeding up row by row lookup in a large db

Hm. Then load the cell<->met index as one array, and the 400 x 7300 x 9 datasets as another 3D array.

In the following dummied up test, loading all the data takes around 45 seconds and 2.9 GB. That's space & comma separated ascii. You could certainly speed the loading up using Storable

And accessing one value within the datasets associated with each cell, happens at a rate of 27,300 cells per second. That means you could access (1 value for) all 1 million cells in just over 36 seconds.

Update: 800,000+ cells per second! Less than 2 seconds to process the million cells.

#! perl -slw use strict; use Storable; use Time::HiRes qw[ time ]; our $N ||= 1e3; my $start1 = time; my @cells; open CELLS, '<', 'cells.dat' or die $!; m[(\d+)\s+(\d+)] and $cells[ $1 ] = $2 while <CELLS>; close CELLS; my @met; $#met = 401; for my $met ( 1 .. 400 ) { open IN, '<', sprintf "met%04d.dat", $met or die "dat $met : $!"; local $/; my @data; $#data = 7300; @data = map [ split ', ' ], map{ split "\n" } <IN>; close IN; push @met, \@data; } printf 'All data loaded in %.2f seconds', time() - $start1; <>; my $start2 = time; for my $cell ( 1 .. $N ) { my $row = int rand 7300; my $col = int rand 9; my $value = $met[ $cells[ $cell ] ][ $row ][ $col ]; } printf "Accessed $N met datasets at a rate of %.2f\n", $N / ( time - $start2 ); __END__ c:\test\752472>752472.pl All data loaded in 41.81 seconds Accessed 1000 met datasets at a rate of 27317.87 c:\test\752472>752472 -N=1e6 All data loaded in 28.31 seconds Accessed 1e6 met datasets at a rate of 842459.96

cells.dat is a flat ascii file with records associating the cell to it's met data:

c:\test\752472>head CELLS.DAT 0000001 0170 0000002 0147 0000003 0349 0000004 0251 0000005 0104 0000006 0006 0000007 0121 0000008 0312 0000009 0325 0000010 0131

And metNNNN.dat are:

c:\test\752472>head MET0001.dat 650, 81, 762, 183, 875, 391, 191, 871, 712 686, 359, 795, 893, 570, 551, 308, 229, 515 140, 420, 808, 387, 642, 744, 800, 151, 215 136, 151, 934, 113, 454, 934, 75, 134, 332 651, 349, 599, 377, 269, 994, 886, 697, 260 712, 42, 166, 428, 300, 533, 331, 821, 60 625, 367, 483, 652, 461, 986, 350, 538, 343 715, 58, 683, 261, 427, 947, 662, 195, 564 49, 831, 230, 921, 330, 720, 998, 435, 975 299, 502, 850, 997, 285, 829, 694, 946, 131

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^8: speeding up row by row lookup in a large db
by BrowserUk (Patriarch) on Mar 23, 2009 at 05:30 UTC

    Alternatively, trade a little time for a lot of space by loading the met data as a 2D rather than 3D array. Ie. Each element of the second level is a string containing the 9 values, rather than an array containing 9 elements.

    This cuts the load time to 8 seconds; the memory usage to a mere 500 MB. It means splitting the column data repeatedly to access the indivual values which slows it down, but still allows the processing of the million cells in around 6 seconds:

    #! perl -slw use strict; use Storable; use Time::HiRes qw[ time ]; our $N ||= 1e3; my $start1 = time; my @cells; open CELLS, '<', 'cells.dat' or die $!; m[(\d+)\s+(\d+)] and $cells[ $1 ] = $2 while <CELLS>; close CELLS; my @met = []; for my $met ( 1 .. 400 ) { open IN, '<', sprintf "met%04d.dat", $met or die "dat $met : $!"; local $/; my @data; $#data = 7300; @data = map{ split "\n" } <IN>; close IN; push @met, \@data; } printf 'All data loaded in %.2f seconds', time() - $start1; <>; my $start2 = time; for my $cell ( 1 .. $N ) { my $row = int rand 7300; my $col = int rand 9; my $rowData = $met[ $cells[ $cell ] ][ $row ]; my $value = (split ', ', $rowData)[ $col ] } printf "Accessed $N met datasets at a rate of %.2f\n", $N / ( time - $start2 ); __END__ c:\test\752472>752472 -N=1e6 All data loaded in 8.93 seconds Accessed 1e6 met datasets at a rate of 165098.24

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.