Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.
I don't know why you say that?
The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days' worth of data files:
C:\test>dir tmp
18/06/2010  14:28       371,260,960 2010.168.bin
18/06/2010  14:22       371,260,960 2010.169.bin
               2 File(s)    742,521,920 bytes
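(For reference, a day file like that can be dummied up with something like the following -- a hypothetical sketch, not the code that was timed. It just writes XMAX x YMAX records of 7 packed 32-bit big-endian values, which comes out at roughly the 371 MB shown above.)

    #! perl -w
    use strict;

    use constant {
        XMAX => 4587,
        YMAX => 2889,
    };

    ## Hypothetical sketch: fill one day file with random records,
    ## 7 x 32-bit big-endian values per pixel, written one row at a time.
    my( $year, $day ) = @ARGV;

    open my $out, '>:raw', "tmp/$year.$day.bin" or die $!;
    for ( 1 .. YMAX ) {
        print {$out} pack 'N*', map int( rand 2**32 ), 1 .. 7 * XMAX;
    }
    close $out;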
And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity). The guts of it are just a little math, a read and an unpack--most of the posted code is parsing the arguments, formatting the output and timing:
for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul
[2010 169 2293:9 0:9] Took 0.020 seconds
[2010 169 2293:9 500:9] Took 0.017 seconds
[2010 169 2293:9 1000:9] Took 0.017 seconds
[2010 169 2293:9 1500:9] Took 0.017 seconds
[2010 169 2293:9 2000:9] Took 0.019 seconds
[2010 169 2293:9 2500:9] Took 0.017 seconds

for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul
[2010 168 2293:9 0:9] Took 0.021 seconds
[2010 168 2293:9 500:9] Took 0.017 seconds
[2010 168 2293:9 1000:9] Took 0.017 seconds
[2010 168 2293:9 1500:9] Took 0.066 seconds
[2010 168 2293:9 2000:9] Took 0.023 seconds
[2010 168 2293:9 2500:9] Took 0.017 seconds
And here, 100 x 100 x 7 data points. Very linear, as expected:
for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul
[2010 169 2293:99 0:99] Took 0.115 seconds
[2010 169 2293:99 500:99] Took 0.115 seconds
[2010 169 2293:99 1000:99] Took 0.117 seconds
[2010 169 2293:99 1500:99] Took 0.116 seconds
[2010 169 2293:99 2000:99] Took 0.115 seconds
[2010 169 2293:99 2500:99] Took 0.116 seconds

for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul
[2010 168 2293:99 0:99] Took 0.125 seconds
[2010 168 2293:99 500:99] Took 0.116 seconds
[2010 168 2293:99 1000:99] Took 0.114 seconds
[2010 168 2293:99 1500:99] Took 0.115 seconds
[2010 168 2293:99 2000:99] Took 0.115 seconds
[2010 168 2293:99 2500:99] Took 0.115 seconds
So: very simple code, and very fast. And the entire uncompressed dataset (23 years * 365.25 days * 354 MB per day) comes to just under 3 TB.
With compression, that could be as little as 1.3 TB. Though you'd have to pay the price for unpacking--~30 seconds per file.
18/06/2010  14:28       237,173,932 2010.168.bin.gz
18/06/2010  14:22       175,868,626 2010.169.bin.bz2
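If you did go the compressed route, a query against a not-yet-inflated day would first have to do something like this (a minimal sketch using the core IO::Uncompress::Gunzip module; the ~30 seconds is the one-off cost of inflating a whole day file):

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw( gunzip $GunzipError );

    ## Hypothetical one-off inflate of a compressed day file before querying it
    my( $year, $day ) = ( 2010, 168 );

    gunzip "tmp/$year.$day.bin.gz" => "tmp/$year.$day.bin"
        or die "gunzip failed: $GunzipError";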
But the main point of partitioning your dataset this way is that you reduce the search space to 1/8th of 1% as soon as you specify the year/day. And there is no searching of indexes involved in the rest of the query. Just a simple calculation and a direct seek.
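To make the "simple calculation and a direct seek" concrete, here is a stripped-down single-pixel lookup (a sketch using the same layout and constants as the posted code below; the example coordinates are made up):

    use strict;
    use warnings;
    use constant { XMAX => 4587, REC_SIZE => 7 * 4 };

    ## Hypothetical example coordinates
    my( $year, $day, $x, $y ) = ( 2010, 168, 2293, 1500 );

    open my $fh, '<:raw', "tmp/$year.$day.bin" or die $!;

    ## Rows are stored y-major, XMAX records per row, 28 bytes per record,
    ## so the pixel's offset is plain arithmetic -- no index lookup needed.
    my $pos = ( $y * XMAX + $x ) * REC_SIZE;

    seek $fh, $pos, 0;
    read $fh, my $rec, REC_SIZE;
    my @values = unpack 'N7', $rec;    ## the 6 or 7 values for that pixel
    close $fh;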
Anyway, 'tis your data and your employer's money :)
The (flawed) test demo code.
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

use constant {
    XMAX     => 4587,
    YMAX     => 2889,
    REC_SIZE => 7 * 4,      ## 7 x 32-bit values per pixel
};

## Args: year day xStart:xCount yStart:yCount
my( $year, $day, $xRange, $yRange ) = @ARGV;
my( $xStart, $xEnd ) = split ':', $xRange;
$xEnd += $xStart;
my( $yStart, $yEnd ) = split ':', $yRange;
$yEnd += $yStart;

my $start = time;

open BIN, '<:perlio', "tmp/$year.$day.bin" or die $!;
binmode BIN;

## Bytes needed for one row of the requested rectangle
my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE;

for my $y ( $yStart .. $yEnd ) {
    ## Offset of the first requested record in this row; one read per row
    my $pos = ( $y * XMAX * REC_SIZE ) + $xStart * REC_SIZE;
    seek BIN, $pos, 0;
    my $read = sysread( BIN, my $rec, $xLen ) or die $!;

    ## 'a28' (not 'A28') so trailing nulls in the binary records are preserved
    my @recs = unpack '(a28)*', $rec;
    for my $x ( $xStart .. $xEnd ) {
        my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $xStart ];
        printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n",
            $year, $day, $a, $b, $c, $d, $e, $f, $g//0;
    }
}
close BIN;

printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start;