Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.
I don't know why you say that.
The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days' worth of data files:
C:\test>dir tmp
18/06/2010 14:28 371,260,960 2010.168.bin
18/06/2010 14:22 371,260,960 2010.169.bin
2 File(s) 742,521,920 bytes
And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity; ranges are given as start:extent, so 2293:9 means x = 2293 .. 2302). The working part is just a little math, a read and an unpack; most of the posted code is parsing the arguments, formatting the output, and timing:
for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul
[2010 169 2293:9 0:9] Took 0.020 seconds
[2010 169 2293:9 500:9] Took 0.017 seconds
[2010 169 2293:9 1000:9] Took 0.017 seconds
[2010 169 2293:9 1500:9] Took 0.017 seconds
[2010 169 2293:9 2000:9] Took 0.019 seconds
[2010 169 2293:9 2500:9] Took 0.017 seconds
for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul
[2010 168 2293:9 0:9] Took 0.021 seconds
[2010 168 2293:9 500:9] Took 0.017 seconds
[2010 168 2293:9 1000:9] Took 0.017 seconds
[2010 168 2293:9 1500:9] Took 0.066 seconds
[2010 168 2293:9 2000:9] Took 0.023 seconds
[2010 168 2293:9 2500:9] Took 0.017 seconds
And here, 100 x 100 x 7 data points. Very linear, as expected.
for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul
[2010 169 2293:99 0:99] Took 0.115 seconds
[2010 169 2293:99 500:99] Took 0.115 seconds
[2010 169 2293:99 1000:99] Took 0.117 seconds
[2010 169 2293:99 1500:99] Took 0.116 seconds
[2010 169 2293:99 2000:99] Took 0.115 seconds
[2010 169 2293:99 2500:99] Took 0.116 seconds
for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul
[2010 168 2293:99 0:99] Took 0.125 seconds
[2010 168 2293:99 500:99] Took 0.116 seconds
[2010 168 2293:99 1000:99] Took 0.114 seconds
[2010 168 2293:99 1500:99] Took 0.115 seconds
[2010 168 2293:99 2000:99] Took 0.115 seconds
[2010 168 2293:99 2500:99] Took 0.115 seconds
So: very simple code, and very fast. And the entire uncompressed dataset (23 years * 365.25 days * 354 MB per day) comes to just under 3 TB.
With compression, that could be as little as 1.3 TB, though you'd have to pay the price for unpacking: roughly 30 seconds per file (a sketch of inflating on demand follows the listing below).
18/06/2010 14:28 237,173,932 2010.168.bin.gz
18/06/2010 14:22 175,868,626 2010.169.bin.bz2
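If you went the compressed route, one option would be to inflate a day's file the first time it is queried and keep the inflated copy around for later queries against that day. A minimal sketch, not part of the timings above and making assumptions about layout (the .gz sitting alongside where the .bin will be written); it uses the core IO::Uncompress::Gunzip module:

#! perl -slw
## Sketch only: inflate a year/day's .gz on first use, then query the
## cached .bin with the same seek/read code as the demo below.
use strict;
use IO::Uncompress::Gunzip qw[ gunzip $GunzipError ];

my( $year, $day ) = @ARGV;
my $bin = "tmp/$year.$day.bin";

unless( -e $bin ) {
    ## Pay the ~30 second unpacking cost only the first time this
    ## year/day is hit; later queries go straight to the cached .bin.
    gunzip "$bin.gz" => $bin, BinModeOut => 1
        or die "gunzip '$bin.gz' failed: $GunzipError";
}

## ... then open $bin and seek/read exactly as in the demo code below.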
But the main point of partitioning your dataset this way is that as soon as you specify the year/day, you have reduced the search space to a single ~354 MB file out of roughly 8,400 (23 years * 365.25 days), around 1/80th of 1% of the total. And there is no searching of indexes involved in the rest of the query: just a simple calculation and a direct seek.
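To make that concrete: with the demo's constants (4587 data points per row, 28 bytes per data point), the code locates the window starting at (x=2293, y=500) at byte offset (500 * 4587 + 2293) * 28 = 64,282,204, and then pulls each 10-point row of that window with a single 280-byte read.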
Anyway, 'tis your data and your employer's money :)
The (flawed) test demo code.
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

use constant {
    XMAX     => 4587,       ## data points per image row
    YMAX     => 2889,       ## rows per image
    REC_SIZE => 7 * 4,      ## 7 x 32-bit unsigned values per data point
};

## Ranges are given as START:EXTENT, e.g. 2293:9 means x = 2293 .. 2302
my( $year, $day, $xRange, $yRange ) = @ARGV;
my( $xStart, $xEnd ) = split ':', $xRange;
$xEnd += $xStart;
my( $yStart, $yEnd ) = split ':', $yRange;
$yEnd += $yStart;

my $start = time;

open BIN, '<:raw', "tmp/$year.$day.bin" or die $!;

## Bytes per row of the requested window
my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE;

for my $y ( $yStart .. $yEnd ) {
    ## Offset of the first wanted record in this row:
    ## whole rows above the window plus the records to its left.
    my $pos = ( $y * XMAX + $xStart ) * REC_SIZE;
    sysseek BIN, $pos, 0 or die $!;
    my $read = sysread( BIN, my $rec, $xLen );
    die $! unless $read and $read == $xLen;

    ## Split the row buffer into fixed 28-byte records
    ## ('a', not 'A', so trailing null bytes are preserved)
    my @recs = unpack '(a28)*', $rec;

    for my $x ( $xStart .. $xEnd ) {
        my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $xStart ];
        printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n",
            $year, $day, $a, $b, $c, $d, $e, $f, $g;
    }
}
close BIN;

printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start;