Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.
I don't know why you say that?
The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days' worth of data files:
C:\test>dir tmp
18/06/2010  14:28       371,260,960 2010.168.bin
18/06/2010  14:22       371,260,960 2010.169.bin
               2 File(s)    742,521,920 bytes
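(For reference, a day file like that can be dummied up with something like the following -- a hypothetical sketch, not the code that was timed. It just writes XMAX x YMAX records of 7 packed 32-bit big-endian values, which comes out at roughly the 371 MB shown above.)

    #! perl -w
    use strict;

    use constant {
        XMAX => 4587,
        YMAX => 2889,
    };

    ## Hypothetical sketch: fill one day file with random records,
    ## 7 x 32-bit big-endian values per pixel, written one row at a time.
    my( $year, $day ) = @ARGV;

    open my $out, '>:raw', "tmp/$year.$day.bin" or die $!;
    for ( 1 .. YMAX ) {
        print {$out} pack 'N*', map int( rand 2**32 ), 1 .. 7 * XMAX;
    }
    close $out;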
And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity). The guts of it are just a little math, a read and an unpack--most of the posted code is parsing the arguments, formatting the output and timing:
for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul
[2010 169 2293:9 0:9] Took 0.020 seconds
[2010 169 2293:9 500:9] Took 0.017 seconds
[2010 169 2293:9 1000:9] Took 0.017 seconds
[2010 169 2293:9 1500:9] Took 0.017 seconds
[2010 169 2293:9 2000:9] Took 0.019 seconds
[2010 169 2293:9 2500:9] Took 0.017 seconds

for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul
[2010 168 2293:9 0:9] Took 0.021 seconds
[2010 168 2293:9 500:9] Took 0.017 seconds
[2010 168 2293:9 1000:9] Took 0.017 seconds
[2010 168 2293:9 1500:9] Took 0.066 seconds
[2010 168 2293:9 2000:9] Took 0.023 seconds
[2010 168 2293:9 2500:9] Took 0.017 seconds
And here, 100 x 100 x 7 data points. Very linear, as expected:
for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul
[2010 169 2293:99 0:99] Took 0.115 seconds
[2010 169 2293:99 500:99] Took 0.115 seconds
[2010 169 2293:99 1000:99] Took 0.117 seconds
[2010 169 2293:99 1500:99] Took 0.116 seconds
[2010 169 2293:99 2000:99] Took 0.115 seconds
[2010 169 2293:99 2500:99] Took 0.116 seconds

for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul
[2010 168 2293:99 0:99] Took 0.125 seconds
[2010 168 2293:99 500:99] Took 0.116 seconds
[2010 168 2293:99 1000:99] Took 0.114 seconds
[2010 168 2293:99 1500:99] Took 0.115 seconds
[2010 168 2293:99 2000:99] Took 0.115 seconds
[2010 168 2293:99 2500:99] Took 0.115 seconds
So: very simple code, and very fast. And the entire uncompressed dataset (23 years * 365.25 days * 354 MB per day) comes to just under 3 TB.
With compression, that could be as little as 1.3 TB. Though you'd have to pay the price for unpacking--~30 seconds per file.
18/06/2010  14:28       237,173,932 2010.168.bin.gz
18/06/2010  14:22       175,868,626 2010.169.bin.bz2
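If you did go the compressed route, a query against a not-yet-inflated day would first have to do something like this (a minimal sketch using the core IO::Uncompress::Gunzip module; the ~30 seconds is the one-off cost of inflating a whole day file):

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw( gunzip $GunzipError );

    ## Hypothetical one-off inflate of a compressed day file before querying it
    my( $year, $day ) = ( 2010, 168 );

    gunzip "tmp/$year.$day.bin.gz" => "tmp/$year.$day.bin"
        or die "gunzip failed: $GunzipError";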
But the main point of partitioning your dataset this way is that you reduce the search space to 1/8th of 1% as soon as you specify the year/day. And there is no searching of indexes involved in the rest of the query. Just a simple calculation and a direct seek.
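To make the "simple calculation and a direct seek" concrete, here is a stripped-down single-pixel lookup (a sketch using the same layout and constants as the posted code below; the example coordinates are made up):

    use strict;
    use warnings;
    use constant { XMAX => 4587, REC_SIZE => 7 * 4 };

    ## Hypothetical example coordinates
    my( $year, $day, $x, $y ) = ( 2010, 168, 2293, 1500 );

    open my $fh, '<:raw', "tmp/$year.$day.bin" or die $!;

    ## Rows are stored y-major, XMAX records per row, 28 bytes per record,
    ## so the pixel's offset is plain arithmetic -- no index lookup needed.
    my $pos = ( $y * XMAX + $x ) * REC_SIZE;

    seek $fh, $pos, 0;
    read $fh, my $rec, REC_SIZE;
    my @values = unpack 'N7', $rec;    ## the 6 or 7 values for that pixel
    close $fh;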
Anyway, 'tis your data and your employer's money :)
The (flawed) test demo code.
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

use constant {
    XMAX     => 4587,
    YMAX     => 2889,
    REC_SIZE => 7 * 4,      ## 7 x 32-bit values per pixel
};

## Args: year day xStart:xCount yStart:yCount
my( $year, $day, $xRange, $yRange ) = @ARGV;
my( $xStart, $xEnd ) = split ':', $xRange;
$xEnd += $xStart;
my( $yStart, $yEnd ) = split ':', $yRange;
$yEnd += $yStart;

my $start = time;

open BIN, '<:perlio', "tmp/$year.$day.bin" or die $!;
binmode BIN;

## Bytes needed for one row of the requested rectangle
my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE;

for my $y ( $yStart .. $yEnd ) {
    ## Offset of the first requested record in this row; one read per row
    my $pos = ( $y * XMAX * REC_SIZE ) + $xStart * REC_SIZE;
    seek BIN, $pos, 0;
    my $read = sysread( BIN, my $rec, $xLen ) or die $!;

    ## 'a28' (not 'A28') so trailing nulls in the binary records are preserved
    my @recs = unpack '(a28)*', $rec;
    for my $x ( $xStart .. $xEnd ) {
        my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $xStart ];
        printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n",
            $year, $day, $a, $b, $c, $d, $e, $f, $g//0;
    }
}
close BIN;

printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start;