Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.

I don't know why you say that?

The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days' worth of data files:

    C:\test>dir tmp
    18/06/2010  14:28       371,260,960 2010.168.bin
    18/06/2010  14:22       371,260,960 2010.169.bin
                   2 File(s)    742,521,920 bytes
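For anyone wanting to reproduce the test, dummy files of this kind could be produced along the following lines. This is only a guess at a suitable generator, not the script actually used: the sub name `write_dummy_file`, the output file name and the fill values are all invented, and the only thing taken from the demo code is the record layout of 7 big-endian 32-bit values ('N7').

```perl
use strict;
use warnings;

# Hypothetical generator: writes $rows x $cols records, each holding
# 7 big-endian unsigned 32-bit values ('N7', the format the demo unpacks).
sub write_dummy_file {
    my( $path, $cols, $rows ) = @_;
    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    for my $y ( 0 .. $rows - 1 ) {
        # Build one full row in memory, then write it in a single call.
        # The values are arbitrary: x, y, then five zeros.
        my $row = join '', map { pack 'N7', $_, $y, ( 0 ) x 5 } 0 .. $cols - 1;
        print {$out} $row or die "Write to $path failed: $!";
    }
    close $out or die "Cannot close $path: $!";
    return -s $path;    # size in bytes
}

# A small grid for experimenting; scale $cols and $rows up to taste.
my $bytes = write_dummy_file( 'dummy.bin', 100, 50 );
print "Wrote $bytes bytes\n";
```

Each record is 28 bytes, so the file size is simply columns x rows x 28.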

And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity). The code is just a little math, a read and an unpack--most of the posted code is just parsing the arguments and formatting the output and timing:

    for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul
    [2010 169 2293:9 0:9] Took 0.020 seconds
    [2010 169 2293:9 500:9] Took 0.017 seconds
    [2010 169 2293:9 1000:9] Took 0.017 seconds
    [2010 169 2293:9 1500:9] Took 0.017 seconds
    [2010 169 2293:9 2000:9] Took 0.019 seconds
    [2010 169 2293:9 2500:9] Took 0.017 seconds

    for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul
    [2010 168 2293:9 0:9] Took 0.021 seconds
    [2010 168 2293:9 500:9] Took 0.017 seconds
    [2010 168 2293:9 1000:9] Took 0.017 seconds
    [2010 168 2293:9 1500:9] Took 0.066 seconds
    [2010 168 2293:9 2000:9] Took 0.023 seconds
    [2010 168 2293:9 2500:9] Took 0.017 seconds

And here 100 x 100 x 7 data points. Very linear as expected.

    for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul
    [2010 169 2293:99 0:99] Took 0.115 seconds
    [2010 169 2293:99 500:99] Took 0.115 seconds
    [2010 169 2293:99 1000:99] Took 0.117 seconds
    [2010 169 2293:99 1500:99] Took 0.116 seconds
    [2010 169 2293:99 2000:99] Took 0.115 seconds
    [2010 169 2293:99 2500:99] Took 0.116 seconds

    for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul
    [2010 168 2293:99 0:99] Took 0.125 seconds
    [2010 168 2293:99 500:99] Took 0.116 seconds
    [2010 168 2293:99 1000:99] Took 0.114 seconds
    [2010 168 2293:99 1500:99] Took 0.115 seconds
    [2010 168 2293:99 2000:99] Took 0.115 seconds
    [2010 168 2293:99 2500:99] Took 0.115 seconds

So, very simple code and very fast. And the entire uncompressed dataset (23 years * 365.25 days * 354MB per file) comes to just under 3TB.

With compression, that could be as little as 1.3 TB. Though you'd have to pay the price for unpacking: ~30 seconds per file.

    18/06/2010  14:28       237,173,932 2010.168.bin.gz
    18/06/2010  14:22       175,868,626 2010.169.bin.bz2
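The storage figures can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes every daily file is the same size as the dummy files and compresses about as well as the two samples listed, which real data may well not. The sub name `projected_tb` is invented for the example, and TB here means 2**40 bytes.

```perl
use strict;
use warnings;

my $FILES = 23 * 365.25;    # 23 years of daily files

# Hypothetical helper: total size in TB (2**40 bytes) if every
# daily file were $bytes long.
sub projected_tb {
    my( $bytes ) = @_;
    return $FILES * $bytes / 2**40;
}

# Per-file sizes taken from the directory listings.
my %sample = (
    'uncompressed' => 371_260_960,
    'gzip'         => 237_173_932,
    'bzip2'        => 175_868_626,
);

printf "%-12s %5.2f TB\n", $_, projected_tb( $sample{ $_ } )
    for sort { $sample{ $b } <=> $sample{ $a } } keys %sample;
```

With those sample sizes it works out to roughly 2.8 TB uncompressed, 1.8 TB with gzip and 1.3 TB with bzip2.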

But the main point of partitioning your dataset this way is that you reduce the search space to roughly 1/84th of 1% (one file out of some 8,400) as soon as you specify the year/day. And there is no searching of indexes involved in the rest of the query. Just a simple calculation and a direct seek.
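Boiled down, that calculation looks something like the sketch below. The sub names `day_file` and `record_offset` are made up for illustration; the file naming, the 4587-records-per-row stride and the 28-byte record size are the ones the demo code assumes, so adjust them to the real layout.

```perl
use strict;
use warnings;

use constant {
    XMAX     => 4587,     # records per row, as assumed in the demo code
    REC_SIZE => 7 * 4,    # 7 values x 4 bytes each
};

# Hypothetical helper: the year/day pair picks the file directly,
# so no index lookup is needed.
sub day_file {
    my( $year, $day ) = @_;
    return "tmp/$year.$day.bin";
}

# Hypothetical helper: byte offset of record (x, y) within that file.
sub record_offset {
    my( $x, $y ) = @_;
    return ( $y * XMAX + $x ) * REC_SIZE;
}

printf "%s @ byte %d\n", day_file( 2010, 169 ), record_offset( 2293, 500 );
```

Everything after that is a seek to the computed offset and a read of however many records the row segment needs.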

Anyway, 'tis your data and your employer's money :)

The (flawed) test demo code.

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    use constant {
        XMAX     => 4587,
        YMAX     => 2889,
        REC_SIZE => 7 * 4,    # 7 unsigned 32-bit values per record
    };

    my( $year, $day, $xRange, $yRange ) = @ARGV;

    # Ranges are given as start:extent, e.g. 2293:9 covers x = 2293 .. 2302
    my( $xStart, $xEnd ) = split ':', $xRange;
    $xEnd += $xStart;
    my( $yStart, $yEnd ) = split ':', $yRange;
    $yEnd += $yStart;

    my $start = time;

    open BIN, '<:perlio', "tmp/$year.$day.bin" or die $!;
    binmode BIN;

    # Bytes in one row segment of the requested rectangle
    my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE;

    for my $y ( $yStart .. $yEnd ) {
        my $pos = ( $y * XMAX * REC_SIZE ) + $xStart * REC_SIZE;

        # sysseek rather than seek: perlfunc advises against mixing seek with sysread
        sysseek BIN, $pos, 0 or die $!;
        my $read = sysread( BIN, my $rec, $xLen ) or die $!;

        # 'a28' keeps each record's bytes verbatim; 'A28' would strip
        # trailing NULs and spaces and so truncate some records
        my @recs = unpack '(a28)*', $rec;

        for my $x ( $xStart .. $xEnd ) {
            my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $xStart ];
            printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n",
                $year, $day, $a, $b, $c, $d, $e, $f, $g // 0;
        }
    }
    close BIN;

    printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

In reply to Re^3: Strategy for managing a very large database with Perl (Video) by BrowserUk
in thread Strategy for managing a very large database with Perl by punkish
