in reply to Strategy for managing a very large database with Perl

I'd store it as video. Bear with me.

If we take a look at your table definition, we can exclude (for now) t_id and point_id as they are just a part of the RDBMS data representation rather than the actual data. Year and day effectively denote 'frame'. That leaves 3 ints and 3 reals.

Assume (foolish, I know) that 2 of those reals represent X & Y coordinates. We might then think of the 3 integers as roughly equivalent to RGB, and the remaining real as gamma (or alpha).

20 years of days is just over 7000 frames, which at say 30 frames per second equates to a video clip of just over 4 minutes.

100 billion rows / 7000 frames (assuming a roughly 16/9 frame shape) gives roughly a 6000 x 2500 frame size. That's approximately 10x the size of an HD video frame. So 40 minutes of HD movie. And 1 hour of HD movie is about 1 GB to download.

Now, that's mostly an academic exercise: whilst video codec compression is highly developed, it is lossy, and so not much good for scientific data. However, it could lead to thinking about a different way of storing your data, one that might both reduce the storage requirements and lend itself to the type of queries that will be made of it.

For example, if your data can be represented as a (say) 6000 x 2500 2D array of 3 ints and 1 float per point per day, that equates to a 6e3 x 2.5e3 x 16-byte frame per day, or roughly 228 MB raw. Multiplied by 20 years (~7000 frames), that's about 1.6 TB. Much less than your 10 TB projection.

Where have the savings come from? First, we've moved the year/day information out of every record; it's now stored just once, in the name of the file. We've also moved the X & Y coordinates out of every record, as they are now implicit in the position of the data within the file.

And your typical queries become straightforward: "the values of a <variable> for a given point or a given set of points for a particular year+yday combination" turns into: locate the appropriate file (from the date), then do a direct read of 16 bytes at a directly calculated offset for each point in the set.
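
Something along these lines, for a single point (a sketch only; the file naming scheme and the exact 3 x int32 + 1 x float record layout are assumptions):

#! perl -slw
use strict;

use constant {
    XMAX     => 6000,        # assumed frame width from the example above
    REC_SIZE => 3*4 + 4,     # 3 x 32-bit ints + 1 x 32-bit float = 16 bytes (assumed)
};

my( $year, $yday, $x, $y ) = @ARGV;

## One file per day; the name encodes the date, so no date is stored per record
open my $fh, '<:raw', "frames/$year.$yday.bin" or die $!;

## The point's position in the file is implicit: ( row * width + column ) * record size
seek $fh, ( $y * XMAX + $x ) * REC_SIZE, 0 or die $!;
read $fh, my $rec, REC_SIZE or die $!;

## Native 32-bit ints and single-precision float (assumed layout)
my( $i1, $i2, $i3, $r ) = unpack 'l3 f', $rec;
print "$year/$yday ($x,$y): $i1 $i2 $i3 $r";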

No data loss, just a more compact representation that lends itself directly to the types of queries anticipated.

Of course, once you store the data this way, you can then use some other video techniques to compress it further--without loss. You might for example store a "full frame" for the first day of every month, and then a set of { X,Y new value(s) } changes per day in between. This might make sense if only a small subset of the full frame is likely to change on any given day. (Much like video phones transmit change sets that are applied to periodic full frames.)

The full dataset for any given date can quickly be reconstructed (losslessly) by grabbing the preceding "full frame" and then applying the change sets to bring it up to date. And if the particular query being satisfied only applies to a small area of the full frame then pre-filtering the change sets speeds the processing.
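
To make that concrete, a change set could be as simple as a stream of ( x, y, new 16-byte record ) entries in a per-day delta file, and reconstruction is just overlaying those onto the preceding full frame. A rough sketch (the file names and the delta record layout are assumptions):

use strict;
use warnings;

use constant {
    XMAX     => 6000,                    # assumed frame width
    YMAX     => 2500,                    # assumed frame height
    REC_SIZE => 16,                      # 3 x int32 + 1 x float per point (assumed)
};
use constant FRAME_SIZE => XMAX * YMAX * REC_SIZE;

## Slurp a "full frame" file into a single scalar.
sub read_frame {
    my( $path ) = @_;
    open my $fh, '<:raw', $path or die "$path: $!";
    read $fh, my $frame, FRAME_SIZE or die $!;
    return $frame;
}

## Apply one day's change set: records of ( uint32 x, uint32 y, 16 raw bytes ).
sub apply_changes {
    my( $frame, $path ) = @_;
    open my $fh, '<:raw', $path or die "$path: $!";
    while( read $fh, my $change, 8 + REC_SIZE ) {
        my( $x, $y ) = unpack 'V2', $change;
        substr( $$frame, ( $y * XMAX + $x ) * REC_SIZE, REC_SIZE )
            = substr( $change, 8, REC_SIZE );
    }
    return;
}

## Reconstruct (say) day 170 from the monthly full frame for day 152
## plus the daily change sets in between.
my $frame = read_frame( 'frames/2010.152.full.bin' );
apply_changes( \$frame, "frames/2010.$_.delta.bin" ) for 153 .. 170;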

Finally, if the 3 ints and the real for each X,Y position are compatible with representation as RGB & gamma (or one of the other color encodings), then displaying (sub-areas of) the frames as a movie would be a powerful way of visualising the changes in the datasets over time.

Just a thought: normalised relational forms are often not the best way of storing and manipulating datasets like this.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^2: Strategy for managing a very large database with Perl (Video)
by punkish (Priest) on Jun 18, 2010 at 12:52 UTC
    BrowserUk> I'd store it as video. Bear with me.
    
    What a wonderfully whacked-out suggestion. I wouldn't expect any less from you, BrowserUk, as you always come up with an intriguing possibility.

    A few specific responses. First, a series of video frames is not that outlandish, actually: video is nothing but a series of 2D images fluttering by our eyes at 30 fps, so this is akin to storing a bunch of 2D images. Doing locational searches is, as you suggested, a matter of finding the correct file and then finding the subset within that file.

    But these would be a lot of very large images... a pretty bad combination. My dataset's 2D dims are 4,587 x 2,889 = 13,251,843 points, and 23 years is 23 * 365 (or 366) = 8,395 (or 8,418) images. Plus, each "pixel" in the image is an array of my 6 or 7 values. So we are back to complexity in retrieval.

    See my response to moritz at Re^2: Strategy for managing a very large database with Perl. What am I optimizing for?

    • Space? I don't really care about space, because disk drive is a lot cheaper than my time.
    • Integrity? Once loaded, the data are read-only.
    • Ease and speed of retrieval? Yes and Yes.
    I only care about how easily and quickly I can get the data that I want. Say I want to create an image of variable a over an area for a particular day. That is a simple db query: SELECT var FROM table WHERE year = ? AND yday = ?. Note: a columnar database such as Vertica or MonetDB might be very speed-efficient for these queries, but those dbs lack spatial searches and fall short on the ease front.

    Nevertheless, your idea is very intriguing, and I am going to do a few tests with array storage. Of course, as mentioned in an earlier posting, the data are already in NetCDF, an array storage format, so I have to do something new that overcomes the shortcomings of the current format--the main one being the inability to do spatial searches over arbitrary spatial bounds.

    Thanks again, for a wonderful response.

    --

    when small people start casting long shadows, it is time to go to bed
      Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.

      I don't know why you say that?

      The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days' worth of data files:

      C:\test>dir tmp
      18/06/2010  14:28       371,260,960 2010.168.bin
      18/06/2010  14:22       371,260,960 2010.169.bin
                   2 File(s)  742,521,920 bytes

      And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity). The code is just a little math, a read and an unpack--most of the posted code is just parsing the arguments, formatting the output, and timing:

      for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul
      [2010 169 2293:9 0:9] Took 0.020 seconds
      [2010 169 2293:9 500:9] Took 0.017 seconds
      [2010 169 2293:9 1000:9] Took 0.017 seconds
      [2010 169 2293:9 1500:9] Took 0.017 seconds
      [2010 169 2293:9 2000:9] Took 0.019 seconds
      [2010 169 2293:9 2500:9] Took 0.017 seconds

      for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul
      [2010 168 2293:9 0:9] Took 0.021 seconds
      [2010 168 2293:9 500:9] Took 0.017 seconds
      [2010 168 2293:9 1000:9] Took 0.017 seconds
      [2010 168 2293:9 1500:9] Took 0.066 seconds
      [2010 168 2293:9 2000:9] Took 0.023 seconds
      [2010 168 2293:9 2500:9] Took 0.017 seconds

      And here, 100 x 100 x 7 data points. Very linear, as expected.

      for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul
      [2010 169 2293:99 0:99] Took 0.115 seconds
      [2010 169 2293:99 500:99] Took 0.115 seconds
      [2010 169 2293:99 1000:99] Took 0.117 seconds
      [2010 169 2293:99 1500:99] Took 0.116 seconds
      [2010 169 2293:99 2000:99] Took 0.115 seconds
      [2010 169 2293:99 2500:99] Took 0.116 seconds

      for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul
      [2010 168 2293:99 0:99] Took 0.125 seconds
      [2010 168 2293:99 500:99] Took 0.116 seconds
      [2010 168 2293:99 1000:99] Took 0.114 seconds
      [2010 168 2293:99 1500:99] Took 0.115 seconds
      [2010 168 2293:99 2000:99] Took 0.115 seconds
      [2010 168 2293:99 2500:99] Took 0.115 seconds

      So: very simple code, and very fast. And the entire uncompressed dataset (23 * 365.25 * 354 MB) comes to just under 3 TB.

      With compression, that could be as little as 1.3 TB, though you'd have to pay the price for unpacking--roughly 30 seconds per file.

      18/06/2010  14:28       237,173,932 2010.168.bin.gz
      18/06/2010  14:22       175,868,626 2010.169.bin.bz2

      But the main point of partitioning your dataset this way is that you reduce the search space to a single day's file--about 1/80th of 1%--as soon as you specify the year/day. And there is no searching of indexes involved in the rest of the query. Just a simple calculation and a direct seek.

      Anyway, 'tis your data and your employer's money :)

      The (flawed) test demo code.

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];

      use constant {
          XMAX     => 4587,
          YMAX     => 2889,
          REC_SIZE => 7 * 4,          # 7 x 32-bit values per point
      };

      my( $year, $day, $xRange, $yRange ) = @ARGV;

      ## Ranges are given as start:count
      my( $xStart, $xEnd ) = split ':', $xRange; $xEnd += $xStart;
      my( $yStart, $yEnd ) = split ':', $yRange; $yEnd += $yStart;

      my $start = time;

      open BIN, '<:perlio', "tmp/$year.$day.bin" or die $!;
      binmode BIN;

      my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE;

      for my $y ( $yStart .. $yEnd ) {
          ## Offset of the first wanted record on this row
          my $pos = ( $y * XMAX * REC_SIZE ) + $xStart * REC_SIZE;
          seek BIN, $pos, 0;
          my $read = read( BIN, my $rec, $xLen ) or die $!;

          ## 'a28' (not 'A28') so trailing NULs are not stripped from each record
          my @recs = unpack '(a28)*', $rec;

          for my $x ( $xStart .. $xEnd ) {
              my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $xStart ];
              printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n",
                  $year, $day, $a, $b, $c, $d, $e, $f, $g // 0;
          }
      }

      close BIN;

      printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start;
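
      A dummy data file in that layout can be knocked up with something along these lines (just a sketch; the values are random noise, and the exact dimensions/padding of the real files may differ):

      use strict;
      use warnings;

      use constant {
          XMAX   => 4587,
          YMAX   => 2889,
          VALUES => 7,            # 7 x 32-bit values per point
      };

      my( $year, $day ) = @ARGV;

      open my $out, '>:raw', "tmp/$year.$day.bin" or die $!;

      ## Write one row of random records at a time to keep memory use sane
      for my $y ( 1 .. YMAX ) {
          print {$out} pack 'N*', map int( rand 2**32 ), 1 .. XMAX * VALUES;
      }

      close $out;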

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk> I dummied up two days' worth of data files:
        
        Could you please share your code to create the dummy data? One, I would get to see what your data look like, and two, I would learn a few tricks meself.
        BrowserUk> With compression, that could be as little as 1.3 TB. 
        BrowserUk> Though you'd have a pay the price for unpacking--
        BrowserUk> ~30 seconds per file.
        
        Right. Good to keep in mind, but I'm not interested in paying the 30-second price to optimize for space. Space is not a constraining factor, for now.
        BrowserUk> But the main point of partitioning your dataset 
        BrowserUk> this way is that you reduce the search space to 
        BrowserUk> about 1/80th of 1% as soon as you specify the year/day.
        BrowserUk> And there is no searching of indexes involved in 
        BrowserUk> the rest of the query. Just a simple calculation 
        BrowserUk> and a direct seek.
        
        Of course, I have to add the cost of looking up the spatial extent, which I have to do via the Pg database first. But that gives me the area I want to pluck out of my "image", which I can then work with.
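
        Hypothetically, that two-step flow might look something like this (the dbname, table and column names, and the pluck.pl reader script, are made-up placeholders, not my actual schema):

        use strict;
        use warnings;
        use DBI;

        my $region_id = shift @ARGV;

        ## Step 1: get the x/y grid bounds of the area of interest from Pg.
        ## Everything named here is hypothetical.
        my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', '', '', { RaiseError => 1 } );

        my $sql = q{
            SELECT min(grid_x), max(grid_x), min(grid_y), max(grid_y)
            FROM   grid_cells
            WHERE  region_id = ?
        };
        my( $xMin, $xMax, $yMin, $yMax )
            = $dbh->selectrow_array( $sql, undef, $region_id );

        ## Step 2: hand those bounds to the direct-seek reader above,
        ## invoked here as a hypothetical pluck.pl: year day xStart:count yStart:count
        system 'perl', 'pluck.pl', 2010, 169,
               "$xMin:" . ( $xMax - $xMin ), "$yMin:" . ( $yMax - $yMin );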
        BrowserUk> Anyway, 'tis your data and your employer's money :)
        
        True. This is research, so trying out different ways is a worthwhile exercise in itself. I don't get paid too much, so it is not a lot of money on my employer's part. ;-). Nevertheless, thanks much. This is fantastic. Always a pleasure to ask and learn different approaches to solving a problem.
        --

        when small people start casting long shadows, it is time to go to bed

      I'm not sure how large the cost differences are (CPU decompression vs. file I/O), but there is also Compress::LZF, which claims to be almost as fast as a simple memcopy - maybe it provides enough compression that the saved disk I/O outweighs the decompression cost.
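
      For what it's worth, a minimal sketch (the file names and the 4587-wide, 28-bytes-per-point layout are taken from the demo above; whether the saved disk I/O actually beats the decompression cost would need measuring on the real data):

      use strict;
      use warnings;
      use Compress::LZF;      # exports compress() and decompress()

      use constant { XMAX => 4587, REC_SIZE => 7 * 4 };

      ## Compress one day's raw frame once, up front.
      {
          open my $in,  '<:raw', 'tmp/2010.169.bin'     or die $!;
          open my $out, '>:raw', 'tmp/2010.169.bin.lzf' or die $!;
          local $/;                                   # slurp mode
          print {$out} compress( scalar <$in> );
      }

      ## At query time: decompress the whole frame into memory, then index it
      ## directly with substr instead of seeking in the file.
      open my $lzf, '<:raw', 'tmp/2010.169.bin.lzf' or die $!;
      my $frame = do { local $/; decompress( scalar <$lzf> ) };

      my( $x, $y ) = ( 2293, 1000 );
      my @values  = unpack 'N7', substr( $frame, ( $y * XMAX + $x ) * REC_SIZE, REC_SIZE );
      print "@values\n";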