in reply to Strategy for managing a very large database with Perl
I'd store it as video. Bear with me.
If we take a look at your table definition, we can exclude (for now) t_id and point_id as they are just a part of the RDBMS data representation rather than the actual data. Year and day effectively denote 'frame'. That leaves 3 ints and 3 reals.
Assuming (I know foolish), that 2 of those reals represent X & Y coordinates. We might also think of the 3 integers as being roughly equivalent to RGB. and the last as gamma (or alpha).
20 years of days is just over 7000 frames, which at say 30 frames per second equates to a video clip of just over 4 minutes.
100 billion rows / 7000 frames (assuming a roughly 16/9 frame shape) gives roughly a 6000 x 2500 frame size. That's approximately 10x the size of an HD video frame. So 40 minutes of HD movie. And 1 hour of HD movie is about 1 GB to download.
Now, that's mostly an academic exercise as whilst video codec compression is highly developed, it is lossy, and so not really much good for academic data. However, it could lead to thinking about a different way of storing your data that might both reduce the storage requirements and lend itself to the type of queries that will be made of it.
For example, if your data can be represented by a (say 6000 x 2500) 2D array of 3 ints and 1 float per point per day. Then that equates to a 6e3 x 2.5e3 x 16-bytes frame per day. Or 228 MB raw. * 20 years (7000) = 1.6 TB. Much less than your 10 TB projection.
Where has the savings come from? First we've moved the year/day information out of every record. It's now just stored once in the name of the file. We've also moved the X & Y coordinates out of every record as they are now implicit in the position of the data within the file.
And your typical queries are straight forward: "the values of a <variable> for a given point or a given set of points for a particular year+yday combination.", becomes: locate the appropriate file (date), and a direct read of 16 bytes from a directly calculated offset for each point or given set of points.
No data loss, just a more compact representation that lends itself directly to the types of queries anticipated.
Of course, once you store the data this way, you can then use some other video techniques to compress it further--without loss. You might for example store a "full frame" for the first day of every month, and then a set of { X,Y new value(s) } changes per day in between. This might make sense if only a small subset of the full frame is likely to change on any given day. (Much like video phones transmit change sets that are applied to periodic full frames.)
The full dataset for any given date can quickly be reconstructed (losslessly) by grabbing the preceding "full frame" and then applying the change sets to bring it up to date. And if the particular query being satisfied only applies to a small area of the full frame then pre-filtering the change sets speeds the processing.
Finally, if the 3 ints and the real for each X,Y position are compatible with representation as RGB & gamma (or one of the other color encodings), then displaying (sub-areas of) the frames as a movie would be a powerful way of visualising the changes in the datasets over time.
Just a thought that normalised relational forms are often not the best way of storing and manipulating datasets.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^2: Strategy for managing a very large database with Perl (Video)
by punkish (Priest) on Jun 18, 2010 at 12:52 UTC | |
by BrowserUk (Patriarch) on Jun 18, 2010 at 14:45 UTC | |
by punkish (Priest) on Jun 18, 2010 at 15:05 UTC | |
by BrowserUk (Patriarch) on Jun 18, 2010 at 15:45 UTC | |
by Corion (Patriarch) on Jun 18, 2010 at 16:06 UTC | |
by BrowserUk (Patriarch) on Jun 18, 2010 at 16:22 UTC |