Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval.
I don't know why you say that.
The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days' worth of data files:
C:\test>dir tmp
18/06/2010 14:28 371,260,960 2010.168.bin
18/06/2010 14:22 371,260,960 2010.169.bin
2 File(s) 742,521,920 bytes
And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity; ranges are given as start:extent, so 2293:9 means x = 2293 .. 2302). The working part is just a little math, a read and an unpack; most of the posted code is parsing the arguments, formatting the output, and timing:
for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul
[2010 169 2293:9 0:9] Took 0.020 seconds
[2010 169 2293:9 500:9] Took 0.017 seconds
[2010 169 2293:9 1000:9] Took 0.017 seconds
[2010 169 2293:9 1500:9] Took 0.017 seconds
[2010 169 2293:9 2000:9] Took 0.019 seconds
[2010 169 2293:9 2500:9] Took 0.017 seconds
for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul
[2010 168 2293:9 0:9] Took 0.021 seconds
[2010 168 2293:9 500:9] Took 0.017 seconds
[2010 168 2293:9 1000:9] Took 0.017 seconds
[2010 168 2293:9 1500:9] Took 0.066 seconds
[2010 168 2293:9 2000:9] Took 0.023 seconds
[2010 168 2293:9 2500:9] Took 0.017 seconds
And here, 100 x 100 x 7 data points. Very linear, as expected.
for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul
[2010 169 2293:99 0:99] Took 0.115 seconds
[2010 169 2293:99 500:99] Took 0.115 seconds
[2010 169 2293:99 1000:99] Took 0.117 seconds
[2010 169 2293:99 1500:99] Took 0.116 seconds
[2010 169 2293:99 2000:99] Took 0.115 seconds
[2010 169 2293:99 2500:99] Took 0.116 seconds
for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul
[2010 168 2293:99 0:99] Took 0.125 seconds
[2010 168 2293:99 500:99] Took 0.116 seconds
[2010 168 2293:99 1000:99] Took 0.114 seconds
[2010 168 2293:99 1500:99] Took 0.115 seconds
[2010 168 2293:99 2000:99] Took 0.115 seconds
[2010 168 2293:99 2500:99] Took 0.115 seconds
So: very simple code, and very fast. And the entire uncompressed dataset (23 years * 365.25 days * 354 MB per day) comes to just under 3 TB.
With compression, that could be as little as 1.3 TB, though you'd have to pay the price for unpacking: roughly 30 seconds per file (a sketch of inflating on demand follows the listing below).
18/06/2010 14:28 237,173,932 2010.168.bin.gz
18/06/2010 14:22 175,868,626 2010.169.bin.bz2
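If you went the compressed route, one option would be to inflate a day's file the first time it is queried and keep the inflated copy around for later queries against that day. A minimal sketch, not part of the timings above and making assumptions about layout (the .gz sitting alongside where the .bin will be written); it uses the core IO::Uncompress::Gunzip module:

#! perl -slw
## Sketch only: inflate a year/day's .gz on first use, then query the
## cached .bin with the same seek/read code as the demo below.
use strict;
use IO::Uncompress::Gunzip qw[ gunzip $GunzipError ];

my( $year, $day ) = @ARGV;
my $bin = "tmp/$year.$day.bin";

unless( -e $bin ) {
    ## Pay the ~30 second unpacking cost only the first time this
    ## year/day is hit; later queries go straight to the cached .bin.
    gunzip "$bin.gz" => $bin, BinModeOut => 1
        or die "gunzip '$bin.gz' failed: $GunzipError";
}

## ... then open $bin and seek/read exactly as in the demo code below.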
But the main point of partitioning your dataset this way is that as soon as you specify the year/day, you have reduced the search space to a single ~354 MB file out of roughly 8,400 (23 years * 365.25 days), around 1/80th of 1% of the total. And there is no searching of indexes involved in the rest of the query: just a simple calculation and a direct seek.
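To make that concrete: with the demo's constants (4587 data points per row, 28 bytes per data point), the code locates the window starting at (x=2293, y=500) at byte offset (500 * 4587 + 2293) * 28 = 64,282,204, and then pulls each 10-point row of that window with a single 280-byte read.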
Anyway, 'tis your data and your employer's money :)
The (flawed) test demo code.
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

use constant {
    XMAX     => 4587,       ## data points per image row
    YMAX     => 2889,       ## rows per image
    REC_SIZE => 7 * 4,      ## 7 x 32-bit unsigned values per data point
};

## Ranges are given as START:EXTENT, e.g. 2293:9 means x = 2293 .. 2302
my( $year, $day, $xRange, $yRange ) = @ARGV;
my( $xStart, $xEnd ) = split ':', $xRange;
$xEnd += $xStart;
my( $yStart, $yEnd ) = split ':', $yRange;
$yEnd += $yStart;

my $start = time;

open BIN, '<:raw', "tmp/$year.$day.bin" or die $!;

## Bytes per row of the requested window
my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE;

for my $y ( $yStart .. $yEnd ) {
    ## Offset of the first wanted record in this row:
    ## whole rows above the window plus the records to its left.
    my $pos = ( $y * XMAX + $xStart ) * REC_SIZE;
    sysseek BIN, $pos, 0 or die $!;
    my $read = sysread( BIN, my $rec, $xLen );
    die $! unless $read and $read == $xLen;

    ## Split the row buffer into fixed 28-byte records
    ## ('a', not 'A', so trailing null bytes are preserved)
    my @recs = unpack '(a28)*', $rec;

    for my $x ( $xStart .. $xEnd ) {
        my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $xStart ];
        printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n",
            $year, $day, $a, $b, $c, $d, $e, $f, $g;
    }
}
close BIN;

printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start;