in reply to Re^3: Strategy for managing a very large database with Perl (Video)
in thread Strategy for managing a very large database with Perl
BrowserUk> I dummied up two days worth of data files:

Could you please share your code to create the dummy data? One, I would get to see what your data look like, and two, I would learn a few tricks meself.
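(For anyone following along: this is not BrowserUk's generator, just a minimal sketch of how one might dummy up a single day's file, assuming fixed-size records of 4-byte native floats on a grid; the grid dimensions and file naming are invented for illustration.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed layout: one file per day, a rows x cols grid of 4-byte native floats.
my ( $rows, $cols ) = ( 1800, 3600 );                  # hypothetical grid size
my $file = sprintf '%04d-%03d.bin', 2010, 169;         # hypothetical naming: year-dayofyear

open my $fh, '>:raw', $file or die "open $file: $!";
for ( 1 .. $rows ) {
    # Write one row of random single-precision values at a time.
    print {$fh} pack 'f*', map { rand } 1 .. $cols;
}
close $fh or die "close $file: $!";
```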
BrowserUk> With compression, that could be as little as 1.3 TB. Though you'd have to pay the price for unpacking -- ~30 seconds per file.

Right. Good to keep in mind, but I'm not interested in paying the 30-second price to optimize for space. Space is not a constraining factor, for now.
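(For reference, the unpacking step being declined here would look something like the following sketch; the file names are invented, and IO::Uncompress::Gunzip is just one way to do it.)

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# The whole file must be expanded before any direct seek is possible;
# this full-file pass is where the ~30 seconds per file would go.
gunzip '2010-169.bin.gz' => '2010-169.bin'
    or die "gunzip failed: $GunzipError";
```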
BrowserUk> But the main point of partitioning your dataset this way is that you reduce the search space to 1/8th of 1% as soon as you specify the year/day. And there is no searching of indexes involved in the rest of the query. Just a simple calculation and a direct seek.

Of course, I have to add the cost of looking up the spatial extent, which I have to do via the Pg database first. But that gives me the area I want to pluck out of my image, and then I can work with it.
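To make the "simple calculation and a direct seek" concrete, here is a minimal sketch under the same assumed layout as above (fixed 4-byte float records; grid size and file naming are my own assumptions, not BrowserUk's):

```perl
use strict;
use warnings;

my ( $cols, $rec_len ) = ( 3600, 4 );    # assumed grid width and record size

# Fetch one cell for a given year/day-of-year and grid position:
# no index search, just an offset calculation and a direct seek.
sub read_cell {
    my ( $year, $doy, $row, $col ) = @_;
    my $file = sprintf '%04d-%03d.bin', $year, $doy;    # hypothetical naming
    open my $fh, '<:raw', $file or die "open $file: $!";
    my $offset = ( $row * $cols + $col ) * $rec_len;
    seek $fh, $offset, 0 or die "seek $file: $!";
    read( $fh, my $buf, $rec_len ) == $rec_len or die "short read on $file";
    close $fh;
    return unpack 'f', $buf;
}

print read_cell( 2010, 169, 900, 1800 ), "\n";
```

Pulling a whole spatial extent is then just one seek-and-read per row of the bounding box.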
BrowserUk> Anyway, t'is your data and your employer's money :)

True. This is research, so trying out different ways is a worthwhile exercise in itself. I don't get paid too much, so it is not a lot of money on my employer's part. ;-) Nevertheless, thanks much. This is fantastic. Always a pleasure to ask and learn different approaches to solving a problem.
Replies are listed 'Best First'.
Re^5: Strategy for managing a very large database with Perl (Video)
by BrowserUk (Patriarch) on Jun 18, 2010 at 15:45 UTC