in reply to Re^4: Strategy for managing a very large database with Perl (Video)
in thread Strategy for managing a very large database with Perl

The dummied-up data is just a bunch of random 32-bit integers: 7 per X,Y point in your 4587 x 2889 2D dataspace.

open my $BIN, '>:raw', 'tmp/2010.168.bin' or die $!;
for my $y ( 0 .. 2888 ) {                               # 2889 rows
    print "$y\n";                                       # progress
    for my $x ( 0 .. 4586 ) {                           # 4587 points per row
        print {$BIN} pack 'V*', map int( rand 2**32 ), 1 .. 7;
    }
}
close $BIN;

You've never specified the actual range of those "6 or 7 variables", so I just went with 7 x 32-bits. The size of the datafiles will vary accordingly.
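(With that layout, each file comes to 4587 x 2889 points x 28 bytes ≈ 371 MB.)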

But they could just as easily be any other datatype you care to specify. You could read those same bytes back as 32-bit floats and it would make no difference to the size of the files or the performance. (Although there would probably be quite a few NaNs and +/-INFs in there.) And if the size of the data types doubles, so does the size of the file; but the performance remains constant.
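For instance (a minimal sketch, not from the original post; it assumes the file written above), the same 28-byte record decodes either way just by switching the unpack template:

open my $fh, '<:raw', 'tmp/2010.168.bin' or die $!;
read( $fh, my $rec, 28 ) == 28 or die 'short read';
my @ints   = unpack 'V7',  $rec;    # 7 unsigned 32-bit little-endian integers
my @floats = unpack 'f<7', $rec;    # the same 28 bytes viewed as 32-bit floats
close $fh;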

"Space is not a constraining factor, for now."

There is another reason beyond the cost of disks for keeping data compact: reading more bytes simply takes more time, especially if the data has to come from remote storage.

But the real benefit of binary storage is fixed-size records. That means you can encode the spatial information directly in the file position, which reduces the data stored: no x,y in every one of your 100 billion records (1 TB saved right there). That's 1 TB you'll never have to read!

It also means that you don't need to build and maintain indexes on fields X & Y. So instead of:

Seek disk for index; load index; search index; locate record pointer; seek to record; read record;

it's just:

seek to ( y * xMax + x ) * recSize; read recSize bytes;
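In Perl, that lookup needs nothing more than seek and read. A minimal sketch against the file generated above (the read_point helper is just illustrative, not from the original post):

my $xMax    = 4587;     # points per row
my $recSize = 7 * 4;    # 7 x 32-bit values = 28 bytes per record

# Fetch the 7 values for one (x,y) point by computing its file offset.
sub read_point {
    my( $fh, $x, $y ) = @_;
    seek $fh, ( $y * $xMax + $x ) * $recSize, 0 or die $!;
    read( $fh, my $rec, $recSize ) == $recSize or die 'short read';
    return unpack 'V7', $rec;
}

open my $fh, '<:raw', 'tmp/2010.168.bin' or die $!;
print join( ' ', read_point( $fh, 1234, 567 ) ), "\n";
close $fh;

No index to build, store, or search: the arithmetic is the index.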

You save the space of the index; the time spent processing it; and (at minimum) 2 extra seeks. It might not sound like much, but if you (your program) are remote from the DB engine (whose attention you are competing for with other applications), and it is remote from the actual storage holding the data, it adds up.

Not to mention all the SQL parsing, record locking, housekeeping, etc. that a general-purpose RDBMS has to deal with. RDBMSs are clever; they are not magic. If you know your data, and you know what you need to do with it, it is nearly always possible to knock an RDBMS into a cocked hat on performance, simply because you don't have to be "general purpose".


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.