Using a database (whether an RDBMS or something else) won't help you either save disk space or improve performance.
In most DBs, each BLOB will be stored in one of two ways. Either as a separate file within the host filesystem: a million files, a million cluster-size round-ups, no savings.
Or as a fixed-size chunk (the maximum size for that type of BLOB) within a larger file, which effectively makes the cluster size whatever the maximum size is for the largest file you expect to store.
Worse, retrieving those numbers by position will require a second field in each row to record the position within the file. That at least doubles the space requirement, and more if you actually make that position field an index to speed up access.
Building your own index is equally unlikely to help. It takes at least a 4-byte integer to index a 4-byte integer, plus some way of indicating which file each belongs to; with a million files, that's at least 20 bits per entry (2**20 is roughly a million). And you still have to store the data itself.
I would use a single file with a fixed-size chunk allocated to each file, and store it in a compressing filesystem (or a sparse filesystem if you have one available).
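In Perl, the idea boils down to something like this minimal sketch (the 4 KB chunk size and filename are placeholder assumptions): logical file n owns bytes n*CHUNK through (n+1)*CHUNK-1, so any read or write is a single seek plus I/O.

```perl
use strict;
use warnings;

# Assumed layout: logical file $n owns bytes $n*CHUNK .. ($n+1)*CHUNK - 1.
use constant CHUNK => 4096;    # fixed-size slot per logical "file"

# Assumes bigfile.dat already exists (see the null-filling sketch below).
open my $fh, '+<:raw', 'bigfile.dat' or die "open: $!";

# Read the whole 4 KB slot for logical file $n.
sub read_slot {
    my ($n) = @_;
    seek $fh, $n * CHUNK, 0 or die "seek: $!";
    read $fh, my $buf, CHUNK;
    return $buf;
}

# Overwrite the slot for logical file $n (data is null-padded to CHUNK).
sub write_slot {
    my ($n, $data) = @_;
    seek $fh, $n * CHUNK, 0 or die "seek: $!";
    print {$fh} pack "a" . CHUNK, $data;
}

# Example: append one 4-byte integer to slot 12345, assuming we already
# know that 40 integers are stored there.
my $used = 40;
seek $fh, 12345 * CHUNK + $used * 4, 0 or die "seek: $!";
print {$fh} pack 'L', 42;
```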
I just wrote 1,000,000 x 4,096-byte records, each containing a random number (0..1023) of random integers. The notionally 3.81 GB (4,096,000,000-byte) file actually occupies 2.42 GB of disc space. So even though, on average, half of every 'file' is empty, the compression compensates.
It runs somewhat more slowly than an uncompressed file, for both the initial creation (I preallocated contiguous space) and random access, but not by much, thanks to filesystem buffering. In any case, it will be considerably quicker than access via an RDBMS.
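Generating a test file like that takes only a few lines; here's a rough sketch of the sort of thing (not the exact script, and the filename is arbitrary):

```perl
use strict;
use warnings;

# Sketch: 1_000_000 records of 4096 bytes, each holding a random number
# (0..1023) of random 32-bit integers, null-padded out to 4096 bytes.
use constant CHUNK   => 4096;
use constant RECORDS => 1_000_000;

open my $fh, '>:raw', 'test.dat' or die "open: $!";

for my $rec ( 0 .. RECORDS - 1 ) {
    my $count = int rand 1024;                       # 0 .. 1023 integers
    my @ints  = map { int rand 0xFFFFFFFF } 1 .. $count;
    print {$fh} pack "a" . CHUNK, pack 'L*', @ints;  # pad record to 4 KB
}
close $fh or die "close: $!";
```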
Even if your files can vary widely in used size, nulling the whole file before you start will allow the compression mechanism to reduce the 'wasted' space to a minimum. A 10 GB file containing only nulls requires less than 40 MB to store.
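Null-filling is trivial; a sketch (block size and filename are arbitrary choices), with the sparse-filesystem alternative shown as a comment:

```perl
use strict;
use warnings;

# Sketch: null-fill a file up front so a compressing filesystem stores the
# unused space almost for free.
use constant SIZE  => 10 * 1024**3;    # 10 GB total
use constant BLOCK => 1024**2;         # write in 1 MB lumps

open my $fh, '>:raw', 'preallocated.dat' or die "open: $!";
my $nulls = "\0" x BLOCK;
print {$fh} $nulls for 1 .. SIZE / BLOCK;
close $fh or die "close: $!";

# On a sparse filesystem, simply extending the file does the same job
# without writing anything:
#   open my $sfh, '>:raw', 'sparse.dat' or die "open: $!";
#   truncate $sfh, SIZE or die "truncate: $!";
```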
The best bit is that using a single file saves a million directory entries in the filesystem, and saves having to juggle a million filehandles with their associated system buffers and data structures in RAM. A nice saving. You will have to remember the 'append point' for each of the files, but that is just a million 4- or 8-byte numbers: a single file of 4 or 8 MB.
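Keeping that append-point table is equally simple; a sketch, assuming 32-bit counts and a side file (here called offsets.idx):

```perl
use strict;
use warnings;

# Sketch: keep the 'append point' (number of integers already stored) for
# each of the million logical files as one packed 32-bit number apiece in
# a small side file. Names and sizes here are assumptions.
use constant RECORDS => 1_000_000;

# Load the whole 4 MB table into an array.
open my $idx, '+<:raw', 'offsets.idx' or die "open: $!";
read $idx, my $raw, RECORDS * 4;
my @used = unpack 'L*', $raw;

# After appending an integer to logical file $n, bump its count and
# write just that one 4-byte entry back.
sub bump {
    my ($n) = @_;
    $used[$n]++;
    seek $idx, $n * 4, 0 or die "seek: $!";
    print {$idx} pack 'L', $used[$n];
}
```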