comment on

Advice on your initial load. Loading that file naively means doing 100 million writes. Each write means seeking to read, reading, then seeking to write and writing So 200 million seeks on disk. If your disk spins at 6000 RPM it spins 100 times per second, and seek time averages out to the time for it to spin halfway. So you do 200 seeks per second. That takes a million seconds, which is about a week and a half.

Suppose each record takes 100 bytes. Then you have about 10 GB of data. Sorting with a merge-sort should take about 30 passes. Each pass involves reading and writing all of your data. So that's about 600 GB of data. Supposing that your disk has a sustained throughput of 60 MB/s, that should take 10,000 seconds, or about 3 hours. Plus CPU time. Minus your savings if your sort implementation is smart enough to do some passes in RAM to minimize writing to disk. But any way you cut it, far faster.

Therefore my standard suggestion for this kind of data volume is to use the Unix sort utility to sort your dataset, then make a BerkeleyDB btree and load that. (Btrees are stored in sorted order, which gets rid of your sustained throughput problems.) This will make your initial data load take just a few hours rather than a week and a half.

As a bonus, for large datasets if there is any locality of reference in your requests (usually there is), a btree makes better use of cached data in RAM than a hash can.

In reply to Re: BerkeleyDB and a very large file by tilly
in thread BerkeleyDB and a very large file by kahn_ray@yahoo.com

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.