in reply to Netflix (or on handling large amounts of data efficiently in perl)
I had a play with the data back when the Netflix challenge was first mentioned here.
Simply stated, holding the challenge dataset in memory requires a minimum of 17,770 movies * 480,189 customers * 3 bits per rating (the smallest n for which 2**n >= 5) / 8 bits per byte = 2.98GB of data.
Which is more than most (any?) 32-bit system can accommodate in memory.
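For anyone who wants to verify that figure, the arithmetic is only a few lines of Perl:

```perl
#!/usr/bin/perl
# Back-of-the-envelope check of the 2.98GB figure quoted above.
use strict;
use warnings;

my $movies    = 17_770;
my $customers = 480_189;
my $bits      = 3;        # smallest n with 2**n >= 5 possible rating values

my $bytes = $movies * $customers * $bits / 8;
printf "%.2f GB\n", $bytes / 2**30;    # prints 2.98 GB
```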
The alternatives--performing complex joins (directly or indirectly), or using disk-based storage (I tried an RDBMS (pgsql), BerkeleyDB, SQLite3, and custom memory-mapped files using Inline::C)--all rendered the experimentation process so laborious that each run would tie up my 2.6GHz single-CPU machine for anywhere from 33 (mmap) or 40 (Berkeley) to 100+ (MySQL) hours.
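To give a flavour of why the disk-based route is so slow, here is a minimal sketch (not the actual Inline::C memory-mapped code): it assumes a hypothetical pre-built flat file, ratings.dat, holding one byte per (movie, customer) cell (roughly 8GB on disk), and fetches a single rating with seek/read. Every lookup becomes a disk access, which is how run times balloon into days.

```perl
#!/usr/bin/perl
# Illustrative sketch only: a flat file with one byte per (movie, customer)
# cell, addressed with seek/read. The file name and layout are assumptions
# for the example, not the original post's implementation.
use strict;
use warnings;

my $MOVIES    = 17_770;           # matrix dimensions, for context
my $CUSTOMERS = 480_189;
my $FILE      = 'ratings.dat';    # hypothetical pre-built flat file

# Fetch the rating (0 = unrated, 1..5) for 0-based movie/customer indices.
# Offsets exceed 2**31, so the perl must be built with large-file support.
sub get_rating {
    my ( $fh, $movie, $customer ) = @_;
    my $offset = $movie * $CUSTOMERS + $customer;
    seek $fh, $offset, 0 or die "seek: $!";
    read $fh, my $byte, 1 or return 0;
    return ord $byte;
}

open my $fh, '<:raw', $FILE or die "open $FILE: $!";
printf "movie 0, customer 0 => rating %d\n", get_rating( $fh, 0, 0 );
close $fh;
```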
I abandoned my attempts because I concluded that life was too short to wait 2 to 4 days to discover whether the latest tweak to my algorithm was an improvement or not, and that without access to either a 64-bit machine with at least 8GB of RAM, or a cluster of 4 x 32-bit machines and suitable operating software, the cycle time was simply too long to sustain my interest.
The funny thing about the sample dataset is that there is (statistically) no reason for it to be so large. A much smaller dataset would have been equally valid as a means of evaluating algorithms. I tried producing a representative sub-sample of the supplied sample to speed up the development cycle, but sub-sampling a sample is a notoriously hit & miss affair--tending as it does to emphasise any bias in the original sample. My attempts failed to produce a sub-sample on which results were representative of those on the supplied sample. It's almost as if the supplied sample was chosen to preclude serious participation by the lone developer.
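For what it's worth, the kind of naive cut being described might look like the sketch below: keep a random fraction of customers and all of their ratings. The flat "movieID,customerID,rating" input format and the 10% figure are assumptions for illustration only; the trouble, as noted above, is that cuts like this inherit and amplify whatever bias is already present in the supplied sample.

```perl
#!/usr/bin/perl
# Hypothetical sketch: retain a random 10% of customers (and all of their
# ratings) from a generic "movieID,customerID,rating" file on STDIN/ARGV.
use strict;
use warnings;

my $KEEP = 0.10;    # fraction of customers to retain (assumed parameter)
my %keep;           # customerID => 0/1, decided once per customer

while ( <> ) {
    chomp;
    my ( $movie, $customer, $rating ) = split /,/;
    $keep{ $customer } = ( rand() < $KEEP ? 1 : 0 )
        unless exists $keep{ $customer };
    print "$_\n" if $keep{ $customer };
}
```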
Replies are listed 'Best First'.

Re^2: Netflix (or on handling large amounts of data efficiently in perl)
  by Garp (Acolyte) on Dec 30, 2008 at 04:37 UTC
    by BrowserUk (Patriarch) on Dec 30, 2008 at 04:44 UTC
      by Garp (Acolyte) on Dec 30, 2008 at 08:06 UTC