in reply to Netflix (or on handling large amounts of data efficiently in perl)

I had a play with the data back when the NetFlix challenge was first mentioned here.

Simply stated, the challenge dataset requires a minimum of: 17,770 movies * 480,189 users * 3 bits per rating (the smallest n for which 2**n >= 5) / 8 bits per byte = 2.98GB of data.

That is more memory than most (any?) 32-bit systems can accommodate.
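For concreteness, here is a minimal sketch of that 3-bit packing. Perl's vec() only handles power-of-two bit widths, so the fields have to be packed by hand; the sub names and buffer layout here are mine for illustration, not code from the challenge attempt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The arithmetic from above: 3 bits per (movie, user) cell.
my $movies = 17_770;
my $users  = 480_189;
printf "%.2fGB\n", $movies * $users * 3 / 8 / 2**30;    # prints 2.98GB

# Fixed-size buffer: 3 bits per cell, plus one spare byte so a field
# straddling the last byte can still be handled as a 16-bit word.
sub new_buffer { "\0" x ( int( ( $_[0] * 3 + 7 ) / 8 ) + 1 ) }

sub set_rating {
    my ( $buf, $i, $r ) = @_;    # $buf is a scalar ref
    my $bit = $i * 3;
    my ( $byte, $off ) = ( $bit >> 3, $bit & 7 );
    my $w = unpack 'v', substr $$buf, $byte, 2;    # little-endian 16 bits
    $w = ( $w & ~( 0x7 << $off ) ) | ( ( $r & 0x7 ) << $off );
    substr( $$buf, $byte, 2 ) = pack 'v', $w;
}

sub get_rating {
    my ( $buf, $i ) = @_;
    my $bit = $i * 3;
    return ( unpack( 'v', substr $$buf, $bit >> 3, 2 ) >> ( $bit & 7 ) ) & 0x7;
}
```

Reading two bytes at a time is what makes the odd 3-bit width workable: a field can cross a byte boundary, but never a 16-bit window.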

The alternatives--performing complex joins (directly or indirectly) against disk-based storage; I tried an RDBMS (pgsql), Berkeley DB, SQLite3, and custom memory-mapped files using Inline::C--all rendered the experimentation process so laborious that a single run would tie up my 2.6GHz single-CPU machine for anywhere from 33 hours (mmap) and 40 hours (Berkeley) to 100+ hours (MySQL).
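To illustrate the access pattern those disk-based schemes rely on (this is not the Inline::C code referred to above, just a core-Perl sketch): seek to the byte holding rating i in the packed file and read only the two bytes you need, rather than loading 2.98GB into RAM.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: fetch one 3-bit rating from a packed file on disk.
# Opening the file per call is wasteful; real code would keep the handle
# (or an mmap'd view) open across calls.
sub read_rating {
    my ( $path, $i ) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $bit = $i * 3;
    seek $fh, $bit >> 3, 0 or die "seek: $!";
    read $fh, my $two, 2;
    $two .= "\0" x ( 2 - length $two );    # pad if we hit EOF
    return ( unpack( 'v', $two ) >> ( $bit & 7 ) ) & 0x7;
}
```

The per-lookup cost is one seek plus a two-byte read, which is why the OS page cache (or an explicit mmap) dominates the observed running time.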

I abandoned my attempts because I concluded that life was too short to wait 2 to 4 days to discover whether the latest tweak to my algorithm was an improvement or not. Without access to either a 64-bit machine with at least 8GB of RAM, or a cluster of 4 x 32-bit machines and suitable operating software, the cycle time was simply too long to sustain my interest.

The funny thing about the sample dataset is that there is (statistically) no reason for it to be so large. A much smaller dataset would have been equally valid as a means of testing algorithms. I tried producing a representative sub-sample of the supplied sample to speed up the development cycle, but sub-sampling a sample is a notoriously hit & miss affair--tending as it does to emphasise any bias in the original sample. My attempts failed to produce a subsample whose evaluation was representative of the supplied sample. It's almost as if the supplied sample was chosen to preclude serious participation by the lone developer.
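For concreteness, one naive sub-sampling strategy (the 10% figure and the triple layout are illustrative, not what I actually ran): keep a random fraction of users and every rating they made. This is exactly the kind of subset that amplifies bias--movies with few ratings can vanish from it entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep ~10% of the 480,189 users and all of their ratings.
srand 42;                                   # repeatable run to run
my %keep = map { $_ => 1 } grep { rand() < 0.10 } 1 .. 480_189;

# Filter pass over hypothetical "user,movie,rating" triples:
# while (<$in>) { my ($u) = split /,/; print $out $_ if $keep{$u}; }
printf "kept %d of %d users\n", scalar keys %keep, 480_189;
```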


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Netflix (or on handling large amounts of data efficiently in perl)
by Garp (Acolyte) on Dec 30, 2008 at 04:37 UTC

    Actually, if you look at icefox's framework it will do its basic average run in:

    real 0m1.348s
    user 0m0.100s
    sys 0m0.400s

    That's running on an Ubuntu server inside a VirtualBox VM, so just a single core of a T2060 (1.6GHz Core 2 Duo).

    Some of the more complicated averages naturally take more time, but it's still in the realm of reasonable for home users.

      Sorry. I couldn't be bothered to sift through the 500+ links that googling "icefox netflix framework" threw up.

      If the code is available somewhere, why not post a direct link?

