I had a play with the data back when the NetFlix challenge was first mentioned here.

Simply stated, held as a dense movie-by-customer matrix, the challenge dataset requires a minimum of: 17,770 movies * 480,189 customers * 3 bits per rating (the smallest n where 2**n >= 5) / 8 = 2.98GB of data.

Which is greater than most (any?) 32-bit system can accommodate in memory.
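For anyone wanting to check the arithmetic above, here is a minimal sketch. The sub names `bits_per_rating` and `dense_bytes` are made up for illustration, and the figure assumes a dense matrix with one slot per (movie, customer) pair, with GB taken as 2**30 bytes.

```perl
use strict;
use warnings;

# Smallest n such that 2**n >= $levels (5 rating levels need 3 bits).
sub bits_per_rating {
    my ($levels) = @_;
    my $bits = 0;
    $bits++ while 2**$bits < $levels;
    return $bits;
}

# Bytes needed for a dense movies x customers matrix at $bits per cell.
# Perl promotes the product to a float, so this also works on a 32-bit perl.
sub dense_bytes {
    my ( $movies, $customers, $bits ) = @_;
    return $movies * $customers * $bits / 8;
}

my $bits  = bits_per_rating(5);
my $bytes = dense_bytes( 17_770, 480_189, $bits );
printf "%d bits per rating, %.2f GB dense\n", $bits, $bytes / 2**30;
```

Note that Perl's built-in `vec` only accepts element widths that are powers of two, so packing with `vec` would mean 4 bits per rating rather than 3, which pushes the same matrix to roughly 3.97GB.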

The alternatives all meant performing complex joins (directly or indirectly) against disk-based storage. I tried an RDBMS (pgsql), Berkeley DB, SQLite3, and custom memory-mapped files using Inline::C. All of them rendered the experimentation process so laborious that a single run would tie up my 2.6GHz single-CPU machine for 33 (MMAP), 40 (Berkeley) to 100+ (MySQL) hours.
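To give a feel for what disk-based access to packed ratings involves, here is a minimal pure-Perl sketch. It is not the Inline::C memory-mapped code mentioned above (that code is not shown in the post). It keeps one 4-bit slot per cell in a flat file and does a seek plus a one-byte read for every lookup. The toy dimensions and the sub names `make_store`, `put_rating` and `get_rating` are assumptions for illustration only.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Toy dimensions (assumption); the real matrix is 17,770 x 480,189.
my ( $MOVIES, $CUSTOMERS ) = ( 4, 6 );

# Create a zero-filled temp file with one 4-bit slot per (movie, customer).
sub make_store {
    my ( $fh, $path ) = tempfile( UNLINK => 1 );
    binmode $fh;
    my $nbytes = int( ( $MOVIES * $CUSTOMERS + 1 ) / 2 );
    print {$fh} "\0" x $nbytes or die "write failed: $!";
    return $fh;
}

# Read the single byte that holds the slot for cell number $idx.
sub _fetch_byte {
    my ( $fh, $idx ) = @_;
    seek( $fh, int( $idx / 2 ), SEEK_SET ) or die "seek failed: $!";
    my $byte = '';
    my $n = read( $fh, $byte, 1 );
    die "short read" unless defined $n && $n == 1;
    return $byte;
}

# Overwrite one nybble, leaving its neighbour in the same byte untouched.
sub put_rating {
    my ( $fh, $movie, $cust, $rating ) = @_;
    my $idx  = $movie * $CUSTOMERS + $cust;
    my $byte = _fetch_byte( $fh, $idx );
    vec( $byte, $idx % 2, 4 ) = $rating;
    seek( $fh, int( $idx / 2 ), SEEK_SET ) or die "seek failed: $!";
    print {$fh} $byte or die "write failed: $!";
    return;
}

# Return the rating stored for a cell (0 means "no rating" in this sketch).
sub get_rating {
    my ( $fh, $movie, $cust ) = @_;
    my $idx = $movie * $CUSTOMERS + $cust;
    return vec( _fetch_byte( $fh, $idx ), $idx % 2, 4 );
}
```

Every lookup here costs a seek and a read, which hints at why runs that touch billions of cells on disk stretch into days rather than minutes.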

I abandoned my attempts because I concluded that life was too short to wait 2 to 4 days to discover whether the latest tweak to my algorithm was an improvement or not. Without access to either a 64-bit machine with at least 8GB of RAM, or a cluster of 4x32-bit machines and suitable operating software, the cycle time was simply too long to sustain my interest.

The funny thing about the sample dataset is that there is (statistically) no reason for it to be so large. A much smaller dataset would have been equally valid as a means of evaluating algorithms. I tried producing a representative sub-sample of the supplied sample, to speed up the development cycle, but sub-sampling samples is a notoriously hit-and-miss affair--tending as it does to emphasise any bias in the original sample. My attempts failed to produce a sub-sample on which algorithms evaluated the same way as they did on the supplied sample. It's almost as if the supplied sample was chosen to preclude serious participation by the lone developer.
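As an illustration of one simple sub-sampling approach, here is a sketch that keeps every rating for a random subset of customers and compares mean ratings as a crude representativeness check. The post does not say which method was actually tried. The `[movie_id, customer_id, rating]` triple layout and the sub names are assumptions for illustration.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

# Keep all ratings belonging to a random $fraction of the customers.
# $ratings is an array ref of [movie_id, customer_id, rating] triples.
sub subsample_by_customer {
    my ( $ratings, $fraction ) = @_;
    my %seen;
    my @customers = grep { !$seen{$_}++ } map { $_->[1] } @$ratings;
    my $keep_n    = int( @customers * $fraction + 0.5 );
    my @shuffled  = shuffle @customers;
    my %keep      = map { $_ => 1 } @shuffled[ 0 .. $keep_n - 1 ];
    return [ grep { $keep{ $_->[1] } } @$ratings ];
}

# Mean rating over a set of triples; a very crude similarity check.
sub mean_rating {
    my ($ratings) = @_;
    return 0 unless @$ratings;
    return sum( map { $_->[2] } @$ratings ) / @$ratings;
}
```

Comparing `mean_rating` on the full set and on the sub-sample only catches gross differences. A matching mean says nothing about whether an algorithm will score the same on both, which is the harder property to preserve.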


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: Netflix (or on handling large amounts of data efficiently in perl) by BrowserUk
in thread Netflix (or on handling large amounts of data efficiently in perl) by Garp
