Update: Your "This algorithm" link target has changed (presumably following Consideration). The content it now links to is completely different from the previous content linked to. Ignore any comments I've made which no longer make any sense (I think I've striken most, if not all, of them). An Update in your post would have been useful!

"Will cause the entire file to be read through a tiny buffer."

The default read cache size is 2MiB. Is this what you're referring to as "a tiny buffer? It can be changed with the memory option: what size would you suggest?

"... will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records."

I suspect this is a reference to your "This algorithm" link: see the next point.

"This algorithm [Link removed - see update above] will fairly pick random lines from using a single pass over that file."

That links to "Perl Cookbook: Recipe 8.6. Picking a Random Line from a File". The OP wants to pick each line at random, not just one line at random.

[By the way, the URL associated with that link (i.e. http://docstore.mik.ua/orelly/perl/cookbook/ch08_07.htm) is questionable: it's O'Reilly material provided by someone else. I've seen arguments for and against this specific one, so I'm just pointing it out.]

I have a 10Gb test file (100,000,000 records of 100 bytes each) which I used to test my solution. Not unsurprisingly, this data bore no resemblance to the OP data so I added some additional code to create dummy chromosome and position values:

#my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[ +0, 1]; my ($chr, $pos) = (split '', $locations[$indexes[$rand_index]])[0, + 1]; $chr = 'chromosome' . ($_ % 4 + 1); $pos = $_ + 10;

I also changed the subset size from my original test value of 4 to the OP's example of 500.

With twice the expected file size and additional processing, this took almost exactly 40 minutes.

I tried another solution (without Tie::File) using tell to create the index and seek to locate the wanted random records: beyond this, the rest of the code was unchanged. This solution also took almost exactly 40 minutes.

While other solutions may be faster, I don't consider this to be "horribly slow" or exhibiting "abysmal" performance.

-- Ken


In reply to Re^3: Building a new file by filtering a randomized old file on two fields by kcott
in thread Building a new file by filtering a randomized old file on two fields by mbp

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.