Could I perchance repeat my suggestion of, “do it the COBOL way?”

No. I read you the first time.

If you read the rest of the thread, you'd see that option has already been tested and comprehensively dismissed as a possibility.

As for all the "old-timer" logic. Been there, done that. (Don't ever wanna go back:).

In the mid-80's I sped up the sorting of 6 million University entrance examination results on twinned PDP-11/24s from 10 days to just over a day (by reversing the order of the joins involved).

Today, I've sorted 60% more records, in perl, in 108 seconds.

I can't start work with a sorted file, because the input file needs to be pre-processed to extract the 1 billion words from the 100 million lines. Those 1 billion words need to be uniq'd to reduce their volume by roughly 3 orders of magnitude before sorting.

I could (externally) sort the 1 billion words; then uniq them, but that means O(N log N) where N is 1 billion, rather than O(N log N) where N = 1 million. Now assuming that each record takes 0.0000113 to sort--that's based upon my Perl sort test, no external sort is going to achieve that--the trade of sorting before uniq'ing versus after is:

  1. 1e9 * log 1e9 * 0.0000113 = 28.25 hours.
  2. 1e6 * log 1e6 * 0.0000113 = 1 minute 7.8 seconds.

Now, those per item timings are based upon an in-memory sort. To get a true picture we'd need to use figures from an external sort. So, I sorted 1 million items and the sort time was 0.00001535; then I sort 10 million items and the time was 0.00002198. That's a 43% increase in time per item, for ten times the data.

So for 100 million, say 0.00003143. And for 1 billion say: 0.00004494. Which means externally sorting 1 billion items is likely to be closer to 5 days than 4. I hope you now see that uniq'ing before the sort is imperative.

Just as not everything new is better than the old, vice versa is also true.

It's simple physics. If you squeeze a large volume of data (~10GB), through a small hole (~1MB ram as used by GNU sort), you have to shuffle it on and off disc many times. "memory may be a disc too", but it's bloody fast compared to real disc. Conversely, disc makes bloody lousy memory.

“Don’t diddle the code to make it faster: choose a better algorithm.”

Uniq'ing then sorting in memory: 108 seconds.

Pre-external sorting and then uniq'ing: 4.68125 days.

Which do you see as the better algorithm?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

In reply to Re^2: In-place sort with order assignment by BrowserUk
in thread In-place sort with order assignment by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.