in reply to In-place sort with order assignment

Could I perchance repeat my suggestion of, “do it the COBOL way?”

External disk-based sorting is easily done in Perl, can accommodate millions of records, and runs with counter-intuitive speed.   Once this has been done to any data set, you know two things:

  1. All records having any particular key value are adjacent.
  2. Any time you encounter a gap in the key-sequence, you know that nothing’s in it, and you know exactly how big the gap is.   All without “searching.”

Therefore, work from a sorted file, and sort your input file(s) into the same sequence.   Having done so, you can process the two files together, sequentially, without “searching.”
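For illustration, here is a minimal sketch of that two-file sequential match in Perl; the file names, and the assumption that the key is the first whitespace-delimited field of each line, are mine rather than part of the suggestion:

  use strict;
  use warnings;

  # Both files are assumed to be pre-sorted on the same key.
  open my $master, '<', 'master.sorted' or die $!;   # hypothetical file names
  open my $input,  '<', 'input.sorted'  or die $!;

  my $m = <$master>;
  my $i = <$input>;

  while ( defined $m and defined $i ) {
      my ($mkey) = split ' ', $m;
      my ($ikey) = split ' ', $i;

      if    ( $mkey lt $ikey ) { $m = <$master> }   # master-only key: skip forward
      elsif ( $mkey gt $ikey ) { $i = <$input>  }   # input key absent from master: a "gap"
      else {                                        # keys match: process the pair
          print "MATCH: $mkey\n";
          $m = <$master>;
          $i = <$input>;
      }
  }

Each file is read exactly once, and every “gap” is detected simply by comparing the two current keys.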

In the early days of movies, anytime Hollywood wanted to film a real digital computer, the operators would start a tape-sort.   All those spinning tapes made wonderful footage.   You may or may not have noticed that they never slowed down or ran backwards.   They processed very large amounts of data, very efficiently, doing so without large amounts of RAM or random-access storage, neither one of which was available at the time.   You can expect “more than hundred-fold increases” in efficiency for many algorithms ... including the time spent doing the sorts, which is “far less time than you think.”

Never forget that “memory is a disk-file, too.”

“Don’t diddle the code to make it faster:   choose a better algorithm.”
-- Kernighan & Plaugher; The Elements of Programming Style.

Re^2: In-place sort with order assignment
by BrowserUk (Patriarch) on Sep 21, 2010 at 15:07 UTC
    Could I perchance repeat my suggestion of, “do it the COBOL way?”

    No. I read you the first time.

    If you read the rest of the thread, you'll see that option has already been tested and comprehensively dismissed as a possibility.

    As for all the "old-timer" logic: been there, done that. (Don't ever wanna go back :).

    In the mid-80's I sped up the sorting of 6 million University entrance examination results on twinned PDP-11/24s from 10 days to just over a day (by reversing the order of the joins involved).

    Today, I've sorted 60% more records, in perl, in 108 seconds.

    I can't start from a sorted file, because the input file needs to be pre-processed to extract the 1 billion words from the 100 million lines. Those 1 billion words need to be uniq'd, to reduce their volume by roughly 3 orders of magnitude, before sorting.
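    Roughly, that pre-processing looks like the sketch below, assuming the unique words fit comfortably in memory (the file name is a placeholder, and the split pattern is the test pattern mentioned further down):

        use strict;
        use warnings;

        my %seen;                                    # hash does the uniq'ing
        open my $in, '<', 'corpus.txt' or die $!;    # hypothetical file name

        while ( my $line = <$in> ) {
            # Extract candidate "words"; the pattern should really be configurable.
            $seen{ $_ } = undef for grep { length } split /[^A-Za-z0-9.'+-]+/, $line;
        }

        # Only the ~1e6 unique words ever reach the sort.
        my @sorted = sort keys %seen;
        printf "%d unique words\n", scalar @sorted;

    The hash does the uniq'ing as a side effect of insertion, so the expensive sort only ever sees the ~1 million unique keys.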

    I could (externally) sort the 1 billion words and then uniq them, but that means O(N log N) where N is 1 billion, rather than O(N log N) where N is 1 million. Now, assuming that each record takes 0.0000113 seconds to sort--that's based upon my Perl sort test; no external sort is going to achieve that--the trade-off of sorting before uniq'ing versus after is:

    1. 1e9 * log10( 1e9 ) * 0.0000113 seconds = 28.25 hours.
    2. 1e6 * log10( 1e6 ) * 0.0000113 seconds = 1 minute 7.8 seconds.

    Now, those per-item timings are based upon an in-memory sort. To get a true picture we'd need figures from an external sort. So, I sorted 1 million items and the per-item time was 0.00001535 seconds; then I sorted 10 million items and the per-item time was 0.00002198 seconds. That's a 43% increase in time per item, for ten times the data.

    So for 100 million, say 0.00003143 seconds per item; and for 1 billion, say 0.00004494. Which means externally sorting 1 billion items is likely to be closer to 5 days than 4. I hope you now see that uniq'ing before the sort is imperative.
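    For anyone who wants to check the arithmetic, here is a quick snippet reproducing those estimates; note that the figures only come out as stated if log is taken to base 10 (the per-item constants are the measured values quoted above, everything else is plain arithmetic):

        use strict;
        use warnings;

        # n items at a measured per-item cost, assuming N * log10(N) scaling.
        sub estimate {
            my ( $n, $per_item ) = @_;
            return $n * ( log( $n ) / log( 10 ) ) * $per_item;    # seconds
        }

        printf "1e9 \@ 0.0000113 s/item : %.2f hours\n",   estimate( 1e9, 0.0000113  ) / 3600;
        printf "1e6 \@ 0.0000113 s/item : %.1f seconds\n", estimate( 1e6, 0.0000113  );
        printf "1e9 \@ 0.00004494 s/item: %.5f days\n",    estimate( 1e9, 0.00004494 ) / 86400;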

    Just as not everything new is better than the old, not everything old is better than the new.

    It's simple physics. If you squeeze a large volume of data (~10GB) through a small hole (~1MB of RAM, as used by GNU sort), you have to shuffle it on and off disc many times. "Memory may be a disc too", but it's bloody fast compared to real disc. Conversely, disc makes bloody lousy memory.
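    To put a rough number on that shuffling, the sketch below assumes ~1MB initial runs and a 16-way merge (which, if memory serves, is GNU sort's default fan-in); treat the exact figures as illustrative only:

        use strict;
        use warnings;
        use POSIX 'ceil';

        my $data_mb   = 10_000;   # ~10GB of data
        my $buffer_mb = 1;        # ~1MB working buffer
        my $fan_in    = 16;       # assumed merge width

        my $runs   = ceil( $data_mb / $buffer_mb );          # initial sorted runs
        my $passes = ceil( log( $runs ) / log( $fan_in ) );  # merge passes over the data

        # Run creation plus each merge pass reads and writes the full data set.
        printf "%d runs, %d merge passes: %dMB read/written %d times\n",
               $runs, $passes, $data_mb, $passes + 1;

    With those assumptions that's five full trips over the 10GB, on and off disc.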

    “Don’t diddle the code to make it faster: choose a better algorithm.”

    Uniq'ing then sorting in memory: 108 seconds.

    Externally sorting first and then uniq'ing: 4.68125 days.

    Which do you see as the better algorithm?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      As I read through this post, a couple of my own posts popped into my head (What Is A Word? and How many words does it take?). When you say "word" and "line", do you mean English-ish text? I would be pretty surprised to see a dictionary of 1+ million real words, but I guess it is possible. If this process has to be repeated multiple times, and if most, if not all, words from one run to the next will be seen again, then perhaps there is an even faster way of doing this (using a pre-built dictionary).

      Cheers - L~R

        For "words" read 'searchable terms'.

        Which would include strings of digits, proper nouns, and, in a generic solution, any language.

        For testing, I'm currently using split /[^A-Za-z0-9.'+-]+/, $line; but ultimately that should be configurable.
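        One way that could be made configurable (just a sketch; where the pattern comes from is up to you) is to accept a user-supplied regex and fall back to the default above:

            use strict;
            use warnings;

            # Hypothetical: pattern supplied on the command line (it could equally
            # come from a config file); defaults to the test pattern above.
            my $split_re = @ARGV ? qr/$ARGV[0]/ : qr/[^A-Za-z0-9.'+-]+/;

            while ( my $line = <STDIN> ) {
                print "$_\n" for grep { length } split $split_re, $line;
            }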


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.