in reply to Re^2: Can you improve upon my algorithm.
in thread Can you improve upon my algorithm.

The thing is that once you have the chunks sorted, the theory says that in practice you cannot do much better than an N-way merge.

An N-way merge uses almost no memory, just enough to avoid the thrashing caused by reading from N files in parallel (say, 10MB per way; for a 100GB file and 50 chunks that means about 10,000 seeks, approximately 200s on a modern disk, ignoring fragmentation and FS overhead).
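
Putting rough numbers on that, here is the back-of-the-envelope arithmetic; the 20ms-per-seek figure is an assumption for a typical HDD, not something measured:

# Rough cost of refilling 10MB read buffers while merging a 100GB dataset.
# The 20ms per seek is an assumed figure for a typical HDD.
my $dataset   = 100 * 1024**3;            # 100GB in bytes
my $read_size = 10 * 1024**2;             # 10MB per buffer refill
my $refills   = $dataset / $read_size;    # ~10,000 reads, i.e. ~10,000 seeks
my $seek_cost = $refills * 0.020;         # ~200 seconds spent seeking
printf "%d refills, ~%d seconds of seeking\n", $refills, $seek_cost;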

The computational complexity of the merge is O(M log N), where M is the dataset size (100GB). If you actually count the comparisons that have to be performed, it is c = M * log2(N). In your case N = 50, so log2(N) is about 6; allowing some slack, call it c = 8*M = 800GB of memory traffic. Current RAM bandwidth is around 10GB/s, but that is a peak figure; let's assume that in practice the memory bandwidth is 1GB/s. That means that those comparisons (the bottleneck there is the RAM) would take 800GB / 1GB/s = 800s. Call it 15 minutes!
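
To see where the log2(N) comparisons come from, here is a minimal sketch of the merge loop I mean. It is only a sketch: it assumes line-oriented records, a plain string comparison and chunk files named '01'..'50', and it ignores the 10MB read buffering, which only changes how the reads are issued.

use strict;
use warnings;

# The sorted chunk files to merge; the names are placeholders.
my @chunks = map { sprintf '%02d', $_ } 1 .. 50;

my @heap;                               # entries: [ current_record, file_handle ]

sub cmp_keys { $_[0][0] cmp $_[1][0] }  # whatever key comparison applies

sub heap_push {                         # append an entry and sift it up
    push @heap, $_[0];
    my $i = $#heap;
    while ( $i > 0 ) {
        my $p = int( ( $i - 1 ) / 2 );
        last if cmp_keys( $heap[$p], $heap[$i] ) <= 0;
        @heap[ $i, $p ] = @heap[ $p, $i ];
        $i = $p;
    }
}

sub heap_sift_down {                    # restore order after replacing the root
    my $i = 0;
    my $n = @heap;
    while ( 1 ) {
        my $c = 2 * $i + 1;             # left child
        last if $c >= $n;
        $c++ if $c + 1 < $n and cmp_keys( $heap[ $c + 1 ], $heap[$c] ) < 0;
        last if cmp_keys( $heap[$i], $heap[$c] ) <= 0;
        @heap[ $i, $c ] = @heap[ $c, $i ];
        $i = $c;
    }
}

# Prime the heap with the first record of every chunk.
for my $name (@chunks) {
    open my $fh, '<', $name or die "$name: $!";
    defined( my $rec = <$fh> ) or next;
    heap_push( [ $rec, $fh ] );
}

# Emit the smallest record, refill that slot from the same chunk, repeat.
# Each emitted record costs about log2(N) comparisons in heap_sift_down.
while (@heap) {
    my ( $rec, $fh ) = @{ $heap[0] };
    print $rec;                         # merged output goes to STDOUT
    if ( defined( my $next = <$fh> ) ) {
        $heap[0] = [ $next, $fh ];      # replace the root in place
    }
    else {
        my $last = pop @heap;           # this chunk is exhausted
        $heap[0] = $last if @heap;
    }
    heap_sift_down() if @heap;
}

The only per-record work there is one readline plus the sift-down, and the sift-down is where the roughly log2(50), about 6, comparisons per record come from.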

So, where does the time go? In reading the data from disk into memory and writing it back. With a good HDD (100MB/s), reading + writing 100GB takes 2 * 100GB / 100MB/s = 2000s.

In summary, merging the 50 blocks should take around 3000s: roughly an hour. In theory!!!


Replies are listed 'Best First'.
Re^4: Can you improve upon my algorithm.
by BrowserUk (Patriarch) on Mar 13, 2015 at 04:40 UTC
    In theory!!!!

    In practice!

    My E: drive is brand new, barely used and connected to a SATA-3 i/f.

    wc -c does almost nothing but increment a variable, and it reads a 34GB file in 192 secs, at a rate of 171MB/s:

    E:\>dir 30GB.dat
    28/02/2015  08:21    34,560,000,000 30GB.dat

    [ 0:08:11.82] E:\>wc -c 30GB.dat
    34560000000 30GB.dat

    [ 0:11:23.17] E:\>perl -E"say 34560000000 / 192 / 1024**2;"
    171.661376953125

    I created 50 1GB files:

    E:\>md junk
    E:\>cd Junk
    E:\junk>perl -le"open O, '>00'; printf O qq[%01023d\n], $_ for 1..1024**2; close O; system qq[copy 00 $_] for ('01'..'50')"

    13/03/2015  00:23     1,074,790,400 01
    13/03/2015  00:23     1,074,790,400 02
    13/03/2015  00:23     1,074,790,400 03
    13/03/2015  00:23     1,074,790,400 04
    13/03/2015  00:23     1,074,790,400 05
    13/03/2015  00:23     1,074,790,400 06
    13/03/2015  00:23     1,074,790,400 07
    13/03/2015  00:23     1,074,790,400 08
    13/03/2015  00:23     1,074,790,400 09
    13/03/2015  00:23     1,074,790,400 10
    13/03/2015  00:23     1,074,790,400 11
    13/03/2015  00:23     1,074,790,400 12
    13/03/2015  00:23     1,074,790,400 13
    13/03/2015  00:23     1,074,790,400 14
    13/03/2015  00:23     1,074,790,400 15
    13/03/2015  00:23     1,074,790,400 16
    13/03/2015  00:23     1,074,790,400 17
    13/03/2015  00:23     1,074,790,400 18
    13/03/2015  00:23     1,074,790,400 19
    13/03/2015  00:23     1,074,790,400 20
    13/03/2015  00:23     1,074,790,400 21
    13/03/2015  00:23     1,074,790,400 22
    13/03/2015  00:23     1,074,790,400 23
    13/03/2015  00:23     1,074,790,400 24
    13/03/2015  00:23     1,074,790,400 25
    13/03/2015  00:23     1,074,790,400 26
    13/03/2015  00:23     1,074,790,400 27
    13/03/2015  00:23     1,074,790,400 28
    13/03/2015  00:23     1,074,790,400 29
    13/03/2015  00:23     1,074,790,400 30
    13/03/2015  00:23     1,074,790,400 31
    13/03/2015  00:23     1,074,790,400 32
    13/03/2015  00:23     1,074,790,400 33
    13/03/2015  00:23     1,074,790,400 34
    13/03/2015  00:23     1,074,790,400 35
    13/03/2015  00:23     1,074,790,400 36
    13/03/2015  00:23     1,074,790,400 37
    13/03/2015  00:23     1,074,790,400 38
    13/03/2015  00:23     1,074,790,400 39
    13/03/2015  00:23     1,074,790,400 40
    13/03/2015  00:23     1,074,790,400 41
    13/03/2015  00:23     1,074,790,400 42
    13/03/2015  00:23     1,074,790,400 43
    13/03/2015  00:23     1,074,790,400 44
    13/03/2015  00:23     1,074,790,400 45
    13/03/2015  00:23     1,074,790,400 46
    13/03/2015  00:23     1,074,790,400 47
    13/03/2015  00:23     1,074,790,400 48
    13/03/2015  00:23     1,074,790,400 49
    13/03/2015  00:23     1,074,790,400 50
              50 File(s)  53,739,520,000 bytes
               2 Dir(s)  1,490,904,588,288 bytes free

    I then ran the following script which:

    1. reads 10MB chunks into separate buffers, from each of the 50 files;
    2. selects one of the buffers and trims a 16-byte record from it;
    3. loop back to 2; until one of the buffers is empty;
    4. repopulate that buffer from disk;
    5. loop back to 2; until a file is empty;
    6. close the file and remove the buffer from the array of buffers;
    7. loop back to 2; until all the buffers are empty;

    Total runtime (projection): 85 hours; average read rate: 10MB/minute, or about 171KB/second.

    The script:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];
    use constant _10MB => 10*1024**2;
    $|++;
    my @fhs;
    open $fhs[ $_ ], '<:raw', sprintf "%02d", $_ for 1 .. 50;
    shift @fhs;
    my $start = time;
    my @bufs;
    read( $fhs[ $_ ], $bufs[ $_ ], _10MB ) for 0 .. 49;
    OUTER: while( @fhs ) {
        printf "\r%d\t", scalar( @fhs );
        my $r = int rand @fhs;
        unless( length( $bufs[ $r ] ) ) {
            read( $fhs[ $r ], $bufs[ $r ], _10MB ) or do {
                close $fhs[ $r ];
                splice @fhs, $r, 1;
                splice @bufs, $r, 1;
                next OUTER;
            };
        }
        my $nextRec = substr( $bufs[ $r ], -16, 16, '' );
    }
    my $end = time;
    printf "Took %d seconds.\n", $end - $start;

    Note: there is no heap or merge or insertion sort of the records being 'merged', indeed no comparisons whatsoever; and no writing to disk.

    Just the generation of a random number, the copying of 16 bytes, and the adjustment of a pointer (CUR) for each record.

    But it happens 3.3 billion times (53,739,520,000 bytes / 16 bytes per record).

    With the result that there is almost no IO activity for 98% of the time, and then occasional bursts as 10MB buffers are repopulated.

    As random selection drains all 50 buffers at much the same rate, the 50 x 10MB refill reads tend to come grouped pretty close together, roughly every 50 minutes.

    With 102 buffer refills per 1GB file, that's 102 * 50 = 5100 minutes, or 85 hours, or 3 1/2 days; which mirrors the timings I've experienced using external sort programs.

    I'm not going to let it run to completion. I hope you'll understand why.

    Update: At the point of posting this, the process has been running for exactly 2 hours and has read 890.4MB.

    That's 7.42MB/minute, which projects to 4.8 days to complete the 50GB.


      Oh, well, I meant doing it in C; Perl has too much overhead for that.

      For instance, removing everything not in the hot path from your script:

      while( @fhs ) {
          printf "\r%d\t", scalar( @fhs );
          my $r = int rand @fhs;
          57 unless( length( $bufs[ $r ] ) );
          my $nextRec = substr( $bufs[ $r ], -16, 16, '' );
      }

      Perl compiles it into 40 OPs, 35 inside the loop:

      $ perl -MO=Concise,-exec buk.pl
      1  <0> enter
      2  <;> nextstate(main 5 buk.pl:1) v:{
      3  <{> enterloop(next->z last->13 redo->4) v
      10 <#> gv[*fhs] s
      11 <1> rv2av[t2] sK/1
      12 <|> and(other->4) vK/1
      4  <;> nextstate(main 1 buk.pl:2) v
      5  <0> pushmark sM
      6  <$> const[PV "\r%d\t"] sM
      7  <#> gv[*fhs] s
      8  <1> rv2av[t5] sK/1
      9  <@> prtf vK
      a  <;> nextstate(main 1 buk.pl:3) v
      b  <#> gv[*fhs] s
      c  <1> rv2av[t8] sK/1
      d  <1> rand[t9] sK/1
      e  <1> int[t10] sK/1
      f  <0> padsv[$r:1,3] sRM*/LVINTRO
      g  <2> sassign vKS/2
      h  <;> nextstate(main 2 buk.pl:4) v:{
      i  <#> gv[*bufs] s
      j  <1> rv2av sKR/1
      k  <0> padsv[$r:1,3] s
      l  <2> aelem sK/2
      m  <1> length[t12] sKP/1
      n  <|> or(other->o) vK/1
      o  <;> nextstate(main 2 buk.pl:5) v:{
      p  <#> gv[*bufs] s
      q  <1> rv2av sKR/1
      r  <0> padsv[$r:1,3] s
      s  <2> aelem sKM/2
      t  <$> const[IV -16] s/FOLD
      u  <$> const[IV 16] s
      v  <$> const[PV ""] s
      w  <@> substr[t16] sK/4
      x  <0> padsv[$nextRec:2,3] sRM*/LVINTRO
      y  <2> sassign vKS/2
      z  <0> unstack v
         goto 10
      13 <2> leaveloop vKP/2
      14 <@> leave[1 ref] vKP/REFC
      My estimate is that every OP executes somewhere between 50 and 100 machine instructions, with a good proportion of performance-unfriendly conditional branches. So, roughly, every pass through the loop above requires at least 2000 instructions, and may already be pushing the limits of the L1 code cache.

      On the other hand, an N-way merge in C needs only a couple of operations on the hot path of the inner loop, plus log2(N) key comparisons. That may sum to around 50 instructions per loop cycle (= per sorted record)!

      update: Oops, I forgot to say, in theory!

      One hour to sort 100GB does, admittedly, seem too fast to me. But I think the analysis is sound, and unless I am overlooking some hidden bottleneck, the order of magnitude should be right!

        The code has now been running for 8 hours 20 minutes and has processed 2.8GB; I'll let you do the math.

        On the other hand, an N-way merge in C needs only a couple of operations on the hot path of the inner loop, plus log2(N) key comparisons. That may sum to around 50 instructions per loop cycle (= per sorted record)!

        Show me the code!

