in reply to Re^3: Can you improve upon my algorithm.
in thread Can you improve upon my algorithm.
In theory!!!!
In practice!
My E: drive is brand new, barely used and connected to a SATA-3 i/f.
wc -c does almost nothing but increment a variable, yet it reads a 34GB file in 192 seconds, a rate of 171MB/s:
E:\>dir 30GB.dat
28/02/2015  08:21    34,560,000,000 30GB.dat

[ 0:08:11.82] E:\>wc -c 30GB.dat
 34560000000 30GB.dat

[ 0:11:23.17] E:\>perl -E"say 34560000000 / 192 / 1024**2;"
171.661376953125
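For comparison, that sequential-read baseline can be measured with a few lines of Perl. This is only a sketch, not from the original test; the default file path and the 10MB chunk size are arbitrary choices:

#! perl -slw
# Sketch: measure sequential read throughput of one large file.
# The default path and the 10MB chunk size are illustrative only.
use strict;
use Time::HiRes qw[ time ];

my $file = shift // 'E:/30GB.dat';
open my $fh, '<:raw', $file or die "open '$file': $!";

my $start = time;
my( $total, $buf ) = ( 0, '' );
$total += length $buf while read( $fh, $buf, 10 * 1024**2 );
my $secs = time() - $start;

printf "%d bytes in %.1f secs = %.1f MB/s\n",
    $total, $secs, $total / $secs / 1024**2;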
I created 50 1GB files:
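(The file-creation code isn't shown here; a minimal sketch that would produce 50 such files of 16-byte records, named 01 .. 50 to match the script below, might look like this. The record contents are an arbitrary assumption; only the names and sizes matter.)

#! perl -sw
# Sketch only: write 50 x 1GB files of 16-byte records, named 01 .. 50.
# Record contents (two packed 64-bit integers) are arbitrary.
use strict;

use constant RECS_PER_FILE => 1024**3 / 16;    # 67,108,864 records = 1GB

for my $n ( 1 .. 50 ) {
    my $name = sprintf "%02d", $n;
    open my $fh, '>:raw', $name or die "open '$name': $!";
    print { $fh } pack( 'QQ', $n, $_ ) for 1 .. RECS_PER_FILE;
    close $fh or die "close '$name': $!";
}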
I then ran the following script, which opens all 50 files in raw mode, primes a 10MB buffer from each, and then loops: picking a file at random, consuming one 16-byte record from the end of its buffer, refilling that buffer from disk whenever it runs dry, and closing and dropping each file once it is exhausted.
Total runtime (projection): 85 hours; average read rate: 10MB/minute, or roughly 171KB/second.
The script:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

use constant _10MB => 10*1024**2;

$|++;

# Open the 50 files ('01' .. '50') in raw mode; the loop fills
# indexes 1..50, so shift the unused element 0 away.
my @fhs;
open $fhs[ $_ ], '<:raw', sprintf "%02d", $_ for 1 .. 50;
shift @fhs;

my $start = time;

# Prime a 10MB buffer from each file.
my @bufs;
read( $fhs[ $_ ], $bufs[ $_ ], _10MB ) for 0 .. 49;

OUTER: while( @fhs ) {
    printf "\r%d\t", scalar( @fhs );

    # Pick a file at random; refill its buffer if empty, dropping the
    # file once the read returns nothing more.
    my $r = int rand @fhs;
    unless( length( $bufs[ $r ] ) ) {
        read( $fhs[ $r ], $bufs[ $r ], _10MB ) or do {
            close $fhs[ $r ];
            splice @fhs,  $r, 1;
            splice @bufs, $r, 1;
            next OUTER;
        };
    }

    # Consume one 16-byte record from the end of the buffer.
    my $nextRec = substr( $bufs[ $r ], -16, 16, '' );
}

my $end = time;

printf "Took %d seconds.\n", $end - $start;
Note: there is no heap, merge, or insertion sort of the records being 'merged'; indeed, there are no comparisons whatsoever, and no writing to disk.
Just the generation of a random number, the copying of 16 bytes, and the adjustment of a pointer (CUR) for each record.
But it happens 3.3 billion times.
The result is that there is almost no IO activity for 98% of the time, punctuated by occasional bursts as the 10MB buffers are repopulated.
And since rand distributes its picks evenly across the files, the 50 buffers tend to empty at about the same time, so the 50 x 10MB refill reads come grouped closely together, roughly every 50 minutes.
With 102 buffers per 1GB file, that's 102 * 50 = 5100 minutes, or 85 hours, or about 3 1/2 days; which mirrors the timings I've experienced using external sort programs.
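For reference, the projection arithmetic can be reproduced directly from those figures (a sketch, not part of the original test):

#! perl -slw
# Sketch: reproduce the 85-hour projection from the figures above.
use strict;

my $files     = 50;
my $file_size = 1024**3;         # 1GB per file
my $rec_size  = 16;              # 16-byte records
my $buf_size  = 10 * 1024**2;    # 10MB buffer per file

my $records       = $files * $file_size / $rec_size;    # ~3.36e9 records
my $bufs_per_file = int( $file_size / $buf_size );      # 102 refills per file
my $total_refills = $files * $bufs_per_file;            # 5100 refills

# At roughly one refill per minute (one 50-buffer 'round' every ~50 minutes):
printf "%.2e records, %d refills => %d hours (%.1f days)\n",
    $records, $total_refills, $total_refills / 60, $total_refills / 60 / 24;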
I'm not going to let it run to completion. I hope you'll understand why.
Update: At the point of posting this, the process had been running for exactly 2 hours and had read 890.4MB.
That's 7.42MB/minute, which projects to 4.8 days to complete the 50GB.
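That projection is just the same arithmetic applied to the observed rate, for example:

perl -E"say 890.4 / 120"                  # 890.4MB over 120 minutes = 7.42MB/minute
perl -E"say 50 * 1024 / 7.42 / 60 / 24"   # 50GB at 7.42MB/minute = ~4.79 days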