in reply to Re: Re: Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog
in thread Taming a memory hog
I thought I'd have a play, so here's my script - it will sort a file 'numbers.dat' containing lots of 8-digit decimal numbers into 'sorted.dat'.
For interest's sake I've left all the temporary files in, and it assumes the input file is zero-padded so all numbers are exactly 8 digits, no more, no less.
It sorted a 20,000,000 line file (200MB) in 340 seconds (just over 5 minutes) on my PC, never using more than 2MB RAM (well, it probably used more memory indirectly due to disk caching).
You would be able to get it quicker by making the 'radix' 100 instead of 10: that halves the number of passes from 8 to 4, so it'd probably be twice as quick (heck, if you make the radix 100000000 it needs only one pass and will probably sort it in about 90 seconds, but you could run out of file handles ;-) )
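(PS - it's actually quite a simple algorithm to understand as well, even I can follow it ;-) )

    $starttime = time;
    printf ("start - %d\n", $starttime);
    $pref = "";

    # Remove bucket files left over from a previous run.
    for ($j = 0; $j < 10; $j++) { unlink "-$j"; }

    # First pass: distribute on the last (least significant) digit.
    sortfile("numbers.dat", $pref, 7);

    # Remaining passes, one per digit, moving left. Each pass reads the
    # previous buckets in order 0..9 and appends, so the sort stays stable.
    for ($i = 6; $i >= 0; $i--) {
        for ($j = 0; $j < 10; $j++) { unlink "$pref-A-$j"; }
        for ($j = 0; $j < 10; $j++) { sortfile("$pref-$j", "$pref-A", $i); }
        $pref .= "-A";
    }
    printf ("end sort - %d (%d)\n", time, time - $starttime);

    # Concatenate the final buckets, in order, into the output file.
    open (FILE, ">sorted.dat") || die;
    for ($i = 0; $i < 10; $i++) {
        open(IN, "$pref-$i") || die;
        while(<IN>) { print FILE $_; }
        close(IN);
    }
    close(FILE);
    printf ("end - %d (%d)\n", time, time - $starttime);

    # Append every line of $source to the bucket file "$pref-<digit>",
    # where <digit> is the character at position $offset in the line.
    sub sortfile {
        my ($source, $pref, $offset) = @_;
        my ($i, @fh);    # parentheses needed: "my $i, @fh" only declares $i
        printf("$offset $pref $source - %d\n", time - $starttime);
        open(FILE, $source) || die;
        for ($i = 0; $i < 10; $i++) { open($fh[$i], ">>$pref-$i") || die; }
        while(<FILE>) {
            $f = $fh[substr($_, $offset, 1)];
            print $f $_;
        }
        close(FILE);
        for ($i = 0; $i < 10; $i++) { close($fh[$i]); }
    }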
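To illustrate the remark about a bigger radix, here is a minimal in-memory sketch, not the script itself. The name `radix_sort` and its `$digits` / `$width` parameters are made up for this example, and the arrays stand in for the temporary bucket files. It assumes every key is zero-padded to `$digits` characters and that `$width` divides `$digits` evenly.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original script: an in-memory LSD radix
# sort where each pass looks at $width digits instead of one.
sub radix_sort {
    my ($lines, $digits, $width) = @_;
    die "width must divide digits\n" if $digits % $width;

    my @data    = @$lines;
    my $buckets = 10 ** $width;    # one bucket per possible digit group
    my $passes  = 0;

    # Work from the least significant digit group to the most significant.
    for (my $offset = $digits - $width; $offset >= 0; $offset -= $width) {
        my @bucket = map { [] } 1 .. $buckets;
        # Appending in input order keeps each pass stable.
        push @{ $bucket[ substr($_, $offset, $width) ] }, $_ for @data;
        # Reading the buckets back in order gives the input for the next pass.
        @data = map { @$_ } @bucket;
        $passes++;
    }
    return (\@data, $passes);
}

my @input = qw(00000042 12345678 00000007 99999999 12345600);
for my $width (1, 2, 4) {
    my ($sorted, $passes) = radix_sort(\@input, 8, $width);
    printf "width %d: %d buckets, %d passes\n", $width, 10 ** $width, $passes;
}
```

In the file-based version each bucket is an open file handle, so the bucket count is what runs into the handle limit as the radix grows.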
The algorithm will be able to sort much bigger files in linear time (the number of passes depends only on the number of digits, not on the number of lines), and without using any more memory.
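As a rough back-of-the-envelope check of the linear-time claim, the sketch below counts how often the data crosses the disk. `io_estimate` is a hypothetical helper written for this note; it assumes every distribution pass reads and writes the whole data set once, plus one more sweep for the final concatenation into sorted.dat, and it ignores disk caching.

```perl
use strict;
use warnings;

# Hypothetical estimate, not a measurement: total bytes read plus written.
sub io_estimate {
    my ($bytes, $digits, $width) = @_;
    my $passes = $digits / $width;    # distribution passes over the data
    my $sweeps = $passes + 1;         # plus the final concatenation
    return ($sweeps, 2 * $sweeps * $bytes);
}

# The 200MB file from the post, 8 digits, one digit per pass.
my ($sweeps, $moved) = io_estimate(200_000_000, 8, 1);
printf "%d sweeps, %.1f GB read+written\n", $sweeps, $moved / 1e9;
```

Doubling the file size doubles the bytes moved but leaves the number of sweeps unchanged, which is where the linear behaviour comes from.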
312349
by codingchemist (Novice) on Dec 04, 2003 at 22:34 UTC