in reply to Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog
in thread Taming a memory hog

Unfortunately, the files are too big to just use sort on; I tried it and got a write error. I haven't tried the heap sort. The problem is that my data is too large to hold in memory, so I can only grab chunks of it at a time; otherwise I get a malloc error. I am currently using a merge sort, which works OK but gets rather slow when I am merging hundreds of times during a single run (literally hundreds of times: the largest array I have been able to keep in memory is 1,000,000 entries, and my file is 200,000,000 lines long, each line being a single entry in my array).
I think I am using something like the heap sort now, only my "heap" is the actual file.
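For reference, here is a minimal sketch of that chunk-and-merge approach (the file names and the 1,000,000-line chunk size are placeholders, not the actual script): sort whatever fits in memory, spill each sorted chunk to a temp file, then do a single k-way merge instead of merging hundreds of times.

#!/usr/bin/perl
# Sketch of external merge sort: sort fixed-size chunks in RAM, spill
# each to a temp file, then one k-way merge. Names/sizes are assumptions.
use strict;
use warnings;

my $CHUNK_LINES = 1_000_000;                      # assumed in-memory limit
my ($in, $out)  = ('numbers.dat', 'sorted.dat');  # hypothetical file names

my (@tmp, @chunk);
sub spill {
    return unless @chunk;
    my $name = 'chunk' . scalar(@tmp) . '.tmp';
    open my $t, '>', $name or die "$name: $!";
    print {$t} sort @chunk;       # one in-memory sort per chunk
    close $t;
    push @tmp, $name;
    @chunk = ();
}

open my $fh, '<', $in or die "$in: $!";
while (defined(my $line = <$fh>)) {
    push @chunk, $line;
    spill() if @chunk == $CHUNK_LINES;
}
close $fh;
spill();

# k-way merge: repeatedly emit the smallest head line among all chunks.
my @fhs  = map { open my $t, '<', $_ or die "$_: $!"; $t } @tmp;
my @head = map { scalar readline $_ } @fhs;
open my $o, '>', $out or die "$out: $!";
while (grep { defined } @head) {
    my $min;
    for my $i (0 .. $#head) {
        next unless defined $head[$i];
        $min = $i if !defined($min) or $head[$i] lt $head[$min];
    }
    print {$o} $head[$min];
    $head[$min] = readline $fhs[$min];
}
close $o;
unlink @tmp;                      # clean up the temporary chunk files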
Thanks for the link!

Re: Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog
by Paul Smith (Initiate) on Nov 17, 2003 at 17:19 UTC
    Depending on the data, a radix sort might be just what you need. It sorts in several passes, and you never need to hold more than one record in memory at a time (the one you are currently distributing; you don't even need TWO records). Radix sort is even regarded as one of the fastest sorting methods. (It's slower than quicksort, but quicksort needs all of your data in memory at once.) Have a google for 'radix sorting'.
      Interesting. I will have to look at the radix sort in more depth.
      I was thinking of using a modified heap sort, because that way, if I want, I can grab only the top million or so lines (sketched below). But I will have to look at this radix sort; it might work really well.
      Again, thanks. I will let you know how it turns out, once I finish the script.
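      The rough idea, sketched as a buffer-and-prune loop rather than a real heap (the 1,000,000 figure is just the example size from above): memory stays bounded at roughly 2 * $N lines no matter how long the file is.

      #!/usr/bin/perl
      # Keep only the N lines that sort first while streaming the input.
      # Not a true heap, just a cheap stand-in for the same "top N" idea.
      use strict;
      use warnings;

      my $N = 1_000_000;     # assumed size of the "top" slice wanted
      my @top;               # current candidates for the N smallest lines

      while (defined(my $line = <>)) {
          push @top, $line;
          if (@top >= 2 * $N) {
              @top = (sort @top)[0 .. $N - 1];   # prune back to N smallest
          }
      }
      my $last = $#top < $N - 1 ? $#top : $N - 1;
      print( (sort @top)[0 .. $last] );

      Run as perl topn.pl numbers.dat; reversing the sort comparison would keep the other end of the file instead.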

        I thought I'd have a play, so here's my script - it will sort a file 'NUMBERS.DAT' containing lots of 8-digit decimal numbers into 'SORTED.DAT'.

        For interest's sake I've left all the temporary files in, and it assumes the input file is zero-padded so all numbers are 8 digits, no more, no less.

        It sorted a 20,000,000-line file (200MB) in 340 seconds (just over 5 minutes) on my PC, never using more than 2MB of RAM (well, it probably used more memory indirectly due to disk caching).

        You would be able to make it quicker by using a radix of 100 instead of 10; it would probably be twice as fast (heck, if you make the radix 100000000, it'll probably sort it in about 90 seconds, but you could run out of file handles ;-) ). There's a sketch of the two-digit variant at the end of this post.

        $starttime = time;
        printf("start - %d\n", $starttime);

        # Pass 1: bucket every line on its least significant digit (offset 7).
        $pref = "";
        for ($j = 0; $j < 10; $j++) { unlink "-$j"; }
        sortfile("numbers.dat", $pref, 7);

        # Passes 2-8: re-bucket on each more significant digit in turn
        # (a classic least-significant-digit radix sort).
        for ($i = 6; $i >= 0; $i--) {
            for ($j = 0; $j < 10; $j++) { unlink "$pref-A-$j"; }
            for ($j = 0; $j < 10; $j++) { sortfile("$pref-$j", "$pref-A", $i); }
            $pref .= "-A";
        }
        printf("end sort - %d (%d)\n", time, time - $starttime);

        # Concatenate the ten final buckets, in order, into the output file.
        open(FILE, ">sorted.dat") || die;
        for ($i = 0; $i < 10; $i++) {
            open(IN, "$pref-$i") || die;
            while (<IN>) { print FILE $_; }
            close(IN);
        }
        close(FILE);
        printf("end - %d (%d)\n", time, time - $starttime);

        # Distribute each line of $source into one of ten bucket files,
        # keyed on the digit at position $offset.
        sub sortfile {
            my ($source, $pref, $offset) = @_;
            my ($i, @fh);
            printf("$offset $pref $source - %d\n", time - $starttime);
            open(FILE, $source) || die;
            for ($i = 0; $i < 10; $i++) {
                open($fh[$i], ">>$pref-$i") || die;
            }
            while (<FILE>) {
                $f = $fh[substr($_, $offset, 1)];
                print $f $_;
            }
            close(FILE);
            for ($i = 0; $i < 10; $i++) { close($fh[$i]); }
        }
        (PS - it's actually quite a simple algorithm to understand as well, even I can follow it ;-) )

        The algorithm can sort much bigger files in linear time, without using any more memory.
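        To make the radix-100 suggestion concrete, here is an untested sketch of how sortfile might change: two digits per pass, 100 bucket files, and only four passes over an 8-digit key. The sub name and file naming are made up to mirror the original.

        # Distribute each line into one of 100 buckets, keyed on the two
        # digits at $offset (so $offset steps through 6, 4, 2, 0).
        sub sortfile100 {
            my ($source, $pref, $offset) = @_;
            my @fh;
            open(my $in, '<', $source) or return;   # tolerate missing buckets
            for my $i (0 .. 99) {
                open($fh[$i], '>>', sprintf("%s-%02d", $pref, $i)) or die $!;
            }
            while (defined(my $line = <$in>)) {
                print { $fh[substr($line, $offset, 2)] } $line;
            }
            close $in;
            close $_ for @fh;
        }

        # Driver: four passes instead of eight, otherwise as in the original.
        my $pref = "";
        unlink glob "-??";
        sortfile100("numbers.dat", $pref, 6);
        for my $offset (4, 2, 0) {
            unlink glob "$pref-A-??";
            sortfile100(sprintf("%s-%02d", $pref, $_), "$pref-A", $offset) for 0 .. 99;
            $pref .= "-A";
        }
        # The final concatenation of "$pref-00" .. "$pref-99" proceeds as before.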

Re: Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog
by BrowserUk (Patriarch) on Nov 19, 2003 at 06:36 UTC

    You might like to try this. It sorts an 80MB file in around 5 minutes and consumes less than 10 MB of RAM.

    It will also handle sorting files up to the 4 GB filesystem limit using less than 50 MB of RAM, though it will obviously run somewhat more slowly. Given more information about the form of the records, it would be possible to tailor the algorithm to speed up the processing.

    #! perl -slw
    use strict;

    open IN, '+<', $ARGV[0] or die $!;

    # Index pass: bucket every line by its first two characters, storing
    # only a 4-byte file offset plus the next 4 characters as a sort key.
    my @splits;
    my $pos = 0;
    while( <IN> ) {
        $splits[ unpack 'n', substr( $_, 0, 2 ) ]
            .= pack( 'V', $pos ) . substr $_, 2, 4;
        $pos = tell IN;
    }
    @splits = grep $_, @splits;   # drop empty buckets; order is preserved

    # Sort each bucket on the stored 4-char key, seeking back into the
    # file only to break ties; keep just the 4-byte offsets afterwards.
    for my $split ( @splits ) {
        $split = join '', map{ substr $_, 0, 4 }
            sort{
                my( $as, $at, $bs, $bt )
                    = ( unpack( 'VA4', $a ), unpack( 'VA4', $b ) );
                $at cmp $bt
                    || do{ seek IN, $as, 0; scalar <IN> }
                   cmp do{ seek IN, $bs, 0; scalar <IN> }
            } unpack '(A8)*', $split;
    }

    # Output pass: walk the buckets in order, re-reading each line by offset.
    for my $split ( @splits ) {
        print do{ seek IN, $_, 0; scalar <IN> } for unpack 'V*', $split;
    }
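    In case it isn't obvious from the code: the input file name comes from the command line and the sorted lines go to STDOUT, so (assuming you save it as, say, sortbig.pl, a name made up here) you would run it as perl sortbig.pl bigfile.dat > sorted.dat. Note that the input is opened with '+<' (read/write), though the script only ever reads and seeks.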

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail