in reply to Re: Re: (Guildenstern) Re: Re: Taming a memory hog
in thread Taming a memory hog

Hi codingchemist, that depends on how large and complex your data set is.

If you are dealing with straightforward flat files (CSV-type files, for example), the Unix command-line sort is usually the fastest. The syntax is simple too, e.g., sort -t $'\t' -k 3 input.txt > output.txt will sort the tab-delimited input file on the 3rd field ($'\t' is how bash passes a literal tab as the delimiter; you can also type a real tab between the quotes).

Otherwise you could use a different sorting algorithm, like heap sort.

Heap sort is typically the slowest of the O(n log n) sorting algorithms, but unlike merge sort and quicksort it doesn't require deep recursion or auxiliary arrays to work. That makes it an attractive option for very large data sets of millions of items. There is a description and code example at the WKU-Linux user group site.
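For illustration, here is a minimal in-place heapsort sketch in Perl, sorting an array of strings with plain string comparison. It is only a sketch of the idea (the WKU page has its own version), and note that it still assumes the data fits in memory; the sample data is just for show.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Restore the max-heap property below $start by sifting it down.
    # Iterative, so no recursion and no auxiliary arrays.
    sub sift_down {
        my( $aref, $start, $end ) = @_;
        my $root = $start;
        while( ( my $child = 2 * $root + 1 ) <= $end ) {
            $child++ if $child + 1 <= $end
                     && $aref->[ $child ] lt $aref->[ $child + 1 ];
            last if $aref->[ $root ] ge $aref->[ $child ];
            @{$aref}[ $root, $child ] = @{$aref}[ $child, $root ];
            $root = $child;
        }
    }

    sub heapsort {
        my $aref = shift;
        my $n    = scalar @$aref;
        # build the heap from the last parent upwards
        sift_down( $aref, $_, $n - 1 ) for reverse 0 .. int( $n / 2 ) - 1;
        # repeatedly move the largest element to the end and re-sift
        for my $end ( reverse 1 .. $n - 1 ) {
            @{$aref}[ 0, $end ] = @{$aref}[ $end, 0 ];
            sift_down( $aref, 0, $end - 1 );
        }
    }

    my @lines = ( "pear\n", "apple\n", "mango\n", "banana\n" );
    heapsort( \@lines );
    print for @lines;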

Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog
by codingchemist (Novice) on Nov 17, 2003 at 04:20 UTC
    Unfortunately, the files are too big to just use sort on; I tried it and got a write error. I haven't tried the heap sort yet. The problem I am having is that my data is too large to hold in memory: I can only grab chunks of it at a time, otherwise I get a malloc error. I am currently using a merge sort, which works OK but gets rather slow when I am merging hundreds of times during a single run (literally hundreds of times: the largest array I have been able to keep in memory is 1,000,000 entries, and my file is 200,000,000 lines long, each line being a single entry in my array).
    I think I am using something like the heap sort now, only my "heap" is the actual file.
    Thanks for the link!
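    One way to cut down the number of merges in the chunked approach described above is to do a single k-way merge: sort each chunk to its own temporary file, then stream all the chunks back at once, always emitting the smallest pending line. A rough Perl sketch follows; it reads from STDIN, writes to STDOUT, compares whole lines as strings, and the chunk size and the flush_chunk helper are purely illustrative.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Temp qw( tempfile );

        my $CHUNK = 1_000_000;    # lines per in-memory chunk (illustrative)
        my( @chunk_fhs, @buf );

        # Pass 1: sort fixed-size chunks and spill each to its own temp file.
        while( my $line = <STDIN> ) {
            push @buf, $line;
            flush_chunk( \@buf, \@chunk_fhs ) if @buf == $CHUNK;
        }
        flush_chunk( \@buf, \@chunk_fhs ) if @buf;

        sub flush_chunk {
            my( $buf, $fhs ) = @_;
            my $fh = tempfile();              # deleted automatically on close
            print {$fh} $_ for sort @$buf;
            seek $fh, 0, 0;
            push @$fhs, $fh;
            @$buf = ();
        }

        # Pass 2: k-way merge - keep one pending line per chunk and always
        # emit the smallest, refilling from the chunk it came from.
        my @heads = map { scalar <$_> } @chunk_fhs;
        while( 1 ) {
            my $min;
            for my $i ( 0 .. $#heads ) {
                next unless defined $heads[$i];
                $min = $i if !defined $min || $heads[$i] lt $heads[$min];
            }
            last unless defined $min;
            print $heads[$min];
            $heads[$min] = readline $chunk_fhs[$min];
        }

    With 1,000,000-line chunks, a 200,000,000-line file needs about 200 temporary files, which is usually well within the open-file limit; for a couple of hundred chunks the linear scan over @heads is fine, but with many more chunks you would want a heap for the merge step.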
      Depending on the data, a radix sort might be just what you need. It works in several passes, and you don't need to store more than one record in memory at once (the one you're currently processing - you don't even need two records). Radix sorting is even regarded as one of the fastest sorting methods. (It's often slower than quicksort in practice, but with quicksort all your data needs to be in memory at once.) Have a Google for 'radix sorting'.
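      For the numeric-key case, an external least-significant-digit radix sort can be done with a handful of sequential passes over the file, holding only one line in memory at a time. A rough sketch: it assumes each line starts with a fixed-width, zero-padded numeric key, and the key width, input file name and bucket file names are all made up; the working file is rewritten on each pass, so run it on a copy.

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $KEY_WIDTH = 8;           # digits in the key (assumption)
          my $file      = 'data.txt';  # hypothetical working file

          for my $pos ( reverse 0 .. $KEY_WIDTH - 1 ) {
              # one bucket file per digit value for this pass
              my @buckets = map {
                  open my $fh, '>', "bucket.$_" or die $!;
                  $fh;
              } 0 .. 9;

              # stream the working file, routing each line by the current digit;
              # only one record is ever held in memory
              open my $in, '<', $file or die $!;
              while( my $line = <$in> ) {
                  my $digit = substr $line, $pos, 1;
                  print { $buckets[$digit] } $line;
              }
              close $in;
              close $_ for @buckets;

              # rebuild the working file by concatenating the buckets in order
              open my $out, '>', $file or die $!;
              for my $d ( 0 .. 9 ) {
                  open my $bf, '<', "bucket.$d" or die $!;
                  print {$out} $_ while <$bf>;
                  close $bf;
                  unlink "bucket.$d";
              }
              close $out;
          }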
        Interesting. I will have to look at the Radix sort in more depth.
        I was thinking of using a modified heap sort, because that way, if I want, I can grab only the top million or so lines (a rough sketch of that idea follows below). But I will have to look at this Radix sort; it might work really well.
        Again, thanks. I will let you know how it turns out once I finish the script.
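        Here is a rough sketch of the "keep only the top million or so lines" idea, using a periodically pruned buffer instead of a real heap (simpler to show; the value of $N and reading from STDIN are assumptions):

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Keep the N smallest lines in bounded memory: buffer lines, and
            # whenever the buffer reaches 2*N, sort it and keep the smallest N.
            my $N = 1_000_000;    # how many lines to keep (illustrative)
            my @top;

            while( my $line = <STDIN> ) {
                push @top, $line;
                @top = ( sort @top )[ 0 .. $N - 1 ] if @top >= 2 * $N;
            }

            # final prune (the input may hold fewer than N lines)
            my $last = $#top < $N - 1 ? $#top : $N - 1;
            @top = ( sort @top )[ 0 .. $last ];
            print for @top;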

      You might like to try this. It sorts an 80 MB file in around 5 minutes and consumes less than 10 MB of RAM.

      It will also handle sorting files up to the 4 GB filesystem limit using less than 50 MB of RAM, though it will obviously run somewhat more slowly. Given more information about the form of the records, it would be possible to tailor the algorithm to speed up the processing.

      #! perl -slw
      use strict;

      # usage: perl <thisscript> datafile > sortedfile
      open IN, '+<', $ARGV[0] or die $!;

      # Pass 1: bucket every line by its first two bytes, storing the
      # line's file offset plus its next four bytes in the bucket.
      my @splits;
      my $pos = 0;
      while( <IN> ) {
          $splits[ unpack 'n', substr( $_, 0, 2 ) ]
              .= pack( 'V', $pos ) . substr $_, 2, 4;
          $pos = tell IN;
      }
      @splits = grep $_, @splits;

      # Pass 2: sort each bucket on the stored four bytes, seeking back
      # into the file to compare full lines only when those bytes tie.
      for my $split ( @splits ) {
          $split = join '', map {
              substr $_, 0, 4
          } sort {
              my( $as, $at, $bs, $bt ) = ( unpack( 'VA4', $a ), unpack( 'VA4', $b ) );
              $at cmp $bt
                  || do{ seek IN, $as, 0; scalar <IN> } cmp do{ seek IN, $bs, 0; scalar <IN> }
          } unpack '(A8)*', $split;
      }

      # Pass 3: walk the buckets in order, printing each line by offset.
      for my $split ( @splits ) {
          print do{ seek IN, $_, 0; scalar <IN> } for unpack 'V*', $split;
      }

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail