in reply to Re: (Guildenstern) Re: Re: Taming a memory hog
in thread Taming a memory hog

This is an interesting problem of dealing with large datasets. Currently I am trying to work with files 20,000,000 lines long and am trying to sort them. Do you have any suggestions about sorting? There seems to be a lot of information out there on large datasets, but I haven't seen much on sorting, especially on datasets too large to hold in memory. Thanks

Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog
by Roger (Parson) on Nov 12, 2003 at 00:30 UTC
    Hi codingchemist, that depends on how large and complex your data set is.

    If you are dealing with straightforward flat files (CSV-type files, for example), the Unix command-line sort is usually the fastest. The syntax is simple too, e.g. sort -t $'\t' -k 3 input.txt > output.txt will sort the tab-delimited input file on its 3rd field.
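
    For files larger than available memory it is also worth knowing that GNU sort already performs an external merge sort on its own, spilling sorted runs to temporary files as it goes. Assuming GNU coreutils, something like sort -t $'\t' -k 3 -S 500M -T /scratch input.txt > output.txt caps the in-memory buffer at roughly 500 MB and sends the temporary spill files to a scratch directory with enough free space (the -S and -T options, and the 500M and /scratch values, are GNU-specific examples).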

    Otherwise you could use a different sorting algorithm, like heap sort.

    The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge and quick sorts it doesn't require massive recursion or multiple arrays to work. This makes it the most attractive option for very large data sets of millions of items. There is a description and code example here at the WKU-Linux user group.
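
    For illustration, a minimal in-place heap sort in Perl might look like the sketch below. It is only a sketch using ordinary string comparison on array elements, not the WKU-Linux code mentioned above:

    # Sort @$a in place as a max-heap: no recursion and no second array.
    sub heap_sort {
        my ($a) = @_;
        my $n = @$a;
        # Build the heap from the last parent node down to the root.
        for ( my $i = int( $n / 2 ) - 1; $i >= 0; $i-- ) {
            _sift_down( $a, $i, $n );
        }
        # Repeatedly move the largest element to the end and shrink the heap.
        for ( my $end = $n - 1; $end > 0; $end-- ) {
            @$a[ 0, $end ] = @$a[ $end, 0 ];
            _sift_down( $a, 0, $end );
        }
    }

    # Push the element at index $i down until the heap property holds again.
    sub _sift_down {
        my ( $a, $i, $n ) = @_;
        while ( 1 ) {
            my $child = 2 * $i + 1;
            last if $child >= $n;
            $child++ if $child + 1 < $n and $a->[ $child + 1 ] gt $a->[ $child ];
            last if $a->[ $i ] ge $a->[ $child ];
            @$a[ $i, $child ] = @$a[ $child, $i ];
            $i = $child;
        }
    }

    Call it as heap_sort( \@records ); the array is sorted in place, which is why heap sort needs neither deep recursion nor a scratch array.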

      Unfortunately, the files are too big to just run sort on; I tried it and got a write error. I haven't tried the heap sort. The problem I am having is that my data is too large to hold in memory: I can only grab chunks of it at a time, otherwise I get a malloc error. I am currently using a merge sort, which works OK but gets rather slow when I am merging hundreds of times during a single run (literally hundreds of times - the largest array I have been able to keep in memory is 1,000,000 entries, and my file is 200,000,000 lines long, each line being a single entry in my array). A sketch of this chunk-and-merge approach appears below.
      I think I am using something like the heap sort now, only my "heap" is the actual file.
      Thanks for the link!
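
      For reference, a minimal sketch of that chunk-and-merge (external merge sort) approach in Perl could look like the following. The 1,000,000-line chunk size, the File::Temp temporary files and the plain lexicographic comparison of whole lines are assumptions for illustration, not codingchemist's actual code:

      #!/usr/bin/perl
      # Sketch of an external merge sort: sort the file in chunks that fit in
      # memory, write each sorted chunk to a temporary file, then merge the
      # temporary files in one pass, always emitting the smallest current line.
      use strict;
      use warnings;
      use File::Temp qw(tempfile);

      my $CHUNK = 1_000_000;                 # lines held in memory at once
      my $in = shift or die "usage: $0 bigfile > sorted\n";
      open my $fh, '<', $in or die "$in: $!";

      # Pass 1: write sorted runs to temporary files (File::Temp's default
      # directory; the files are removed when the program exits).
      my @runs;
      until ( eof $fh ) {
          my @chunk;
          while ( defined( my $line = <$fh> ) ) {
              push @chunk, $line;
              last if @chunk >= $CHUNK;
          }
          my ( $tmp, $name ) = tempfile( UNLINK => 1 );
          print {$tmp} sort @chunk;
          close $tmp or die $!;
          push @runs, $name;
      }
      close $fh;

      # Pass 2: k-way merge of the runs.
      my @handles = map { open my $h, '<', $_ or die "$_: $!"; $h } @runs;
      my @heads   = map { scalar readline $_ } @handles;
      while ( grep defined, @heads ) {
          my $min;
          for my $i ( 0 .. $#heads ) {
              next unless defined $heads[$i];
              $min = $i if !defined $min or $heads[$i] lt $heads[$min];
          }
          print $heads[$min];
          $heads[$min] = readline $handles[$min];
      }

      The merge phase holds only one current line per run in memory, so even 200 runs of 1,000,000 lines each keep the footprint small; the linear scan for the smallest head could be swapped for a priority queue if the number of runs grows large.
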
        Depending on the data, a radix sort might be just what you need. It works in several passes, and you don't need to hold more than one record in memory at once (the one you're currently sorting - you don't even need TWO records). Radix sorting is even regarded as one of the fastest sorting methods. (It's slower than quicksort, but with quicksort all your data needs to be in memory at once.) Have a google for 'radix sorting' - a single bucketing pass is sketched below.
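
        As a rough illustration of that idea, one external bucketing pass in Perl might look like this. The bucket.*.tmp file names and the in-memory sort of each finished bucket are assumptions; a full radix sort would instead repeat the pass on the next character of any oversized bucket:

        #!/usr/bin/perl
        # Sketch of a single external radix (bucketing) pass: stream the file
        # one record at a time into per-first-character bucket files, then
        # emit the buckets in order.
        use strict;
        use warnings;

        my $in = shift or die "usage: $0 bigfile > sorted\n";
        open my $fh, '<', $in or die "$in: $!";

        # Distribute: only one record is ever held in memory here.
        my %bucket;
        while ( my $line = <$fh> ) {
            my $key = sprintf '%03d', ord $line;     # first byte of the record
            unless ( $bucket{$key} ) {
                open $bucket{$key}, '>', "bucket.$key.tmp" or die $!;
            }
            print { $bucket{$key} } $line;
        }
        close $fh;
        close $_ for values %bucket;

        # Collect: walk the buckets in key order. Each bucket is assumed small
        # enough to sort in memory; oversized buckets could be split again on
        # their second character instead.
        for my $key ( sort keys %bucket ) {
            open my $b, '<', "bucket.$key.tmp" or die $!;
            print sort <$b>;
            close $b;
            unlink "bucket.$key.tmp";
        }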

        You might like to try this. It sorts an 80 MB file in around 5 minutes and consumes less than 10 MB of RAM.

        It will also handle sorting files up to the 4 GB filesystem limit using less than 50 MB of RAM, though it will obviously run somewhat more slowly. Given more information about the form of the records, it would be possible to tailor the algorithm to speed the processing.

        #! perl -slw
        use strict;

        open IN, '+<', $ARGV[0] or die $!;

        # Pass 1: bucket the lines on their first two characters ('n' unpacks
        # them as a 16-bit integer). For each line keep only an 8-byte record:
        # its file offset (packed 'V') plus the next four characters.
        my @splits;
        my $pos = 0;
        while( <IN> ) {
            $splits[ unpack 'n', substr( $_, 0, 2 ) ]
                .= pack( 'V', $pos ) . substr $_, 2, 4;
            $pos = tell IN;
        }
        @splits = grep $_, @splits;    # drop empty buckets

        # Pass 2: sort each bucket on the 4-character snippet, re-reading the
        # full lines from disk only to break ties, then keep just the offsets.
        for my $split ( @splits ) {
            $split = join '', map{ substr $_, 0, 4 }
                sort {
                    my( $as, $at, $bs, $bt )
                        = ( unpack( 'VA4', $a ), unpack( 'VA4', $b ) );
                    $at cmp $bt
                        || do{ seek IN, $as, 0; scalar <IN> }
                           cmp do{ seek IN, $bs, 0; scalar <IN> }
                } unpack '(A8)*', $split;
        }

        # Pass 3: walk the buckets in order, seek to each stored offset and
        # print the line.
        for my $split ( @splits ) {
            print do{ seek IN, $_, 0; scalar <IN> } for unpack 'V*', $split;
        }

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail