Re: Re: (Guildenstern) Re: Re: Taming a memory hog

Replies are listed 'Best First'.
Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog by Roger (Parson) on Nov 12, 2003 at 00:30 UTC
Hi codingchemist, that depends on how large and complex your data set is. If you are dealing with straight forward flat files (like csv type of files), the unix command-line sort is usually the fastest. The syntax is simple too, eg., `sort -t "\t" -k 3 input.txt > output.txt` will sort the tab delimited input file based on 3rd key. Otherwise you could use a different sorting algorithm, like heap sort. The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge and quick sorts it doesn't require massive recursion or multiple arrays to work. This makes it the most attractive option for very large data sets of millions of items. There is a description and code example here at the WKU-Linux user group.	[reply] [d/l]
Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog by codingchemist (Novice) on Nov 17, 2003 at 04:20 UTC
unfortunately, the files are too big to just use a sort on. I tried it, and got a write error. I haven't tried the heap sort. What I am having a problem with is my data is too large to contain in memory, I can only grab chunks of it at a time, otherwise I get a malloc error. I currently am using the merge sort, which works OK, but gets rather slow when I am merging hundreds of times during a single run (literally hundreds of times, the largest array I have been able to keep in memory is 1,000,000, and my file is 200,000,000 lines long, each line being a single entry in my array). I think I am using something like the heap sort now, only my "heap" is the actual file. Thanks for the link!	[reply]
Re: Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog by Paul Smith (Initiate) on Nov 17, 2003 at 17:19 UTC
Depending on the data, a 'Radix sort' might be just what you need. This does it with several passes, and you don't need to store more than one record in memory at once (the one you're currently sorting - you don't even need TWO records) Radix sorting is even regarded as 'one of the fastest sorting methods'. (It's slower than QuickSort, but with QuickSort, all your data needs to be in memory at once) Have a google for 'Radix Sorting'	[reply]
Re: Re: Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog by codingchemist (Novice) on Nov 19, 2003 at 04:34 UTC
Re: Re: Re: Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog by Paul Smith (Initiate) on Nov 19, 2003 at 17:37 UTC
Some notes below your chosen depth have not been shown here
Re: Re: Re: Re: Re: (Guildenstern) Re: Re: Taming a memory hog by BrowserUk (Patriarch) on Nov 19, 2003 at 06:36 UTC
You might like to try this. It sorts an 80MB file in around 5 mins and consumes less than 10 MB of ram. I will also handle sorting files up to the 4 GB filesystem limit using less than 50 MB of ram, though it will obviously run somewhat more slowly. Given more information about the form of the records, it would be possible to tailor the algorithm to speed the processing. #! perl -slw use strict; open IN, '+<', $ARGV[0] or die $!; my @splits; my $pos = 0; while( <IN> ) { $splits[ unpack 'n', substr( $_, 0, 2) ] .= pack( 'V', $pos ) . substr $_, 2, 4; $pos = tell IN; } @splits = grep $_, @splits; my $n; for my $split ( @splits ) { $split = join'', map{ substr $_, 0, 4 } sort{ my( $as, $at, $bs, $bt ) = ( unpack( 'VA4', $a ), unpack( 'VA4 +', $b ) ); $at cmp $bt \|\| do{ seek IN, $as, 0; scalar <IN> } cmp do{ seek IN, $bs, 0; scalar <IN> } } unpack '(A8)', $split; } for my $split ( @splits ) { printf do{ seek IN, $_, 0; scalar <IN> } for unpack 'V', $split; } [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail Hooray! Wanted!	[reply] [d/l]