jlf has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I've got a large array (12k+ elements) of hashes, and I would like to sort this array by one of the keys in the hashes. When I was testing the logic with only a few hundred elements, this code

@list = sort { ${$a}{dist} <=> ${$b}{dist} } @list;

was working nicely. But when I tried sorting the complete data set, I encountered out of memory errors.

I believe the path forward is to output each record to disk as it's generated, and then sort the file on disk later. It appears that Chris Nandor's File::Sort module would do the trick, although I suspect it may require a change from using Data::Dumper to writing one record per line. This isn't a problem, but is there a cleaner way to do this?
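
For concreteness, here's roughly what I have in mind for the output step -- one tab-separated record per line. (The id field and the generate_records() call are just stand-ins for however my records are actually produced.)

# Stream each record to disk as it's generated instead of
# accumulating everything in @list.
open(RECORDS, '>', 'records.txt') or die "Can't open records.txt: $!";
for my $rec (generate_records()) {
    print RECORDS join("\t", $rec->{id}, $rec->{dist}), "\n";
}
close(RECORDS);

The flat file could then be sorted on the dist column, either with File::Sort or an external sort.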

Thanks!
Josh

Replies are listed 'Best First'.
Re: Sorting a large data set
by clintp (Curate) on Dec 30, 2001 at 05:31 UTC
    You could do an in-place sort* but then you'd give up the speed of using Perl's built-in sort. The problem is that with almost any @foo = func @foo you wind up with two copies of @foo running around in memory.

    If you want to write them to disk, yeah, you're pretty much gonna have to format the records one-per-line as you write them. If you're on a unix system, calling an external sort(1) might even be quicker than File::Sort (I'll bet it is).
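
    Something along these lines, assuming one tab-separated record per line with the numeric distance in field 2 (untested, so treat it as a sketch rather than the definitive recipe):

    # Let the external sort(1) do the heavy lifting; it copes with
    # files far larger than available memory.
    system('sort', '-t', "\t", '-k', '2,2n', '-o', 'records.sorted', 'records.txt') == 0
        or die "sort(1) failed: $?";

    # Then read the sorted file back a line at a time.
    open(SORTED, '<', 'records.sorted') or die "Can't open records.sorted: $!";
    while (<SORTED>) {
        chomp;
        my ($id, $dist) = split /\t/;
        # ... process records in increasing order of dist ...
    }
    close(SORTED);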

    A third consideration is this: if the elements are huge create an "index" array that just has the numbers 0..$#list in it. Then sort it like this:

    for (0 .. 1000) { push @huge, { key => rand, value => 'hlaghlagh' } }
    @index = (0 .. $#huge);
    @index = sort { $huge[$a]->{key} <=> $huge[$b]->{key} } @index;
    At this point you can just use @index to access the array elements in sorted order. The problem in your case is that you've just got an array of references (the things you're sorting aren't huge; the list itself is), and I'm not sure an @index array of SVs is gonna be that much smaller than the references. If it were, you could just unwind it later with $_ = $huge[$_] for @index;

    12k elements doesn't seem like that much, though. Hrm.

    *I always thought an in-place sort would be a great use for scalar or void context sort. I haven't got the know-how to come up with the patch for that, though.

Re: Sorting a large data set
by khkramer (Scribe) on Dec 30, 2001 at 09:44 UTC

    I hate to ask this, but are you sure it's the sort line that's the culprit, or could some other manipulation be causing the out-of-memory problems? I've done lots of sort-and-assigns just like you're doing, and even on very large arrays of hashrefs (circa 100k elements) the overhead from the sort is never more than a few hundred kilobytes.

    Now to digress (or possibly not), there is one behavior peculiar to sorting arrays of references that I don't understand (and perhaps this -- or a variant -- is what's biting you)...

    # for @foo with 100,000 elements, this sort eats 12k of memory
    @foo = sort { $a->{bar} cmp $b->{bar} } @foo;

    # but for the same @foo, this sort eats 90M!
    @foo = sort @foo;
    @foo = sort { $a cmp $b } @foo;   # equivalent

    As far as I can tell, this "bloat" happens when you try to sort any list of references with the default comparison operator. (I'm running 5.6.1 on linux.) It doesn't happen just because you compare two references inside a sort block...

    # requires scads of memory
    @array_of_refs = sort { $a cmp $b } @array_of_refs;

    # doesn't
    @array_of_simple_scalars = sort { \$a cmp \$b } @array_of_simple_scalars;

    I would think that the default sort on @array_of_refs would be doing a lexical comparison on the "stringified" ref. But apparently, that's not the case. Even attempts to force "stringification" inside the sort block (but still refer to the ref) don't fix the problem...

    # scads
    @array_of_stringrefs = sort { ('a: ' . $a) cmp ('b: ' . $b) } @array_of_stringrefs;

    # scadless
    @array_of_stringrefs = sort { ('a: ' . $$a) cmp ('b: ' . $$b) } @array_of_stringrefs;

    Curiouser and curiouser. Can anyone shed any light on what might be going on here?

    Kwin
      Thanks, khkramer (and others!) for the responses.

      The plot thickens. I'm ashamed to admit I had only assumed the sort operation was to blame for the out of memory error, but it turns out that the sort finishes successfully, and the operation that's choking is really

      print AP_LIST Data::Dumper->Dump([\@list], [qw(*list)]);

      I'm using ActiveState, so I checked the mailing list, and it turns out this is a bug that was reported to the ActivePerl mailing list several weeks ago.
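
      In case anyone else trips over the same thing, the workaround I'm going to try (just a sketch, I haven't verified it against that bug yet) is to Dump one element per call instead of the whole array at once:

      # @list is the already-sorted array; dumping one hashref at a
      # time keeps the string Data::Dumper has to build small.
      open(AP_LIST, '>', 'ap_list.dump') or die "Can't open: $!";
      for my $i (0 .. $#list) {
          print AP_LIST Data::Dumper->Dump([ $list[$i] ], [ "list[$i]" ]);
      }
      close(AP_LIST);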

      Sorry to have dropped a red herring of sorts <grin> but thanks again for the feedback.

      Josh
Re: Sorting a large data set
by talexb (Chancellor) on Dec 30, 2001 at 09:31 UTC
    Another approach is to break the array into smaller chunks and sort each of the chunks. (I guess I'd use temporary disk files somehow, maybe using indexes into your original array?)

    Then I'd interleave the sorted chunks into a final, sorted monolith. If a few hundred elements worked fine during your testing, I'd probably crank it up to a thousand at a time ... you could even run some tests where you optimize, to see how fast you can get it to run.
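
    Here's a rough sketch of the merge step, assuming each sorted chunk file already holds one record per line with the numeric dist key first, tab-separated (the file names and layout are made up for illustration):

    use strict;

    # Open every pre-sorted chunk file.
    my @handles;
    for my $file (glob('chunk.*.sorted')) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        push @handles, $fh;
    }

    # Prime one line from each chunk, then repeatedly emit the line
    # with the smallest key until every chunk is exhausted.
    my @current = map { scalar <$_> } @handles;
    open(MERGED, '>', 'merged.sorted') or die "Can't open merged.sorted: $!";
    while (grep { defined } @current) {
        my $min;
        for my $i (0 .. $#current) {
            next unless defined $current[$i];
            $min = $i if !defined($min)
                or (split /\t/, $current[$i])[0] < (split /\t/, $current[$min])[0];
        }
        print MERGED $current[$min];
        $current[$min] = readline($handles[$min]);   # undef at end of file
    }
    close(MERGED);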

    Sounds like fun. :)

    --t. alex

    "Excellent. Release the hounds." -- Monty Burns.

Re: Sorting a large data set
by guha (Priest) on Dec 30, 2001 at 15:41 UTC
    If I understand your data structure correctly, the snippet below fits MY brain structure better.

    #!perl -w
    use strict;
    use Data::Dumper;

    my @list = (
        { dist => 3, },
        { dist => 1, },
        { dist => 11, },
    );

    my @sorted = sort { $a->{dist} <=> $b->{dist} } @list;

    print Dumper( \@list );
    print Dumper( \@sorted );
    This also agrees with the general consensus that symbolic refs are a bad thing, i.e. they should be avoided whenever possible.

    HTH

      While the syntax you are using is clearer, the resulting meaning is identical. Your note about symbolic references is a red herring: ${$a}{dist} dereferences a hard reference, not a symbolic one.
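
      For what it's worth, here's a tiny illustrative snippet (not from the original post) showing that the two spellings reach the same element, and that neither is a symbolic reference:

      use strict;   # 'strict refs' is happy with both forms below

      my $h = { dist => 42 };
      print ${$h}{dist}, "\n";   # 42 -- explicit dereference
      print $h->{dist}, "\n";    # 42 -- arrow syntax, same element

      # A symbolic ref would be something like $name = 'somehash';
      # ${$name}{dist} -- and strict refs would refuse to run that.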