jlf has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I've got a large array (12k+ elements) of hashes, and I would like to sort this array by one of the keys in the hashes. When I was testing the logic with only a few hundred elements, this code

@list = sort { ${$a}{dist} <=> ${$b}{dist} } @list;

was working nicely. But when I tried sorting the complete data set, I encountered out of memory errors.

I believe the path forward is to output each record to disk as it's generated, and then sort the file on disk later. It appears that Chris Nandor's File::Sort module would do the trick, although I suspect it may require a change from using Data::Dumper to writing one record per line. This isn't a problem, but is there a cleaner way to do this?
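
For concreteness, here's roughly what I have in mind for the output step -- one tab-separated record per line. (The id field and the generate_records() call are just stand-ins for however my records are actually produced.)

# Stream each record to disk as it's generated instead of
# accumulating everything in @list.
open(RECORDS, '>', 'records.txt') or die "Can't open records.txt: $!";
for my $rec (generate_records()) {
    print RECORDS join("\t", $rec->{id}, $rec->{dist}), "\n";
}
close(RECORDS);

The flat file could then be sorted on the dist column, either with File::Sort or an external sort.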

Thanks!
Josh

Replies are listed 'Best First'.
Re: Sorting a large data set
by clintp (Curate) on Dec 30, 2001 at 05:31 UTC
    You could do an in-place sort* but then you'd give up the speed of using Perl's built-in sort. The problem is that with almost any @foo = func @foo you wind up with two copies of @foo running around in memory.

    If you want to write them to disk, yeah, you're pretty much gonna have to format the records one-per-line as you write them. If you're on a unix system, calling an external sort(1) might even be quicker than File::Sort (I'll bet it is).
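
    Something along these lines, assuming one tab-separated record per line with the numeric distance in field 2 (untested, so treat it as a sketch rather than the definitive recipe):

    # Let the external sort(1) do the heavy lifting; it copes with
    # files far larger than available memory.
    system('sort', '-t', "\t", '-k', '2,2n', '-o', 'records.sorted', 'records.txt') == 0
        or die "sort(1) failed: $?";

    # Then read the sorted file back a line at a time.
    open(SORTED, '<', 'records.sorted') or die "Can't open records.sorted: $!";
    while (<SORTED>) {
        chomp;
        my ($id, $dist) = split /\t/;
        # ... process records in increasing order of dist ...
    }
    close(SORTED);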

    A third consideration is this: if the elements are huge create an "index" array that just has the numbers 0..$#list in it. Then sort it like this:

    for (0 .. 1000) { push @huge, { key => rand, value => 'hlaghlagh' } }
    @index = (0 .. $#huge);
    @index = sort { $huge[$a]->{key} <=> $huge[$b]->{key} } @index;
    At this point you can just use @index to access the array elements in sorted order. The problem in your case is that you've just got an array of references (the things you're sorting aren't huge; the list itself is), and I'm not sure an @index array of SVs is gonna be that much smaller than the references. If it were, you could just unwind it later with $_ = $huge[$_] for @index;

    12k elements doesn't seem like that much, though. Hrm.

    *I always thought an in-place sort would be a great use for scalar or void context sort. I haven't got the know-how to come up with the patch for that, though.

Re: Sorting a large data set
by khkramer (Scribe) on Dec 30, 2001 at 09:44 UTC

    I hate to ask this, but are you sure it's the sort line that's the culprit, or could some other manipulation be causing the out-of-memory problems? I've done lots of sort-and-assigns just like you're doing, and even on very large arrays of hashrefs (circa 100k elements) the overhead from the sort is never more than a few hundred kilobytes.

    Now to digress (or possibly not), there is one behavior peculiar to sorting arrays of references that I don't understand (and perhaps this -- or a variant -- is what's biting you)...

    # for @foo with 100,000 elements, this sort eats 12k of memory
    @foo = sort { $a->{bar} cmp $b->{bar} } @foo;

    # but for the same @foo, this sort eats 90M!
    @foo = sort @foo;
    @foo = sort { $a cmp $b } @foo;   # equivalent

    As far as I can tell, this "bloat" happens when you try to sort any list of references with the default comparison operator. (I'm running 5.6.1 on linux.) It doesn't happen just because you compare two references inside a sort block...

    # requires scads of memory
    @array_of_refs = sort { $a cmp $b } @array_of_refs;

    # doesn't
    @array_of_simple_scalars = sort { \$a cmp \$b } @array_of_simple_scalars;

    I would think that the default sort on @array_of_refs would be doing a lexical comparison on the "stringified" ref. But apparently, that's not the case. Even attempts to force "stringification" inside the sort block (but still refer to the ref) don't fix the problem...

    # scads
    @array_of_stringrefs = sort { ('a: ' . $a) cmp ('b: ' . $b) } @array_of_stringrefs;

    # scadless
    @array_of_stringrefs = sort { ('a: ' . $$a) cmp ('b: ' . $$b) } @array_of_stringrefs;

    Curiouser and curiouser. Can anyone shed any light on what might be going on here?

    Kwin
      Thanks, khkramer (and others!) for the responses.

      The plot thickens. I'm ashamed to admit I had only assumed the sort operation was to blame for the out of memory error, but it turns out that the sort finishes successfully, and the operation that's choking is really

      print AP_LIST Data::Dumper->Dump([\@list], [qw(*list)]);

      I'm using ActiveState, so I checked the mailing list, and it turns out this is a bug that was reported to the ActivePerl mailing list several weeks ago.
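
      In case anyone else trips over the same thing, the workaround I'm going to try (just a sketch, I haven't verified it against that bug yet) is to Dump one element per call instead of the whole array at once:

      # @list is the already-sorted array; dumping one hashref at a
      # time keeps the string Data::Dumper has to build small.
      open(AP_LIST, '>', 'ap_list.dump') or die "Can't open: $!";
      for my $i (0 .. $#list) {
          print AP_LIST Data::Dumper->Dump([ $list[$i] ], [ "list[$i]" ]);
      }
      close(AP_LIST);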

      Sorry to have dropped a red herring of sorts <grin> but thanks again for the feedback.

      Josh
Re: Sorting a large data set
by talexb (Chancellor) on Dec 30, 2001 at 09:31 UTC
    Another approach is to break the array into smaller chunks and sort each of the chunks. (I guess I'd use temporary disk files somehow, maybe using indexes into your original array?)

    Then I'd interleave the sorted chunks into a final, sorted monolith. If a few hundred elements worked fine during your testing, I'd probably crank it up to a thousand at a time ... you could even run some tests where you optimize, to see how fast you can get it to run.
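
    Here's a rough sketch of the merge step, assuming each sorted chunk file already holds one record per line with the numeric dist key first, tab-separated (the file names and layout are made up for illustration):

    use strict;

    # Open every pre-sorted chunk file.
    my @handles;
    for my $file (glob('chunk.*.sorted')) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        push @handles, $fh;
    }

    # Prime one line from each chunk, then repeatedly emit the line
    # with the smallest key until every chunk is exhausted.
    my @current = map { scalar <$_> } @handles;
    open(MERGED, '>', 'merged.sorted') or die "Can't open merged.sorted: $!";
    while (grep { defined } @current) {
        my $min;
        for my $i (0 .. $#current) {
            next unless defined $current[$i];
            $min = $i if !defined($min)
                or (split /\t/, $current[$i])[0] < (split /\t/, $current[$min])[0];
        }
        print MERGED $current[$min];
        $current[$min] = readline($handles[$min]);   # undef at end of file
    }
    close(MERGED);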

    Sounds like fun. :)

    --t. alex

    "Excellent. Release the hounds." -- Monty Burns.

Re: Sorting a large data set
by guha (Priest) on Dec 30, 2001 at 15:41 UTC
    If I understand your data structure correctly, the snippet below fits MY brain structure better.

    #!perl -w
    use strict;
    use Data::Dumper;

    my @list = (
        { dist => 3, },
        { dist => 1, },
        { dist => 11, },
    );

    my @sorted = sort { $a->{dist} <=> $b->{dist} } @list;

    print Dumper( \@list );
    print Dumper( \@sorted );
    This also agrees with the general consensus that symbolic refs are a bad thing, i.e. they should be avoided whenever possible.

    HTH

      While the syntax you are using is clearer, the resulting meaning is identical. Your note about symbolic references is a red herring: ${$a}{dist} dereferences a hard reference, not a symbolic one.
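
      For what it's worth, here's a tiny illustrative snippet (not from the original post) showing that the two spellings reach the same element, and that neither is a symbolic reference:

      use strict;   # 'strict refs' is happy with both forms below

      my $h = { dist => 42 };
      print ${$h}{dist}, "\n";   # 42 -- explicit dereference
      print $h->{dist}, "\n";    # 42 -- arrow syntax, same element

      # A symbolic ref would be something like $name = 'somehash';
      # ${$name}{dist} -- and strict refs would refuse to run that.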