in reply to Sorting a large data set

I hate to ask this, but are you sure it's the sort line that's the culprit, or could some other manipulation be causing the out-of-memory problems? I've done lots of sort-and-assigns just like you're doing, and even on very large arrays of hashrefs (circa 100k elements) the overhead from the sort is never more than a few hundred kilobytes.
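
One way to check whether the sort really is the culprit is to read the process's memory use just before and just after it. What follows is only a rough sketch: the rss_kb helper, the 'bar' key, and the generated test data are hypothetical stand-ins of mine, and the /proc/self/status probe assumes Linux, so on other platforms you would need a different way to measure.

```perl
use strict;
use warnings;

# Resident set size in kB, read from /proc/self/status.
# ASSUMPTION: Linux-style /proc; returns undef anywhere else.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Hypothetical stand-in for the real data: 100,000 small hashrefs.
my @list = map { +{ bar => sprintf('%06d', int rand 1_000_000) } } 1 .. 100_000;

my $before = rss_kb();
@list = sort { $a->{bar} cmp $b->{bar} } @list;
my $after = rss_kb();

if (defined $before && defined $after) {
    printf "sort grew RSS by %d kB\n", $after - $before;
}
else {
    print "no /proc/self/status here; use another memory probe\n";
}
```

Bracketing each suspect statement the same way (the sort, the dump, the file write) should show which one actually makes the process grow.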

Now to digress (or possibly not) there is one behavior peculiar to sorting arrays of references that I don't understand (and perhaps this -- or a variant -- is what's biting you)...

# for @foo with 100,000 elements, this sort eats 12k of memory
@foo = sort { $a->{bar} cmp $b->{bar} } @foo;

# but for the same @foo, this sort eats 90M !
@foo = sort @foo;
@foo = sort { $a cmp $b } @foo;   # equivalent

As far as I can tell, this "bloat" happens when you try to sort any list of references with the default comparison operator. (I'm running Perl 5.6.1 on Linux.) It doesn't happen just because you compare two references inside a sort block...

# requires scads of memory
@array_of_refs = sort { $a cmp $b } @array_of_refs;

# doesn't
@array_of_simple_scalars = sort { \$a cmp \$b } @array_of_simple_scalars;

I would think that the default sort on @array_of_refs would be doing a lexical comparison on the "stringified" ref. But apparently, that's not the case. Even attempts to force "stringification" inside the sort block (while still referring to the ref) don't fix the problem...

# scads
@array_of_stringrefs = sort { ('a: '.$a) cmp ('b: '.$b) } @array_of_stringrefs;

# scadless
@array_of_stringrefs = sort { ('a: '.$$a) cmp ('b: '.$$b) } @array_of_stringrefs;
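
One experiment that might help narrow this down is to take the references out of the comparison altogether: stringify each ref once up front, sort on the plain strings, and map back to the refs afterwards (a Schwartzian transform). This is only a sketch with made-up test data, and I haven't measured whether it avoids the bloat; it just produces the same ordering as the "scads" version while the sort block only ever compares ordinary strings.

```perl
use strict;
use warnings;

# Hypothetical test data: references to short strings.
my @array_of_stringrefs = map { my $s = "item $_"; \$s } reverse 1 .. 1000;

# Stringify each reference exactly once, sort on that plain string,
# then recover the original reference from the pair.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ "$_", $_ ] }
    @array_of_stringrefs;

printf "sorted %d refs by their stringified form\n", scalar @sorted;
```

If memory stays flat with this version but balloons with the plain { $a cmp $b } block, that would point at the stringification happening inside the comparison rather than at sort itself.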

Curiouser and curiouser. Can anyone shed any light on what might be going on here?

Kwin

Re: Re: Sorting a large data set
by jlf (Scribe) on Dec 31, 2001 at 10:24 UTC
    Thanks, khkramer (and others!) for the responses.

    The plot thickens. I'm ashamed to admit I had only assumed the sort operation was to blame for the out of memory error, but it turns out that the sort finishes successfully, and the operation that's choking is really

    print AP_LIST Data::Dumper->Dump([\@list], [qw(*list)]);

    I'm using ActiveState, so I checked the mailing list, and it turns out this is a bug that was reported to the ActivePerl mailing list several weeks ago.

    Sorry to have dropped a red herring of sorts <grin> but thanks again for the feedback.

    Josh