What really surprised me was how slow it became once I tried to sort the data just once
Agreed! Not sure if this is a bug or a feature but I found it very surprising!
On my machine, when testing Discipulus' superb one liner just now:
perl -lae "$r{$F[0]}+=$F[1]}{print qq($_\t$r{$_})for sort{$r{$b}<=>$r{
+$a}||$a cmp $b}keys %r" big1.txt big2.txt big3.txt >oneliner.tmp
I almost fell off my chair when
the execution time dropped from 92 seconds to 38 seconds (with identical correct output) simply by changing:
sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r
to:
sort{$r{$b}<=>$r{$a}}sort{$a cmp $b}keys %r
After picking myself up off the floor, I ran:
perl -MO=Terse -e "sort{$r{$b}<=>$r{$a}||$a cmp $b}keys %r"
LISTOP (0x299c080) leave [1]
OP (0x299c050) enter
COP (0x299c0c0) nextstate
LISTOP (0x299c120) sort
OP (0x299c160) pushmark
UNOP (0x2882ed0) null
LISTOP (0x2882fb0) scope
COP (0x2882ff0) null [195]
UNOP (0x2883050) null
LOGOP (0x2883088) or
BINOP (0x28831e8) ncmp [5]
UNOP (0x26610d0) null [150]
UNOP_AUX (0x299c010) multideref
OP (0x26611b8) null [7]
UNOP (0x2883228) null [150]
UNOP_AUX (0x299bfd0) multideref
OP (0x2661098) null [7]
BINOP (0x28830c8) scmp [8]
UNOP (0x2883178) null [14]
PADOP (0x28831b0) gvsv GV (0x18a8c0)
+*a
UNOP (0x2883108) null [14]
PADOP (0x2883140) gvsv GV (0x2655320)
+ *b
UNOP (0x2882f08) keys [11]
UNOP (0x2882f40) rv2hv [10]
PADOP (0x2882f78) gv GV (0x2654ae0) *r
which is 25 OPs ... while:
perl -MO=Terse -e "sort{$r{$b}<=>$r{$a}}sort{$a cmp $b}keys %r"
LISTOP (0x2940c18) leave [1]
OP (0x2940c90) enter
COP (0x2940bb8) nextstate
LISTOP (0x2940b78) sort
OP (0x2940c58) pushmark
UNOP (0x2990018) null
LISTOP (0x2940d38) scope
COP (0x2940d78) null [195]
BINOP (0x2940dd8) ncmp [5]
UNOP (0x2644940) null [150]
UNOP_AUX (0x298ff68) multideref
OP (0x2644a28) null [7]
UNOP (0x2940e18) null [150]
UNOP_AUX (0x298ff28) multideref
OP (0x2644908) null [7]
LISTOP (0x298ffa8) sort
OP (0x298ffe8) pushmark
UNOP (0x2940ad0) keys [11]
UNOP (0x2940b08) rv2hv [10]
PADOP (0x2940b40) gv GV (0x26344d8) *r
and (update):
perl -MO=Terse -e "sort{$r{$b}<=>$r{$a}}sort keys %r"
LISTOP (0x29e2518) leave [1]
OP (0x29bda60) enter
COP (0x29e2558) nextstate
LISTOP (0x29e25b8) sort
OP (0x29e25f8) pushmark
UNOP (0x29e2628) null
LISTOP (0x29e2778) scope
COP (0x29e27b8) null [195]
BINOP (0x29e2818) ncmp [5]
UNOP (0xed31b0) null [150]
UNOP_AUX (0x29bda20) multideref
OP (0xed3298) null [7]
UNOP (0x29e2858) null [150]
UNOP_AUX (0x29bd9e0) multideref
OP (0xed3178) null [7]
LISTOP (0x29e2660) sort
OP (0x29e26a0) pushmark
UNOP (0x29e26d0) keys [8]
UNOP (0x29e2708) rv2hv [7]
PADOP (0x29e2740) gv GV (0xec4098) *r
are just 20 OPs. I'm out of my depth here, so we might need
a dave_the_m-class Perl internals guru to get to the bottom of this mystery. :)
Update:
I believe the second example above can be shortened from:
sort{$r{$b}<=>$r{$a}}sort{$a cmp $b}keys %r
to simply:
sort{$r{$b}<=>$r{$a}}sort keys %r
...also, for correctness, I think we should further add:
use sort 'stable';
at the top of the program because we are relying on a stable sort for this trick to work.
Further update: Oops, this is not necessary, didn't read far enough in the
sort docs,
"The default sort has been stable since v5.8.0, and given this consistent behaviour for almost two decades, everyone has come to assume stability".
Updated: fixed typo, added -e after -MO=Terse in examples above.
Added comment about stable sort.