in reply to Re^5: Sorting a (very) large file (better*2)
in thread Sorting a (very) large file

Stress out much?

Instead of an ST, you offered something that is slower but uses less memory. I prefer alternatives to the ST that tend to be about twice as fast (the last time I bothered to time them) while also using a lot less memory.

You mentioned the memory cost of temporary copies of each line. We also shouldn't overlook the tons of anonymous arrays that an ST builds, which inflict a significant memory cost of their own.
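For anyone unfamiliar with the alternatives being alluded to, here is a minimal sketch (mine, not code from this thread) of an ST next to a GRT-style packed-key sort, assuming tab-delimited lines sorted numerically on a non-negative integer third field:

    # Schwartzian Transform: builds one anonymous array per line
    my @sorted_st  = map  { $_->[1] }
                     sort { $a->[0] <=> $b->[0] }
                     map  { [ 0 + ( split /\t/ )[2], $_ ] } @lines;

    # GRT-style packed keys: one flat string per line, sorted by the
    # default string sort, so no comparison sub and no anonymous arrays
    my @sorted_grt = map  { substr $_, 4 }
                     sort
                     map  { pack( 'N', ( split /\t/ )[2] ) . $_ } @lines;

The packed-key version is likely the kind of alternative being referred to: it avoids calling a Perl comparison sub for every pair and keeps one scalar per line instead of a two-element array.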

- tye        


Replies are listed 'Best First'.
Re^7: Sorting a (very) large file (comparison)
by samtregar (Abbot) on Nov 30, 2007 at 20:41 UTC
    Nah, I'm cool. But we both lost on this one - as far as I can tell you can't even load a 400MB file into the @lines array in 1GB of RAM. Unless my Perl is busted, which is possible but seems a bit unlikely.
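    (As an aside for readers who want to check the per-line overhead on their own machine, a small sketch using the CPAN module Devel::Size; the 40-byte line length is just an arbitrary stand-in:)

        use Devel::Size qw(total_size);
        # one million ~40-byte "lines" is roughly 40MB of raw data...
        my @lines = map { "x" x 40 } 1 .. 1_000_000;
        # ...but the array plus per-scalar bookkeeping reports far more
        printf "%.1f MB\n", total_size(\@lines) / 2**20;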

    -sam

      you can't even load the @lines array of a 400MB file in 1GB of RAM

      I believe there is this thing called "virtual memory"... Note that in my original reply I also noted:

      And I'd check how much paging space is configured. It sounds like that could be increased.

      Virtual memory can be very useful, especially if the thing you sort fits in physical memory: then you only page the larger data through a few times, rather than having the sort itself page through it something like O(log(N)) times.

      Hence my creation of a quite small array that is easy to sort. Actually, this makes me think of a minor change that might have a non-trivial impact on the size of the array that I sort:

      my @size= map 0 + ( split /\t/, $_ )[2], @in;
      #             ^^^ store numbers, not strings
      my @idx= sort { $size[$a] <=> $size[$b] } 0..$#in;
      @in= @in[@idx];

      You could also play games to reduce the number of times you run through @in (each pass causing paging), but I think the gains would be relatively minor compared to the added complexity. Except, perhaps, for the last iteration:

      for( @idx ) { print $in[$_], $/; }

      (Though, I half expect Perl to do @in= @in[@idx]; in-place.)
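      (Putting those pieces together, a self-contained sketch of the approach; the filename is a placeholder and it assumes the numeric size is the third tab-separated field:)

          my $file = 'data.tsv';                      # placeholder name
          open my $fh, '<', $file or die "open $file: $!";
          chomp( my @in = <$fh> );
          close $fh;

          my @size = map { 0 + ( split /\t/, $_ )[2] } @in;   # small numeric keys
          my @idx  = sort { $size[$a] <=> $size[$b] } 0 .. $#in;

          print $in[$_], "\n" for @idx;   # emit in sorted order without reordering @in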

      - tye        

        (Though, I half expect Perl to do @in= @in[@idx]; in-place.)

        Seems to me from a quick look at pp_hot.c in perl.git that aassign ops where the LHS and RHS contain common members are actually slower, not faster.

        >perl -MO=Concise -e"@in=@in[@idx]"
        e  <@> leave[1 ref] vKP/REFC ->(end)
        1     <0> enter ->2
        2     <;> nextstate(main 1 -e:1) v ->3
        d     <2> aassign[t7] vKS/COMMON ->e      <-- Note "COMMON"
        -        <1> ex-list lK ->a
        3           <0> pushmark s ->4
        9           <@> aslice lK ->a
        4              <0> pushmark s ->5
        6              <1> rv2av[t6] lK/1 ->7
        5                 <#> gv[*idx] s ->6
        8              <1> rv2av[t4] sKR/1 ->9
        7                 <#> gv[*in] s ->8
        -        <1> ex-list lK ->d
        a           <0> pushmark s ->b
        c           <1> rv2av[t2] lKRM*/1 ->d
        b              <#> gv[*in] s ->c
           1     995: /* If there's a common identifier on both sides we have to take
           1     996:  * special care that assigning the identifier on the left doesn't
           1     997:  * clobber a value on the right that's used later in the list.
           1     998:  */
        5492 <<  999: if (PL_op->op_private & (OPpASSIGN_COMMON)) {
        2062 << 1000:     EXTEND_MORTAL(lastrelem - firstrelem + 1);
        5492 << 1001:     for (relem = firstrelem; relem <= lastrelem; relem++) {
        5540 << 1002:         if ((sv = *relem)) {
          18 << br 1003:           TAINT_NOT; /* Each item is independent */
        5492 << 1004:             *relem = sv_mortalcopy(sv);
          18 << br 1005:       }
        5492 << 1006:     }
           1    1007: }
           1    1008:
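        (One rough way to measure the cost of that extra copying, a Benchmark sketch of my own rather than anything from this thread; it uses a simple reversal as the permutation:)

            use Benchmark qw(cmpthese);
            my @in  = map { "line $_" } 0 .. 99_999;
            my @idx = reverse 0 .. $#in;
            my @out;
            cmpthese( -2, {
                common   => sub { @in  = @in[@idx] },   # LHS and RHS share @in, so the COMMON path runs
                separate => sub { @out = @in[@idx] },   # distinct target array, no mortal copies needed
            } );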
        My experience with algorithms that used so much memory that they went into swap is pretty poor. I'm out of time to spend trying to prove my points today, but I think it's worth trying to stay in real RAM if performance is at all important to you!

        -sam