in reply to Re^7: Sorting a (very) large file (comparison)
in thread Sorting a (very) large file

you can't even load the @lines array of a 400MB file in 1GB of RAM

I believe there is this thing called "virtual memory"... Note that in my original reply I also noted:

And I'd check how much paging space is configured. It sounds like that could be increased.

Virtual memory can be very useful, especially if the thing you actually sort fits in physical memory, so you only page the larger data through a few times rather than having sort page through it something like O(log(N)) times.

Hence my creation of a quite small array that is easy to sort. Actually, this makes me think of a minor change that might have a non-trivial impact on the size of the array that I sort:

my @size= map 0 + ( split /\t/, $_ )[2], @in;
        #     ^^^ store numbers not strings
my @idx= sort { $size[$a] <=> $size[$b] } 0..$#in;
@in= @in[@idx];
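
To get a feel for how much that change matters, one could compare the two forms directly. This is a rough check of my own, not from the thread; it assumes the non-core Devel::Size module is available:

use strict;
use warnings;
use Devel::Size qw(total_size);

# Rough comparison of holding the sort keys as strings vs. as plain numbers;
# exact byte counts will vary by perl build.
my @as_strings = map { "$_" }   100_000 .. 200_000;
my @as_numbers = map { 0 + $_ } 100_000 .. 200_000;

printf "strings: %d bytes\nnumbers: %d bytes\n",
    total_size( \@as_strings ), total_size( \@as_numbers );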

You could also play games to reduce the number of times you run through @in (each pass causing paging), but I think the gains from that would be relatively minor compared to the complexity required. Except, perhaps, for the last iteration:

for( @idx ) { print $in[$_], $/; }

(Though, I half expect Perl to do @in= @in[@idx]; in-place.)
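
For context, a self-contained version of the whole idea might look something like this (my assumptions, carried over from the snippets above: input arrives on STDIN and the sort key is a numeric third tab-separated field):

#!/usr/bin/perl
use strict;
use warnings;

my @in = <STDIN>;          # the big array of lines
chomp @in;

my @size = map { 0 + ( split /\t/, $_ )[2] } @in;   # small array of plain numbers
my @idx  = sort { $size[$a] <=> $size[$b] } 0 .. $#in;

# Final pass: print through the index array instead of reordering @in.
print $in[$_], $/ for @idx;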

- tye        

Re^9: Sorting a (very) large file (memory)
by ikegami (Patriarch) on Nov 30, 2007 at 21:42 UTC

    (Though, I half expect Perl to do @in= @in[@idx]; in-place.)

    Seems to me from a quick look at pp_hot.c in perl.git that aassign ops where the LHS and RHS contain common members are actually slower, not faster.

    >perl -MO=Concise -e"@in=@in[@idx]"
    e  <@> leave[1 ref] vKP/REFC ->(end)
    1     <0> enter ->2
    2     <;> nextstate(main 1 -e:1) v ->3
    d     <2> aassign[t7] vKS/COMMON ->e        <-- Note "COMMON"
    -        <1> ex-list lK ->a
    3           <0> pushmark s ->4
    9           <@> aslice lK ->a
    4              <0> pushmark s ->5
    6              <1> rv2av[t6] lK/1 ->7
    5                 <#> gv[*idx] s ->6
    8              <1> rv2av[t4] sKR/1 ->9
    7                 <#> gv[*in] s ->8
    -        <1> ex-list lK ->d
    a           <0> pushmark s ->b
    c           <1> rv2av[t2] lKRM*/1 ->d
    b              <#> gv[*in] s ->c
    /* If there's a common identifier on both sides we have to take
     * special care that assigning the identifier on the left doesn't
     * clobber a value on the right that's used later in the list.
     */
    if (PL_op->op_private & (OPpASSIGN_COMMON)) {
        EXTEND_MORTAL(lastrelem - firstrelem + 1);
        for (relem = firstrelem; relem <= lastrelem; relem++) {
            if ((sv = *relem)) {
                TAINT_NOT;          /* Each item is independent */
                *relem = sv_mortalcopy(sv);
            }
        }
    }
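
    One way to check whether that mortal-copying shows up in practice would be a micro-benchmark along these lines (a sketch of my own, not from the thread; it assumes the same package-variable form as the Concise output above):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # When the same array appears on both sides, the aassign is flagged COMMON
    # and every element on the right is mortal-copied first, so it is plausibly
    # slower than slicing into a fresh array.
    our @in  = map { "line $_" } 0 .. 99_999;
    our @idx = reverse 0 .. $#in;

    cmpthese( -2, {
        common    => sub { @in = @in[@idx] },        # LHS and RHS share @in
        no_common => sub { my @out = @in[@idx] },    # fresh target array
    } );
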
Re^9: Sorting a (very) large file (memory)
by samtregar (Abbot) on Nov 30, 2007 at 21:06 UTC
    My experience with algorithms that used so much memory that they went into swap is pretty poor. I'm out of time to spend trying to prove my points today, but I think it's worth trying to stay in real RAM if performance is at all important to you!

    -sam

      Well, I'll try one more time, even though you don't seem to be paying much attention, IMHO.

      Iterating over something that doesn't fit in physical memory just once only takes about as much page faulting as it takes to allocate and fill the big thing in the first place (and filling it takes about the same number of page faults whether it fits in memory or not; well, up to twice as many). So iterating over something once can take about twice as long if it doesn't fit in real memory (not a big deal).

      Yes, a non-trivial "algorithm" on something that doesn't fit into real memory will often result in a lot more page faults. sort is O(N*log(N)), so it visits most of the items being sorted about O(log(N)) times, which means a list that doesn't fit into memory can make the sort run about log(N) times slower than if the items did fit (assuming that page-faulting through the list once takes quite a bit more time than the operations sort performs on it). That is often a huge difference. You needed 8,500,000 items to reproduce the sizes involved, so the slow-down can be proportional to log(8500000) (the "proportional" part means we can't use this theory to predict the multiplier, but it could certainly be quite large).
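
      For a rough sense of that multiplier (my assumption: the relevant log is base 2, roughly the number of passes a comparison sort makes over its data), the following prints a bit over 23:

          >perl -le "print log(8_500_000)/log(2)"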

      My approach iterates over the list of lines just a couple of times, so the impact of the lines not fitting into real memory should only make my code run 2-to-4-times slower. That might be worth spending more effort to reduce, but it can also be a quite minor concern, and it is certainly a huge win over the cost of sorting the too-big list directly, especially when you add to that too-big list all of the overhead of a ton of anonymous arrays.

      So, no, I don't follow the simplistic conclusion of "it doesn't fit into real memory, don't even bother".

      So, an interesting consequence of these realizations is that my solution, even though it uses more memory overall than your quite sound solution, may require many fewer page faults, because the thing actually being sorted (an array of numbers, not an array of lines) takes less memory. So using more memory can make things run faster when you don't have enough memory. But only if you use the memory wisely.

      - tye        

        I'm not convinced. I think you're neglecting to consider just how much slower disks are than memory. If you can sort the array directly in memory (which I originally thought was possible) but you're close to the memory limit (which I also thought was likely) then you'd be pretty foolish to use more memory (and thus start swapping) just to try to save time.

        If that bothers you sufficiently, here's how you can convince me: write some benchmarking code. I want to see a benchmark that shows that doing a parallel sort is worth it, even if you have to go significantly into swap to do it.
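
        A skeleton of the kind of harness being asked for might look something like this (the tab-separated format and numeric third field are assumptions carried over from earlier in the thread; the row count would need to be pushed well past physical RAM to actually test the swap claim):

        use strict;
        use warnings;
        use Benchmark qw(timethese);

        # Compare sorting the full lines directly against sorting a small
        # parallel array of numeric keys and reordering the lines by index.
        my @lines = map { join( "\t", "fieldA", "fieldB", int rand 1e9 ) . "\n" }
                    1 .. 1_000_000;    # grow until the data no longer fits in RAM

        timethese( 3, {
            direct_sort => sub {
                my @sorted =
                    sort { ( split /\t/, $a )[2] <=> ( split /\t/, $b )[2] } @lines;
            },
            index_sort => sub {
                my @size   = map { 0 + ( split /\t/, $_ )[2] } @lines;
                my @idx    = sort { $size[$a] <=> $size[$b] } 0 .. $#lines;
                my @sorted = @lines[@idx];
            },
        } );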

        Who knows, you might be right - I've learned to distrust my gut when it comes to performance. I find hard numbers much more convincing than argument.

        If you don't feel like spending the time I'll understand. I certainly don't.

        -sam