in reply to Re^2: Sorting a (very) large file (better*2)
in thread Sorting a (very) large file

I really doubt that's going to work in 1GB of RAM on a 400MB input (and don't worry about paging space - if you start paging then any speedup you gained just went bye-bye). I guess it's possible, but this line looks pretty bad to me:

   @in = @in[@idx];

Or do you happen to know that Perl does that in-place on @in? I guess the only way to really know would be to try it, but unfortunately I do have some real work to do...

-sam

Replies are listed 'Best First'.
Re^4: Sorting a (very) large file (better*2)
by tye (Sage) on Nov 30, 2007 at 20:15 UTC

    I believe there is this thing called "virtual memory"... Iterating over a list once doesn't have much "thrash" potential, unlike sorting.

    Update: Also note that the tons of anonymous arrays of the ST add a lot more memory than most people might think. I suspect that would be about as much or even more memory than required by the copies of each line.

    - tye        

      God damn it, you're going to make me actually try it, aren't you?

      UPDATE: What are you talking about? My solution wasn't an ST!

      -sam

        Damn, that didn't work at all. After creating a suitably sized file with:

           perl -le 'for (0 .. shift()) { $num = int rand(100_000); print qq{12-25-2005\t12:30 PM\t$num\tC:\\someplace\\somefile.txt}; }' 8500000 > file.txt

        I found that just loading the file into memory took up 1.7gb on my system! I tried a few variations without getting anything reasonable. I haven't really done a lot of Perl dev on this box - it's a Fedora 8 machine with the Fedora-built v5.8.8. Could be there's something wrong with it.

        -sam

        Stess out much?

        You offered, instead of an ST, something that is slower but uses less memory. I prefer to use alternatives to the ST that tend to be two times faster (last time I bothered to time some) while also using a lot less memory.

        You mentioned the memory cost of temporary copies of each line. I don't think we should overlook the cost of the tons of anonymous arrays that also inflict a significant memory cost of an ST.

        - tye