in reply to Re^3: Sorting a (very) large file (better*2)
in thread Sorting a (very) large file

I believe there is this thing called "virtual memory"... Iterating over a list once doesn't have much "thrash" potential, unlike sorting.

Update: Also note that the tons of anonymous arrays an ST creates add more memory overhead than most people realize. I suspect that comes to about as much memory as, or even more than, the copies of each line require.

- tye        

  • Comment on Re^4: Sorting a (very) large file (better*2)

Replies are listed 'Best First'.
Re^5: Sorting a (very) large file (better*2)
by samtregar (Abbot) on Nov 30, 2007 at 20:19 UTC
    God damn it, you're going to make me actually try it, aren't you?

    UPDATE: What are you talking about? My solution wasn't an ST!

    -sam

      Damn, that didn't work at all. After creating a suitably sized file with:

         perl -le 'for (0 .. shift()) { $num = int rand(100_000); print qq{12-25-2005\t12:30 PM\t$num\tC:\\someplace\\somefile.txt}; }' 8500000 > file.txt

      I found that just loading the file into memory took up 1.7 GB on my system! I tried a few variations without getting anything reasonable. I haven't really done a lot of Perl dev on this box - it's a Fedora 8 machine with the Fedora-built v5.8.8. Could be there's something wrong with it.

      -sam

        this script...

            $| = 1;
            my @data;
            $#data = 8500000;
            $#data = -1;
            while (<>) { push @data, $_ }
            print "sorting...\n";
            use Sort::Key qw(ukeysort_inplace);
            ukeysort_inplace { (split /\t/)[2] } @data;
            print "sorted\n";
            sleep 100;

        ... uses 800 MB for loading the data and then 50 MB more for sorting!

      Stress out much?

      You offered, instead of an ST, something that is slower but uses less memory. I prefer alternatives to the ST that tend to be about twice as fast (last time I bothered to time some) while also using a lot less memory.

      You mentioned the memory cost of temporary copies of each line. I don't think we should overlook the tons of anonymous arrays, which also add a significant memory cost to an ST.

      - tye        

        Nah, I'm cool. But we both lost at this one - as far as I can tell you can't even load the @lines array of a 400 MB file in 1 GB of RAM. Unless my Perl is busted, which is possible but seems a bit unlikely.

        -sam