in reply to Sorting based on filesize.

You need the Schwartzian Transform. It precomputes the sizes of all the files, and performs sorting based on the precomputed values, reducing the number of times it would otherwise read file size from the disk.

my @sorted = map {$_->[0] } sort {$a->[1] <=> $b->[1]} map {[$_, -s] } @files;

Replies are listed 'Best First'.
Re^2: Sorting based on filesize.
by grinder (Bishop) on Jul 20, 2004 at 06:28 UTC
    You need the Schwartzian Transform

    I'd say it's doubtful. Clearly, stat is slow, but it's not a slouch, and the Schwartz Transform adds its own overhead. I suspect the sweet spot where the ST pulls ahead is higher than you think.

    If we want to pursue this approach, an improvement will be to employ a Guttman-Rosler Transform:

    my @sorted = map substr($_, 4) => sort map pack(`L',$_) . $_ => @files; # but see eyepopslikeamosquito's comment below

    By being able to dip down and call the sort function without a code block will show greater benefits. With the proviso that one might have to pack with 8 byte 'Q' quads, for systems with filesizes larger than 2Gb.

    With that in mind, it's probably all premature optimisation. I would wait until the snippet showed itself as being a performance pig before going to all this bother:

    sort { -s $a <=> -s $b } @files

    ... does have the distinct advantage of being eminently readable.

    - another intruder with the mooring of the heat of the Perl

      my @sorted = map substr($_, 4) => sort map pack(`L',$_) . $_ => @files;

      For improved portability, that pack('L',$_) above should read pack('N',$_) because 'L' is native while 'N' is network (big-endian). See also Re: Advanced Sorting - GRT - Guttman Rosler Transform.

      I'd say it's doubtful. Clearly, stat is slow, but it's not a slouch, and the Schwartz Transform adds its own overhead. I suspect the sweet spot where the ST pulls ahead is higher than you think.
      Actually, this is the core of one of the alpaca exercises, and I forget the exact numbers, but the ST turns out to be faster after only 10 or 15 files in our investigation. Stat may not be doggy, but it's still a context switch and a lot of bit wrangling. Much better to do that only once per file.

      And (ahem) it's Schwartzian Transform not Schwartz Transform. If you're going to poo-poo it in favor of the more complex GRT, at least spell it right. {grin}

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        Talking of ST versus GRT comparisons, Uri's new Sort::Maker can do ST, GRT, Orcish (OR-Cache) and plain old sort, and makes things very easy to benchmark and change between different algorithms. It builds sorting code on the fly. Very cool.