heigold1 has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way to sort an array of file names based on each file's filesize?

Replies are listed 'Best First'.
Re: Sorting based on filesize.
by davido (Cardinal) on Jul 20, 2004 at 02:58 UTC

    ...assuming @files contains the list of filenames in the CWD:

    my @sorted = sort { -s $a <=> -s $b } @files;

    Or for a more complete example...

    use strict;
    use warnings;

    {
        opendir( my $dir, '.' ) or die "Can't open the directory.\n$!";
        local $, = "\n";
        print sort { -s $a <=> -s $b } grep -f, readdir($dir);
        closedir $dir;
    }

    See perlfunc under the -X functions to understand what -s is.
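    As a quick illustration of what -s hands back (the filename here is made
    up), it returns the size in bytes, or undef if the file can't be stat'ed:

    my $size = -s 'example.txt';   # hypothetical file
    print "example.txt is $size bytes\n" if defined $size;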


    Dave

Re: Sorting based on filesize.
by pbeckingham (Parson) on Jul 20, 2004 at 04:08 UTC

    You need the Schwartzian Transform. It computes the size of each file once up front, then sorts on those precomputed values, cutting down the number of times the file size would otherwise be read from disk.

    my @sorted = map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] }
                 map  { [ $_, -s $_ ] } @files;

      You need the Schwartzian Transform

      I'd say it's doubtful. Clearly, stat is slow, but it's not a slouch, and the Schwartz Transform adds its own overhead. I suspect the sweet spot where the ST pulls ahead is higher than you think.

      If we want to pursue this approach, an improvement would be to employ a Guttman-Rosler Transform:

      my @sorted = map substr($_, 4) => sort map pack('L', -s $_) . $_ => @files; # but see eyepopslikeamosquito's comment below

      The real gain comes from being able to dip down and call sort without a comparison block at all. The proviso is that one might have to pack 8-byte 'Q' quads on systems with file sizes larger than 2GB; a sketch of that variant follows.
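      As a rough sketch of that large-file variant (assuming a perl built with
      64-bit integers; the 'Q>' big-endian template needs perl 5.10 or later):

      my @sorted = map  { substr $_, 8 }           # strip the 8-byte size prefix
                   sort                            # plain string sort, no block
                   map  { pack('Q>', -s $_) . $_ } # big-endian size, then the name
                   @files;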

      With that in mind, it's probably all premature optimisation. I would wait until the snippet showed itself as being a performance pig before going to all this bother:

      sort { -s $a <=> -s $b } @files

      ... does have the distinct advantage of being eminently readable.

      - another intruder with the mooring of the heat of the Perl

        my @sorted = map substr($_, 4) => sort map pack('L', -s $_) . $_ => @files;

        For improved portability, that pack('L', -s $_) above should read pack('N', -s $_): 'L' packs in native byte order, while 'N' is network order (big-endian), which is what makes string comparison of the packed keys agree with numeric comparison of the sizes on any platform. See also Re: Advanced Sorting - GRT - Guttman Rosler Transform.
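        A quick demonstration of why byte order matters to the string sort (a
        sketch; what it prints depends on the machine's endianness):

        my $native  = pack('L', 256) lt pack('L', 1);   # true on little-endian boxes
        my $network = pack('N', 1)   lt pack('N', 256); # always true
        print "native 'L' keys sort 256 before 1\n" if $native;
        print "'N' keys keep numeric order\n"       if $network;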

        I'd say it's doubtful. Clearly, stat is slow, but it's not a slouch, and the Schwartz Transform adds its own overhead. I suspect the sweet spot where the ST pulls ahead is higher than you think.
        Actually, this is the core of one of the alpaca exercises, and I forget the exact numbers, but the ST turns out to be faster after only 10 or 15 files in our investigation. Stat may not be doggy, but it's still a context switch and a lot of bit wrangling. Much better to do that only once per file.
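        For anyone who wants to measure this on their own directory, a rough
        sketch using the core Benchmark module (the glob pattern and the
        2-seconds-per-test figure are only illustrative):

        use Benchmark qw(cmpthese);

        my @files = grep { -f } glob '*';

        cmpthese( -2, {
            naive => sub {
                my @s = sort { -s $a <=> -s $b } @files;
            },
            st    => sub {
                my @s = map  { $_->[0] }
                        sort { $a->[1] <=> $b->[1] }
                        map  { [ $_, -s $_ ] } @files;
            },
        });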

        And (ahem) it's Schwartzian Transform not Schwartz Transform. If you're going to poo-poo it in favor of the more complex GRT, at least spell it right. {grin}

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Sorting based on filesize.
by keszler (Priest) on Jul 20, 2004 at 03:24 UTC
    Someone correct me if I'm wrong, but won't sort { -s $a <=> -s $b } @files; stat each file multiple times, depending on the size and order in the array?
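    (A rough way to check, as a sketch, with @files assumed to already hold the
    names: each comparison stats two files, and sort runs the comparator
    roughly n*log(n) times, so the files do get stat'ed repeatedly.)

    my $comparisons = 0;
    my @sorted = sort { $comparisons++; -s $a <=> -s $b } @files;
    printf "%d files, %d comparisons, about %d stat calls\n",
           scalar @files, $comparisons, 2 * $comparisons;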

    If so, and if RAM is cheaper than stat calls, then

    @files = grep { -f } glob '* .*';
    @hash{ (map { -s $_ } @files) } = @files;
    @files = @hash{ sort { $a <=> $b } keys %hash };
    print map { $_, $/ } @files;
    a hash of filenames keyed on sizes could be used instead.
      a hash of filenames keyed on sizes could be used instead.

      Well, when two or more files happen to have the same size, the hash will only keep one file name from that set. If that's okay, then yes, a simple hash like this could be used.

      If that's not okay, then you'd need a hash of arrays:

      my %hash;
      opendir( D, "." );
      while ( $_ = readdir D ) {
          next unless (-f);
          push @{ $hash{ -s _ } }, $_;    # uses previous stat data (from -f)
      }
      for my $size ( sort { $a <=> $b } keys %hash ) {
          for my $file ( sort @{ $hash{$size} } ) {
              print "$size\t$file\n";
          }
      }
      update: added comment about how many times stat is actually called on each file (i.e. just once, not twice), and wanted to point out that the Schwartzian Transform as demonstrated by pbeckingham is likely to be the best solution overall.
        Good point. This fixes it:

        @files = grep { -f } glob '* .*';
        @hash{ (map { (-s $_) . ".$_" } @files) } = @files;
        @files = @hash{ sort { $a <=> $b } keys %hash };
        print map { $_, $/ } @files;
        but it's rather a moot point - the Schwartzian Transform goes a step further and eliminates the hash.
Re: Sorting based on filesize.
by eyepopslikeamosquito (Archbishop) on Jul 20, 2004 at 03:03 UTC

    Here's an example to get you started:

    my @files  = glob("*");
    my @sorted = sort { -s $a <=> -s $b } @files;
Re: Sorting based on filesize.
by cLive ;-) (Prior) on Jul 20, 2004 at 02:44 UTC
    Yes.

    cLive ;-)

    ps - you might want to look up "File Test Operators"
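    For instance (a quick illustration only; the filename is invented):

    my $file = 'example.dat';
    print "exists\n"           if -e $file;   # file exists
    print "plain file\n"       if -f $file;   # is a plain file
    print "readable\n"         if -r $file;   # readable by the effective uid
    print -s $file, " bytes\n" if -s $file;   # nonzero size, returned in bytes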