Re: finding top 10 largest files
by Abigail-II (Bishop) on Feb 02, 2004 at 23:52 UTC
This is how I would do it:
#!/usr/bin/perl
use strict;
use warnings;
no warnings qw /syntax/;
open my $fh => "find / -type f |" or die;
my @sizes = map {[-1 => ""]} 1 .. 10;
while (<$fh>) {
    chomp;
    my $size = -s;
    next if $size < $sizes [-1] [0];                 # smaller than the smallest kept entry
    foreach my $i (0 .. $#sizes) {
        if ($size >= $sizes [$i] [0]) {
            splice @sizes => $i, 0 => [$size => $_]; # insert at its sorted position
            pop @sizes;                              # keep only the ten largest
            last;
        }
    }
}
@sizes = grep {length $_ -> [1]} @sizes;
printf "%8d: %s\n" => @$_ for @sizes;
__END__
Purists may want to use File::Find instead of find.
Abigail
Please excuse me for being 95% off topic, but maybe a search should reveal the following link. For me, Perl is mainly command-line work, and I love less, which, find, etc., and their combinations.
Thank You.
With the exception of which, which has a better Perl equivalent called pwhich, you should try http://unxutils.sourceforge.net/ as a standalone alternative to Cygwin.
Quote from the description:
Here are some ports of common GNU utilities to native Win32. In this context, native means the executables do only depend on the Microsoft C-runtime (msvcrt.dll) and not an emulation layer like that provided by Cygwin tools.
But nevertheless you run into problems with find and echo, as they have DOS equivalents with the same names but limited functionality. If you have Novell, you will also hit e.g. ls, depending on your path.
And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!"
(Terry Pratchett, Small Gods)
"Purists may want to use File::Find instead of find."
Also, people who don't like to write horrible, crash-prone, security-holed scripts might like to use the -print0 action when backticking (or "open-to-pipe"-ing) find commands, coupled, of course, with local $/ = "\0";.
Oh, likewise, if you're piping find into an xargs call, it's always a good idea to use find ... -print0 | xargs -0 .... Anybody ever think that there should be a shellmonks?
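For instance, a minimal sketch of that combination (the starting directory and the per-file handling here are just placeholders):
open my $fh, '-|', 'find', '/', '-type', 'f', '-print0'
    or die "Can't run find: $!";
{
    local $/ = "\0";              # records end in NUL, not newline
    while (my $path = <$fh>) {
        chomp $path;              # strips the trailing "\0", because of $/
        my $size = -s $path;      # safe even for names containing newlines or spaces
        # ... do something with $path and $size ...
    }
}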
------------
:Wq
Not an editor command: Wq
Though I doubt anyone wants to be known as a shellmonk, it would be very useful to have a few pages of "now that you know it in Perl, how do you do it the hard way", just as a reference :)
I don't know how many times I skip past a find or a grep and jump straight into Perl due to my general slackness and intolerance for their various idiosyncrasies.
I also find that slackness causes me to use 'slocate' instead of 'find', but that's impatience and thus a virtue :) Let's face it though: Perl is just easier.
"Purists may want to use File::Find instead of find."
That is, purists with lots of time on their hands... Whenever I have tried to compare "find" vs. "File::Find", the Perl module seems to take about 5 times longer than the compiled utility, in terms of wall-clock time to execute.
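A rough way to check this on your own system (the start directory is only an example, and filesystem caching will skew the first run):
use Benchmark qw(timethese);
use File::Find;

timethese(3, {
    'File::Find' => sub {
        my $n = 0;
        File::Find::find(sub { $n++ if -f }, '/usr');
    },
    'find(1)' => sub {
        my $n = 0;
        open my $fh, '-|', 'find', '/usr', '-type', 'f' or die $!;
        $n++ while <$fh>;
        close $fh;
    },
});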
Not only is find usually faster to execute than File::Find, it takes less programmer time as well.
Abigail
Re: finding top 10 largest files
by tachyon (Chancellor) on Feb 03, 2004 at 00:07 UTC
To sort the %sizehash keys numerically (descending) you would do:
for my $key ( sort { $b <=> $a } keys %sizehash ) {
    # blah
}
HOWEVER -> From the practical point of view your data structure is ass about. You really want the full path as the key (unique), not the file size (not unique). For example, if you have two files of 1,000,001 bytes, then only the second one you find will be seen, as it will overwrite the filename stored with the 1,000,001-byte key.
To sort a hash numerically by value, once you fix the BUG, you would do:
for my $file ( sort { $sizehash{$b} <=> $sizehash{$a} } keys %sizehash ) {
    last if ++$y > 10;                  # stop after the ten largest
    my $size = commas($sizehash{$file});
    print "$size: $file\n";
}
I knew that my data structure was really mucked up, but I needed to get some code working. I had tried for some time to come up with a data structure that would keep only 10 items in it but could not figure it out. So speed of initial implementation won out.
The comments and recommendations are well received. Thanks!
Ed
Amazed you replied so late! Anyway, we all understand pressure... but it is quicker to do it right the first time so you don't have to redo it later.
Glad it helped you. Perl is a real power tool on any system. On Win32, for example, you can get a file grep with a one-liner like:
perl -ne "print qq{$.\t$_} if m/whatever/" file.txt
Re: finding top 10 largest files
by borisz (Canon) on Feb 03, 2004 at 00:02 UTC
And here is my try.
The first parameter is the directory to search.
#!/usr/bin/perl
use File::Find;
File::Find::find(
{
wanted => sub {
return unless -f;
my $s = -s _;
return if $min < $s && @z > 10;
push @z, [ $File::Find::name, $s ];
@z = sort { $b->[1] <=> $a->[1] } @z;
pop @z if @z > 10;
$min = $z[-1]->[1];
}
},
shift || '.'
);
for (@z) {
print $_->[0], " ", $_->[1], "\n";
}
My idea is very similar to both borisz's and Abigail's.
borisz's solution has some "issues":
- the return line should be
return if $s < $min && @z == 10;
- the numeric sort is inefficient because it entails several Perl ops per comparison.
My solution uses a GRT (Guttman-Rosler Transform) for maximum efficiency.
File::Find::find( sub {
    return unless -f;
    # prepend the size as a fixed-width, lexically sortable key (the GRT part)
    @z = sort @z, sprintf( "%32d:", -s ) . $File::Find::name;
    shift @z if @z > 10;   # drop the smallest once we have more than ten
}, $dir );
print "$_\n" for @z;
jdporter
The 6th Rule of Perl Club is -- There is no Rule #6.
Hi Boris,
Your Perl code searches for the top 10 files in all subdirectories, even across different filesystems. For example, I have /var and /var/log as separate filesystems; my requirement is to search for the top 10 files under /var only, but your code searches through /var/log too. Is there a way to restrict this in your script?
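One possible approach (not from this thread; the wanted callback below is just a placeholder for the code above) is to compare device numbers in a preprocess callback, so find() never descends into a directory that lives on a different filesystem:
use File::Find;

my $start     = shift || '/var';       # example starting point
my $start_dev = (stat $start)[0];      # device number of the starting filesystem

File::Find::find(
    {
        preprocess => sub {
            # keep non-directories, and only those directories on the same device
            grep { ! -d $_ || (stat _)[0] == $start_dev } @_;
        },
        wanted => sub {
            # ... same wanted code as above ...
        },
    },
    $start
);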
Re: finding top 10 largest files
by graff (Chancellor) on Feb 03, 2004 at 02:57 UTC
|
If you happen to be on a system with lots of files and directories, you may want to avoid any strategy that involves keeping all the path/file names in a hash. I've been burned by this, using a script that was originally created for CD-ROMs and trying to use it on DVDs -- one day, somebody actually ran it on a DVD that just happened to contain over a million files, and it brought the system to its knees.
In this regard, Abigail's original suggestion seems best: using a pipeline file handle that runs "find" is very fast and economical in terms of memory, and keeping track of only the 10 largest files seen so far ensures that the script won't blow up as the file space gets bigger.
< Please don't burn me ;-) >
Since you're on Windows, there's always:
Start->Search->For Files or Folders
Pick the Drive, Go to Size, then pick a limit, say 'at least 10000 Kb', then 'Find'
Sort the results by clicking on 'Size'
cheers
zeitgheist
Re: finding top 10 largest files
by pelagic (Priest) on Feb 03, 2004 at 12:07 UTC
And yet another possibility I found in my code stash.
It lists the top n file sizes, with all files having those sizes:
#!/usr/bin/perl
use strict;
my ($dir_init, $top_n) = @ARGV;
my (@ls, @dirs, %top, $size, $obsolete_size);
my $sep = "~" x 80;
my $form = "%20s %s\n";
my $min_size = 0;
my $files = 0;
push(@dirs, $dir_init);
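A minimal sketch of a traversal that would fit the variables declared above; the loop body and the commify helper are assumptions, not pelagic's actual code:
while (@dirs) {
    my $dir = shift @dirs;
    opendir my $dh, $dir or next;
    for my $entry (grep { $_ ne '.' && $_ ne '..' } readdir $dh) {
        my $path = "$dir/$entry";
        next if -l $path;                             # skip symlinks for simplicity
        if (-d $path) { push @dirs, $path; next }
        next unless -f _;
        $files++;
        $size = -s $path;
        next if $size < $min_size and keys %top >= $top_n;
        push @{ $top{$size} }, $path;                 # group files by size
        if (keys %top > $top_n) {                     # too many distinct sizes:
            my ($low) = sort { $a <=> $b } keys %top; # drop the smallest one
            delete $top{$low};
        }
        ($min_size) = sort { $a <=> $b } keys %top;
    }
    closedir $dh;
}
sub commify { my $n = reverse shift; $n =~ s/(\d{3})(?=\d)/$1,/g; scalar reverse $n }
print "$sep\nTotal number of files examined: $files\n$sep\n";
print "List of top $top_n sized files within $dir_init :\n";
for my $s (sort { $b <=> $a } keys %top) {
    printf $form, commify($s), '';
    printf $form, '', $_ for @{ $top{$s} };
}
print "$sep\n";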
Call it like:
perl fszTop10.pl /u2/integ 10
and the output looks like:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total number of files examined: 2331
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
List of top 10 sized files within /u2/integ/t096351 :
23,836,788
/u2/integ/t096351/tmp/PDF/NV61DE.PDF
5,731,355
/u2/integ/t096351/test/gsp_toCMT.tar.gz
180,042
/u2/integ/t096351/jcs553_functions_ALL.sql.v1
/u2/integ/t096351/jcs553_functions_ALL.sql.v2
/u2/integ/t096351/jcs553_functions_ALL.sql.copy
105,520
/u2/integ/t096351/ccm_ui.log
56,216
/u2/integ/t096351/ccm_eng.log
24,072
/u2/integ/t096351/dead.letter
14,506
/u2/integ/t096351/.ccm.ini
11,492
/u2/integ/t096351/.sh_history
5,111
/u2/integ/t096351/.dtprofile
2,561
/u2/integ/t096351/.Xauthority
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pelagic
Update
I had to fix a bug I found while using the script again.
Re: finding top 10 largest files
by ambrus (Abbot) on Feb 03, 2004 at 09:13 UTC
I think a good way to do this is to write a Perl script that selects the files larger than, say, 20M, and not bother with the sorting in it.
Then you can easily sort | head the resulting few files.
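A rough sketch of that approach (the 20 MB threshold and the starting directory are only illustrative):
find / -type f | perl -nle 'print -s _, "\t", $_ if -f $_ and -s _ > 20_000_000' | sort -rn | head
Here sort -rn orders the lines numerically on the leading size, and head keeps the top ten.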
By the way, I have some old code doing something similar. I have a file with a lot of records, and I have to select the 64 "best" records.
Here's the code, with omissions (##)
## ... HEADER STUFF ...
@sellpt = (0) x 64;
@sellx  = @sellpt;
@selly  = @sellpt;
## ... MORE HEADER STUFF, OPENING HANDLE "VAR" ...
while (<VAR>) {
    my ($p, $x, $y);
    ## FILLING ($x,$y,$p) WITH DATA FROM RECORDS, MUST GET THE ONES WITH LARGEST $p
    $sellpt[-1] < $p and do {
        my ($u, $v) = (0, 0 + @sellpt);
        defined($sellpt[int(($u + $v) / 2)]) or
            die qq([D30] $u $v (@sellpt));
        while ($u < $v) {
            if ($p < $sellpt[int(($u + $v) / 2)])
                { $u = int(($u + $v) / 2) + 1 }
            else
                { $v = int(($u + $v) / 2) };
        }
        @sellpt = (@sellpt[0 .. $u - 1], $p, @sellpt[$u .. @sellpt - 2]);
        @sellx  = (@sellx[0 .. $u - 1],  $x, @sellx[$u .. @sellx - 2]);
        @selly  = (@selly[0 .. $u - 1],  $y, @selly[$u .. @selly - 2]);
    };
    ## ... SOME MORE STUFF ...
};
## ... AND SOME MORE, WRITING OUT @sellx,@selly IN SUITABLE FORMAT ...
Re: finding top 10 largest files
by rje (Deacon) on Feb 03, 2004 at 15:08 UTC
Rather than holding every file in the hash, could you instead only keep the top 10 entries with each iteration? I suppose that might be slow, eh?
Pseudoperl follows... caveat coder.
my @topTen = ();
while( moreFiles() )
{
push @topTen, some_munged_key_using_file_size_and_name();
@topTen = (reverse sort @topTen)[0..9];
}
Sorry if I'm missing something. My brain's not working well today...
... that's what my solution (see 326175) does ...
Only keep the information for the currently known biggest files.
pelagic
Ugh! Yep, you're right, my bad... only, the code is too long for my liking. Can it be shorter? Could I use the operating system to pare down the code a bit by pre-gathering a list of all files? Something like
my @files = (reverse sort `dir /A-D /S`)[0..9];
In DOS, or ... (big pause)...
Aw shoot, in DOS the solution isn't even Perl:
dir /A-D /O-S /S
That recursively lists all files from the current working directory on, sorted by largest file first within each directory. I imagine there's a combination of options to ls that will do the same thing, eh? Maybe
ls -alSR (which doesn't sort across directories)
ls -alR | sort -k 5 (maybe?)
(Except those don't suppress the directory names. Hmmm.)
Sorry, I meant to write perl, but it came out rather OS-specific... but it's a lot smaller than the perl solution. Is that Appeal To False Laziness?
Re: finding top 10 largest files
by ambrus (Abbot) on Mar 05, 2004 at 14:33 UTC
find / -printf "%s\t%p\n" | perl heap
where heap is the Perl program listed at the node Re: Re: Re: Re: Sorting values of nested hash refs. The find prints all files together with their sizes; the script uses a binary heap to select the 32 largest ones from it. For more information, read its thread, which is about a similar problem.
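The referenced node's code isn't reproduced here, but a minimal sketch of such a filter (reading size-TAB-path lines on stdin and keeping the N largest with a binary min-heap) could look like this:
#!/usr/bin/perl
use strict;
use warnings;

my $N = 32;
my @heap;                        # min-heap of [size, original line]

sub bubble_up {
    my $i = $#heap;
    while ($i > 0) {
        my $p = int(($i - 1) / 2);
        last if $heap[$p][0] <= $heap[$i][0];
        @heap[$p, $i] = @heap[$i, $p];
        $i = $p;
    }
}

sub sift_down {
    my $i = 0;
    while (1) {
        my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
        my $s = $i;
        $s = $l if $l < @heap && $heap[$l][0] < $heap[$s][0];
        $s = $r if $r < @heap && $heap[$r][0] < $heap[$s][0];
        last if $s == $i;
        @heap[$s, $i] = @heap[$i, $s];
        $i = $s;
    }
}

while (<STDIN>) {
    chomp;
    my ($size) = split /\t/;
    if (@heap < $N) {
        push @heap, [$size, $_];
        bubble_up();
    }
    elsif ($size > $heap[0][0]) {       # beats the smallest file we kept
        $heap[0] = [$size, $_];
        sift_down();
    }
}

print "$_->[1]\n" for sort { $b->[0] <=> $a->[0] } @heap;
It never holds more than 32 entries in memory, which is the point of the approach.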
Update: on my system, this finds /proc/kcore as one of the largest files. You'll have to find out how not to include the /proc filesystem in the search. Also, this will list hard links multiple times.