finding top 10 largest files

by bfdi533 (Friar)
on Feb 02, 2004 at 23:32 UTC ( id://326053 )

bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find the top 10 largest files on my system.

At first I thought I would try to use some sort of stack to keep track of the 10 highest but the code quickly went beyond my mental reach.

Then I thought I would just store the file names of all files in a hash, with the key being the file size. This works reasonably well, but it is a really bad idea for a filesystem with lots of files. Still, I implemented it to see if I could get the top 10 out of it. I ran into trouble with the output, as it did not print the expected results.

Here is the code to output the top 10 files:

$y = 0;
for $key (sort { $hash{$b} <=> $hash{$a} || length($b) <=> length($a) } keys %sizehash) {
    if ($y < 10) {
        $res = &commas($key);
        print "$res: $sizehash{$key}\n";
        $y++;
    }
}

And here is the output I got:

Total files: 104644
Largest file: c:\/My Virtual Machines/Gentoo/Gentoo
Largest file size: 1,533,542,400
Smallest file: c:\
Smallest file size: 0
1,008,164,864: c:\/Program Files/dtSearch/UserData/personal/index_k_4.ix
1,533,542,400: c:\/My Virtual Machines/Gentoo/Gentoo
102,177,098: c:\/Data/obbd/Org Basic Building Binder/Building Binder.zip
135,019,052: c:\/My Virtual Machines/Gentoo/Gentoo.vmss
148,851,791: c:\/Data/Paraben/foch-beta.rar
569,366,528: c:\/data_transfer/Software/ISO/en_windows_server_2003_enterprise_vl.iso
144,244,736: c:\/Documents and Settings/davisone/Local Settings/Application Data/Microsoft/Outlook/archive_2003q3.pst
344,746,496: c:\/My Documents/My Virtual Machines/Windows95/Windows 98.vmdk
176,308,736: c:\/My Documents/My Virtual Machines/Windows95/Windows95.vmdk
524,288,000: c:\/Data/Personal.vol

As you can see, the output is not numerically sorted as I would expect.

What have I done wrong here?

Also, any recommendations on a better way to implement a search like this, other than keeping a hash entry for every file on the filesystem, would be great.

Ed

Replies are listed 'Best First'.
Re: finding top 10 largest files
by Abigail-II (Bishop) on Feb 02, 2004 at 23:52 UTC
    This is how I would do it:
    #!/usr/bin/perl

    use strict;
    use warnings;
    no warnings qw /syntax/;

    open my $fh => "find / -type f |" or die;

    my @sizes = map {[-1 => ""]} 1 .. 10;

    while (<$fh>) {
        chomp;
        my $size = -s;
        next if $size < $sizes[-1][0];
        foreach my $i (0 .. $#sizes) {
            if ($size >= $sizes[$i][0]) {
                splice @sizes => $i, 0 => [$size => $_];
                pop @sizes;
                last;
            }
        }
    }

    @sizes = grep {length $_->[1]} @sizes;
    printf "%8d: %s\n" => @$_ for @sizes;
    __END__
    Purists may want to use File::Find instead of find.

    Abigail

      Purists may want to use File::Find instead of find.

      Purists or Win32 users like the OP for example....

      cheers

      tachyon

        Oh, come on. You must know by now that the standard Unix tools have been ported to Windows, multiple times?

        There's no reason to feel left out if you're on a Windows platform, and someone uses 'find'.

        Abigail

        Please excuse me for being 95% off topic, but maybe a search should have revealed the following link. For me Perl is mainly command-line work, and I love less, which, find, etc. and their combinations.
        Thank You.

        With the exception of which, which has a better Perl equivalent called pwhich, you should try http://unxutils.sourceforge.net/ as a standalone alternative to Cygwin.
        Quote from description:

        Here are some ports of common GNU utilities to native Win32. In this context, native means the executables do only depend on the Microsoft C-runtime (msvcrt.dll) and not an emulation layer like that provided by Cygwin tools.

        Nevertheless, you run into problems with find and echo, as DOS has commands with the same names but limited functionality. If you have Novell, you may also hit e.g. ls, depending on your path.

        And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!"
        (Terry Pratchett, Small Gods)

      Purists may want to use File::Find instead of find.

      Also, people who don't like to write horribly crash-prone and security-holed scripts might like to use the -print0 action when backticking (or "open-to-pipe"-ing) find commands, coupled, of course, with local $/ = "\0";.

      Oh, likewise, if you're finding into an xargs call... it's always a good idea to find ... -print0 | xargs -0 .... Anybody ever think that there should be a shellmonks?
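
      A minimal sketch of the -print0 plus local $/ idea (the "keep the largest" loop body is left as a stub; adapt as needed):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # -print0 emits "\0"-terminated names, so names containing newlines
      # or other oddities are read back intact.
      open my $fh, '-|', 'find', '/', '-type', 'f', '-print0'
          or die "Cannot run find: $!";

      local $/ = "\0";                # one "record" per file name
      while (my $name = <$fh>) {
          chomp $name;                # chomp strips the trailing "\0" under this $/
          my $size = -s $name;
          # ... keep track of the largest files here ...
      }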

      ------------ :Wq Not an editor command: Wq
        Though I doubt anyone wants to be known as a shellmonk, it would be very useful to have a few pages of "now that you know it in Perl, how do you do it the hard way", just as a reference :)

        I don't know how many times I skip past a find or a grep and jump straight into Perl, due to my general slackness and intolerance for the various idiosyncrasies.

        I also find that slackness causes me to use 'slocate' instead of 'find', but that's impatience, and thus a virtue :) Let's face it though: Perl is just easier.

      Purists may want to use File::Find instead of find.

      That is, purists with lots of time on their hands... Whenever I have tried to compare "find" vs. "File::Find", the Perl module seems to take about 5 times longer than the compiled utility, in terms of wall-clock time to execute.

        Not only is find usually faster to execute than File::Find, it takes less programmer time as well.

        Abigail

Re: finding top 10 largest files
by tachyon (Chancellor) on Feb 03, 2004 at 00:07 UTC

    You have file size as the key to your hash so sort just needs to be:

    for my $key ( sort { $b <=> $a } keys %sizehash ) {
        # blah
    }

    HOWEVER -> From the practical point of view your data structure is ass about. You really want the full path as the key (unique), not the file size (not unique). For example, if you have two files of 1,000,001 bytes, only the second one you find will be seen, as it will overwrite the filename stored under the 1,000,001-byte key.

    To sort a hash numerically by value, once you fix the BUG, you would do:

    for my $file ( sort { $sizehash{$b} <=> $sizehash{$a} } keys %sizehash ) {
        my $size = commas($sizehash{$file});
        last if $y++ >= 10;
        print "$size: $file\n";
    }
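
    A minimal sketch of building %sizehash the right way round (path as the unique key), assuming File::Find; the hash is then sorted by value exactly as above:

    use strict;
    use warnings;
    use File::Find;

    my %sizehash;
    find(sub {
        return unless -f;                      # regular files only
        $sizehash{$File::Find::name} = -s _;   # unique path => size
    }, 'c:/');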

    cheers

    tachyon

      I knew that my data structure was really mucked up but needed to get some code working. I had tried for some time to get some sort of data structure that would keep only 10 items in it but could not figure it out. So, speed of initial implementation won out.

      The comments and recommendations are well received. Thanks!

      Ed

        Amazed you replied so late! Anyway, we all understand pressure... but... it is quicker to do it right the first time so you don't have to redo it later.

        Glad it helped you. Perl is a real power tool on any system. On Win32 you can, for example, get a file grep with a one-liner like:

        perl -ne "print "$.\t$_" if m/whatever/" file.txt

        cheers

        tachyon

Re: finding top 10 largest files
by borisz (Canon) on Feb 03, 2004 at 00:02 UTC
    And here is my try. The first parameter is the directory to search.
    #!/usr/bin/perl
    use File::Find;

    File::Find::find(
        {   wanted => sub {
                return unless -f;
                my $s = -s _;
                return if $min < $s && @z > 10;
                push @z, [ $File::Find::name, $s ];
                @z = sort { $b->[1] <=> $a->[1] } @z;
                pop @z if @z > 10;
                $min = $z[-1]->[1];
            }
        },
        shift || '.'
    );
    for (@z) {
        print $_->[0], " ", $_->[1], "\n";
    }
    Boris
      My idea is very similar to both borisz's and Abigail's. borisz's solution has some "issues":
      1. the return line should be
        return if $s < $min && @z == 10;
      2. the numeric sort is inefficient because it entails several Perl ops per comparison. My solution uses the GRT (Guttman-Rosler Transform) for maximum efficiency.
      File::Find::find(
          sub {
              return unless -f;
              # pack the size into a fixed-width prefix so a plain string sort
              # orders the entries numerically (the GRT trick)
              @z = sort @z, sprintf( "%32d:", -s ) . $File::Find::name;
              shift @z if @z > 10;   # drop the smallest once we hold more than 10
          },
          $dir
      );
      print "$_\n" for @z;

      jdporter
      The 6th Rule of Perl Club is -- There is no Rule #6.

      Hi Boris, your Perl code searches for the top 10 files in all subdirectories, even across different filesystems. For example, I have /var and /var/log as separate filesystems; my requirement is to search for the top 10 files under /var only, but your code descends into /var/log too. Is there a way to restrict this in your script?
Re: finding top 10 largest files
by graff (Chancellor) on Feb 03, 2004 at 02:57 UTC
    If you happen to be on a system with lots of files and directories, you may want to avoid any strategy that involves keeping all the path/file names in a hash. I've been burned by this, using a script that was originally created for CD-ROMs and trying to use it on DVDs -- one day, somebody actually ran it on a DVD that just happened to contain over a million files, and it brought the system to its knees.

    In this regard, Abigail's original suggestion seems best -- using a pipeline file handle that runs "find" is very fast and economical in terms of memory, and only keeping track of the 10 largest files seen so far will assure that the script won't blow up as the file space gets bigger.

      < Please don't burn me ;-) >

      Since you're on Windows, there's always:

      Start->Search->For Files or Folders
      Pick the Drive, Go to Size, then pick a limit, say 'at least 10000 Kb', then 'Find'
      Sort the results by clicking on 'Size'

      cheers
      zeitgheist

Re: finding top 10 largest files
by pelagic (Priest) on Feb 03, 2004 at 12:07 UTC
    And yet another possibility I found in my code stash.
    It lists the top n file sizes, with all files having those sizes:
    #!/usr/bin/perl
    use strict;

    my ($dir_init, $top_n) = @ARGV;
    my (@ls, @dirs, %top, $size, $obsolete_size);
    my $sep      = "~" x 80;
    my $form     = "%20s %s\n";
    my $min_size = 0;
    my $files    = 0;
    push(@dirs, $dir_init);
    call it like:
    perl fszTop10.pl /u2/integ 10
    and the output looks like:
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Total number of files examined: 2331
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    List of top 10 sized files within /u2/integ/t096351 :
              23,836,788  /u2/integ/t096351/tmp/PDF/NV61DE.PDF
               5,731,355  /u2/integ/t096351/test/gsp_toCMT.tar.gz
                 180,042  /u2/integ/t096351/jcs553_functions_ALL.sql.v1
                          /u2/integ/t096351/jcs553_functions_ALL.sql.v2
                          /u2/integ/t096351/jcs553_functions_ALL.sql.copy
                 105,520  /u2/integ/t096351/ccm_ui.log
                  56,216  /u2/integ/t096351/ccm_eng.log
                  24,072  /u2/integ/t096351/dead.letter
                  14,506  /u2/integ/t096351/.ccm.ini
                  11,492  /u2/integ/t096351/.sh_history
                   5,111  /u2/integ/t096351/.dtprofile
                   2,561  /u2/integ/t096351/.Xauthority
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    pelagic

    Update
    I had to fix a bug I found, as I was using the script again.
Re: finding top 10 largest files
by ambrus (Abbot) on Feb 03, 2004 at 09:13 UTC

    I think a good way to do this is to write a Perl script that selects the files larger than, say, 20M, and does not bother with sorting at all. Then you can easily sort|head the resulting few lines.
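
    A minimal sketch of that approach (the script name bigfiles.pl is just an example): print "size<TAB>name" for every file over 20M and let the shell do the rest, e.g. perl bigfiles.pl / | sort -rn | head -10

    #!/usr/bin/perl
    # Print "size\tname" for every file larger than 20 MB;
    # sorting and truncation are left to sort(1) and head(1).
    use strict;
    use warnings;
    use File::Find;

    find(sub {
        return unless -f;
        my $size = -s _;
        print "$size\t$File::Find::name\n" if $size > 20 * 1024 * 1024;
    }, shift(@ARGV) || '.');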

    By the way, I have some old code doing something similar. I have a file with a lot of records, and I have to select the 64 "best" records. Here's the code, with omissions (##)

    ## ... HEADER STUFF ...
    @sellpt = (0) x 64;
    @sellx  = @sellpt;
    @selly  = @sellpt;
    ## ... MORE HEADER STUFF, OPENING HANDLE "VAR" ...
    while (<VAR>) {
        my ($p, $x, $y);
        ## FILLING ($x,$y,$p) WITH DATA FROM RECORDS, MUST GET THE ONES WITH LARGEST $p
        $sellpt[-1] < $p and do {
            my ($u, $v) = (0, 0 + @sellpt);
            defined($sellpt[int(($u + $v) / 2)]) or die qq([D30] $u $v (@sellpt));
            while ($u < $v) {
                if ($p < $sellpt[int(($u + $v) / 2)]) { $u = int(($u + $v) / 2) + 1 }
                else                                  { $v = int(($u + $v) / 2) }
            }
            @sellpt = (@sellpt[0 .. $u - 1], $p, @sellpt[$u .. @sellpt - 2]);
            @sellx  = (@sellx[0 .. $u - 1],  $x, @sellx[$u .. @sellx - 2]);
            @selly  = (@selly[0 .. $u - 1],  $y, @selly[$u .. @selly - 2]);
        };
        ## ... SOME MORE STUFF ...
    }
    ## ... AND SOME MORE, WRITING OUT @sellx,@selly IN SUITABLE FORMAT ...
Re: finding top 10 largest files
by rje (Deacon) on Feb 03, 2004 at 15:08 UTC
    Rather than holding every file in the hash, could you instead only keep the top 10 entries with each iteration? I suppose that might be slow, eh?

    Pseudoperl follows... caveat coder.

    my @topTen = ();
    while ( moreFiles() ) {
        push @topTen, some_munged_key_using_file_size_and_name();
        @topTen = (reverse sort @topTen)[0..9];
    }


    Sorry if I'm missing something. My brain's not working well today...
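
    For what it's worth, a concrete (hypothetical) version of that pseudocode, using File::Find and a zero-padded size as the "munged key" so a plain string sort orders by size:

    use strict;
    use warnings;
    use File::Find;

    my @topTen;
    find(sub {
        return unless -f;
        # zero-padded size first, so lexical sort == numeric sort
        push @topTen, sprintf("%015d %s", -s _, $File::Find::name);
        @topTen = (reverse sort @topTen)[0 .. 9] if @topTen > 10;
    }, '.');
    print "$_\n" for reverse sort @topTen;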
      ... that's what my solution (see 326175) does ...
      Only keep the information for the currently known biggest files.
      pelagic
        Ugh! Yep, you're right, my bad... only, the code is too long for my liking. Can it be shorter? Could I use the operating system to pare down the code a bit by pre-gathering a list of all files? Something like

        my @files = (reverse sort `dir /A-D /S`)[0..9];
        In DOS, or ... (big pause)...
        Aw shoot, in DOS the solution isn't even perl:
        dir /A-D /O-S /S
        That recursively lists all files from the current working directory on, sorted by largest file first. I imagine there's a combination of opts to ls that will do the same thing, eh? Maybe
        ls -alSR              (which doesn't sort across directories)
        ls -alR | sort -k 5   (maybe?)

        (Except those don't suppress the directory names. Hmmm.)

        Sorry, I meant to write perl, but it came out rather OS-specific... but it's a lot smaller than the perl solution. Is that Appeal To False Laziness?
Re: finding top 10 largest files
by ambrus (Abbot) on Mar 05, 2004 at 14:33 UTC
    Use
    find / -printf "%s\t%p\n" | perl heap
    where heap is the perl program listed at the node Re: Re: Re: Re: Sorting values of nested hash refs.

    The find prints all files together with sizes. The script uses a binary heap to select the 32 largest ones from it.

    For more information, read its thread which is about a similar problem.

    Update: on my system, this finds /proc/kcore as one of the largest files. You'll have to find out how not to include the /proc filesystem in the search. Also, this will list hard links multiple times.
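
    For reference, a minimal sketch of the same idea (mine, not the code from the linked node): read "size<TAB>name" lines on standard input and keep the N largest in a binary min-heap, so memory use stays bounded no matter how many files find reports.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $N = 32;
    my @heap;    # each element: [ size, name ]; the root is the smallest kept so far

    sub sift_up {
        my $i = $#heap;
        while ($i > 0) {
            my $parent = int(($i - 1) / 2);
            last if $heap[$parent][0] <= $heap[$i][0];
            @heap[$parent, $i] = @heap[$i, $parent];
            $i = $parent;
        }
    }

    sub sift_down {
        my $i = 0;
        while (1) {
            my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
            my $smallest = $i;
            $smallest = $l if $l <= $#heap && $heap[$l][0] < $heap[$smallest][0];
            $smallest = $r if $r <= $#heap && $heap[$r][0] < $heap[$smallest][0];
            last if $smallest == $i;
            @heap[$smallest, $i] = @heap[$i, $smallest];
            $i = $smallest;
        }
    }

    while (<>) {
        chomp;
        my ($size, $name) = split /\t/, $_, 2;
        next unless defined $name;
        if (@heap < $N) {
            push @heap, [ $size, $name ];
            sift_up();
        }
        elsif ($size > $heap[0][0]) {
            $heap[0] = [ $size, $name ];   # evict the current minimum
            sift_down();
        }
    }

    # largest first
    print "$_->[0]\t$_->[1]\n" for sort { $b->[0] <=> $a->[0] } @heap;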
