ctrevgo_learn_perl has asked for the wisdom of the Perl Monks concerning the following question:

I need a little assistance with rewriting this code. I am new to Perl and pieced this together, but I need it to run a lot faster: for 258 records it took 67 minutes, and I need to do at least a few million directory lookups using the code below, which will take forever at the current speed. I got it to read a file, look up each mailbox, and write to a file when done, but the speed (and maybe the structure) may be wrong. Can anyone help with a conversion or pointers? Still learning...

#!/usr/local/bin/perl
use strict;
use warnings;
use File::Copy;
use File::Slurp qw(read_dir);
use File::Find;
use Data::Dumper;

# Dry run. 0 will execute
use constant DRY => qq(1);
# Hardcoding the file path as of now
use constant FILE => qq(test.txt);
# Successful operation - stats file
use constant STATS_FILE => qq(mbox_stats.txt);
# Failed mailbox operations
use constant MBOX_UNKNOWN => qq(mbox_unknown.txt);
# TOP Level volume directory
use constant VOL => qq(/home/folder/);
use constant DEBUG => 1;

# Total space saving if removed
my $tsavings = 0;

# Assuming \n as the end of line
local $/ = "\n";

# Read the volume directory and load all the subdirectories under it
opendir(DIR, VOL) or die "Unable to open VOLUME: $!";
my @dh = grep { !/^\.{1,2}$/ } readdir(DIR);
closedir(DIR);

# Open the file containing the id and mbox for reading
open my $file, '<', FILE;

# Open the file to log unknown mailboxes
open MB_UNK, ">", MBOX_UNKNOWN or die $!;

# Open the file to log stats
open STATS, ">", STATS_FILE or die $!;

# Loop through each line of the file. We need to read each directory.
while (<$file>) {
    chomp;    # Strip the line break character at the end of the line

    # Strip the mbox and id, assuming the file contains mbox:id
    my ($mbox, $id) = split(',', $_);
    if (!$mbox) { print "No mbox for $id\n"; next; }
    if (!$id)   { print "No id\n";           next; }

    print("Processing mailbox search for id $id \n") if DEBUG;

    # Parse the mbox hash to retrieve the two directory names
    my ($u, $v, $mbox_path);
    if ($mbox =~ /\d*(\d{2})(\d{2})$/) {
        $u = reverse $1;
        $v = reverse $2;
    }

    # Build the volume path, so that now we have a path something like this:
    # VOL/v*/$mb - we still have to determine v* by reading the directory of VOL
    my $mb  = qq($v/$u/$mbox);
    my @dir = map { VOL . '/' . $_ . "/$mb" } @dh;

    foreach (@dir) {
        if (-d $_) {
            $mbox_path = $_;
            my $dirsize = 0;    # Setting size to 0 each loop
            find(sub { $dirsize += -f $_ ? -s _ : 0 }, "$mbox_path");
            $tsavings += $dirsize;    # Adding dirsize to total
            # print "Path to mail directory $mbox_path . Directory size = $dirsize bytes\n";
            # (DRY ? print "Doing dry run only\n" : movemailbox($mbox_path));
            print STATS "$mbox:$id:$mbox_path:$dirsize\n";
        }
    }
    if (!$mbox_path) {
        # print "No Mailbox found for $id \n";
        print MB_UNK "$mbox:$id\n";
    }
}

close(MB_UNK);
close(STATS);
close($file);

print "Total size of directory to be freed $tsavings\n";

sub movemailbox {
    my $m = shift;
    move($m, $m . "-trash") or die "move $m failed: $!";
    return;
}

__END__

Replies are listed 'Best First'.
Re: Increase script speed
by CountZero (Bishop) on Jun 06, 2015 at 08:28 UTC
    So you have a 10+ million lines long file test.txt with mailbox IDs that need to be mapped into your directory structure in order to delete an equally (or even larger) huge number of files.

    Assuming your test.txt contains no duplicates (removing duplicates would be a possible first optimization step), it also means that the number of files in your mail file directory must be even bigger and walking these directories multiple times will indeed take forever as those are timewise "expensive" operations.

    I'd suggest walking this directory structure once only and putting the information found into a database, with the full path as the key and the ID and size as non-key fields, with the ID field indexed.

    Then building your "files_to_delete" list becomes a simple SQL exercise and as databases are optimized to handle large datasets, you will probably see a significant speed-up.
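    A minimal sketch of that indexing pass, assuming DBD::SQLite is available; the database filename, table name, and the assumption that mailbox directories sit at volume/NN/NN/id under /home/folder are all illustrative, not part of the original post:

        #!/usr/local/bin/perl
        use strict;
        use warnings;
        use DBI;
        use File::Find;

        my $dbh = DBI->connect('dbi:SQLite:dbname=mboxes.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS mbox (id TEXT, path TEXT PRIMARY KEY, size INTEGER)');
        $dbh->do('CREATE INDEX IF NOT EXISTS idx_mbox_id ON mbox (id)');

        # One full walk of the tree: total up file sizes per mailbox directory.
        my %size_of;    # mailbox path => total bytes
        find({ no_chdir => 1,
               wanted   => sub {
                   return unless -f;
                   # Paths assumed to look like /home/folder/volumeN/98/76/123456789/...
                   if ($File::Find::name =~ m{^(.*/\d{2}/\d{2}/(\d+))/}) {
                       $size_of{$1} += -s _;
                   }
               } }, '/home/folder');

        # Bulk-load the results; a single commit keeps SQLite fast here.
        my $ins = $dbh->prepare('INSERT OR REPLACE INTO mbox (id, path, size) VALUES (?, ?, ?)');
        while (my ($path, $size) = each %size_of) {
            my ($id) = $path =~ m{/(\d+)$};
            $ins->execute($id, $path, $size);
        }
        $dbh->commit;

    After this one traversal, each line of test.txt costs only an indexed SELECT ... WHERE id = ? instead of a File::Find run.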

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Increase script speed
by Athanasius (Archbishop) on Jun 06, 2015 at 04:07 UTC

    Hello ctrevgo_learn_perl, and welcome to the Monastery!

    As graff says, you need to supply more information. For example, what is the format of a typical line read from the file “test.txt”? There seems to be some confusion:

    # Strip the mbox and id, assuming the file contains mbox:id
    my ($mbox, $id) = split(',', $_);

    Aside from the fact that there is no stripping happening here, the comment specifies that the initial two items will be separated by a colon, but the code assumes a comma. If the comment is correct but the data line happens to contain a comma somewhere, this will “work” without warnings but will produce spurious results.

    I note also that the script has:

    use File::Slurp qw(read_dir);
    ...
    use Data::Dumper;

    but neither module appears to be actually used?

    And an observation on style. The use constant declarations can be written more clearly, like this:

    use constant {
        DRY          => 1,                    # Dry run. 0 will execute
        FILE         => 'test.txt',           # Hardcoding the file path as of now
        STATS_FILE   => 'mbox_stats.txt',     # Successful operation - stats file
        MBOX_UNKNOWN => 'mbox_unknown.txt',   # Failed mailbox operations
        VOL          => '/home/folder/',      # TOP Level volume directory
        DEBUG        => 1,
    };

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Increase script speed
by graff (Chancellor) on Jun 06, 2015 at 01:06 UTC
    It's not clear what you're trying to accomplish - a brief explanation of "initial state" and "desired result state" would be helpful (e.g. "start with a directory containing, say, 20000 mbox files" and "when done, there should be a (set of) file(s) with 20000 lines with the following info about the mboxes").

    You seem to be running "find" on each iteration over the lines of your one input file. Are you sure you aren't repeating work? When directory trees are really large, full traversals using File::Find can be really time-consuming. Get a clear idea of what tree(s) you need to traverse, and make sure you traverse each one only once (storing stuff in a hash or array as needed).
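    One way to sketch that single traversal, assuming (as elsewhere in the thread) that mailbox directories sit at volume/NN/NN/id under /home/folder; the %found hash and file names are illustrative:

        use strict;
        use warnings;
        use File::Find;

        # One pass over the whole tree: record each mailbox's path and total size.
        my %found;    # mailbox id => [ path, size ]
        find({ no_chdir => 1,
               wanted   => sub {
                   return unless -f;
                   if ($File::Find::name =~ m{^(.*/\d{2}/\d{2}/(\d+))/}) {
                       $found{$2}[0] = $1;
                       $found{$2}[1] += -s _;
                   }
               } }, '/home/folder');

        # Each line of test.txt is now a constant-time hash lookup, not a find().
        open my $in, '<', 'test.txt' or die "Cannot open test.txt: $!";
        while (<$in>) {
            chomp;
            my ($mbox, $id) = split /,/;
            if (my $rec = $found{$mbox}) {
                print "$mbox:$id:$rec->[0]:$rec->[1]\n";
            }
        }

    With 10+ million mailboxes the hash will be large; if it doesn't fit in memory, that is where the database approach mentioned elsewhere in the thread comes in.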

    Apart from that, the timing issue would seem to depend mainly on the size of your input file, and the size of the directory trees you're traversing. Can you give us some stats on that?

      Purpose: remove old account directories to save space.

      Initial State

      Here is what test.txt contains (mbox,userid only); the file is about 2 GB and contains 10+ million lines.

      Expected results.

      read test.txt into script line by line

      check each volume directory on the server looking for the mailbox (mailboxes are hashed in reverse: an id with mailbox #123456789 should be found in a directory like /home/folder/volume#/98/76/123456789). There are multiple volumes.

      once the mailbox is found build a path to mailbox for later removal

      if the mbox is found, move the mailbox directory from /home/folder/volume#/98/76/123456789 to /home/folder/volume#/98/76/123456789-trash (for later deletion by another script, if the space saving is worth it)

      write the actions to a file called mbox_stats.txt with the following information

      id,full mailbox path ( for example /home/folder/volume#/98/76/123456789),directory size

      add up the size of each directory recorded in mbox_stats.txt to determine the total savings if removed
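      Since the hashing scheme above is deterministic, the two path components can be computed directly from the mailbox number rather than searched for; a small sketch (the helper name mbox_subpath is made up for illustration):

          use strict;
          use warnings;

          # Derive the reverse-hashed path components from the last four digits
          # of the mailbox number: 123456789 -> 98/76/123456789
          sub mbox_subpath {
              my ($mbox) = @_;
              my ($u, $v) = $mbox =~ /(\d{2})(\d{2})$/ or return;
              return join '/', scalar reverse($v), scalar reverse($u), $mbox;
          }

          print mbox_subpath('123456789'), "\n";   # prints 98/76/123456789

      Only the volume# part then remains to be determined, by checking -d on each candidate volume.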

        As mentioned by Count Zero below, the best bet for the quantities involved will be some sort of indexed database for storing the info you want about each mbox path. Something like sqlite should do reasonably well, and will be easy to put in place.

        As for traversing the directory tree to get information, you might want to have a look at a script that I posted here a while back: Get useful info about a directory tree. It was designed to do the fastest possible traversal of a directory, and produce a one-line summary for every directory in the tree. You could use it as-is to get summaries for (particular portions or volumes of) your system (the man page is included in the script), or you can adapt the approach used there to your own needs.