ctrevgo_learn_perl has asked for the wisdom of the Perl Monks concerning the following question:

I need a little assistance with rewriting this code. I am new to Perl and pieced this together, but I need it to run a lot faster: for 258 records it took 67 minutes, and I need to do at least a few million directory lookups using the code below, which will take forever at the current speed. I got it to read a file, look up each mailbox, and write to a file when done, but the speed (and maybe the structure) may be wrong. Can anyone help with a conversion or pointers? Still learning...

#!/usr/local/bin/perl
use strict;
use warnings;
use File::Copy;
use File::Slurp qw(read_dir);
use File::Find;
use Data::Dumper;

# Dry run. 0 will execute
use constant DRY => qq(1);
# Hardcoding the file path as of now
use constant FILE => qq(test.txt);
# Successful operation - stats file
use constant STATS_FILE => qq(mbox_stats.txt);
# Failed mailbox operations
use constant MBOX_UNKNOWN => qq(mbox_unknown.txt);
# TOP Level volume directory
use constant VOL => qq(/home/folder/);
use constant DEBUG => 1;

# Total space saving if removed
my $tsavings = 0;

# Assuming \n as the end of line
local $/ = "\n";

# Read the volume directory and load all the subdirectories under it
opendir(DIR, VOL) or die "Unable to open VOLUME: $!";
my @dh = grep { !/^\.{1,2}$/ } readdir(DIR);
closedir(DIR);

# Open the file containing the id and mbox for reading
open my $file, '<', FILE;

# Open the file to log unknown mailboxes
open MB_UNK, ">", MBOX_UNKNOWN or die $!;

# Open the file to log stats
open STATS, ">", STATS_FILE or die $!;

# Loop through each line of the file. We need to read each directory.
while (<$file>) {
    chomp;    # Strip the line break character at the end of the line

    # Strip the mbox and id, assuming the file contains mbox:id
    my ($mbox, $id) = split(',', $_);
    if (!$mbox) { print "No mbox for $id\n"; next; }
    if (!$id)   { print "No id\n";           next; }

    print("Processing mailbox search for id $id \n") if DEBUG;

    # Parse the mbox hash to retrieve the two directory names
    my ($u, $v, $mbox_path);
    if ($mbox =~ /\d*(\d{2})(\d{2})$/) {
        $u = reverse $1;
        $v = reverse $2;
    }

    # Build the volume path, so that now we have a path something like this:
    # VOL/v*/$mb - we still have to determine v* by reading the directory of VOL
    my $mb  = qq($v/$u/$mbox);
    my @dir = map { VOL . '/' . $_ . "/$mb" } @dh;

    foreach (@dir) {
        if (-d $_) {
            $mbox_path = $_;
            my $dirsize = 0;    # Setting size to 0 each loop
            find(sub { $dirsize += -f $_ ? -s _ : 0 }, "$mbox_path");
            $tsavings += $dirsize;    # Adding dirsize to total
            # print "Path to mail directory $mbox_path . Directory size = $dirsize bytes\n";
            # (DRY ? print "Doing dry run only\n" : movemailbox($mbox_path));
            print STATS "$mbox:$id:$mbox_path:$dirsize\n";
        }
    }
    if (!$mbox_path) {
        # print "No Mailbox found for $id \n";
        print MB_UNK "$mbox:$id\n";
    }
}

close(MB_UNK);
close(STATS);
close($file);

print "Total size of directory to be freed $tsavings\n";

sub movemailbox {
    my $m = shift;
    move($m, $m . "-trash") or die "move $m failed: $!";
    return;
}

__END__

Replies are listed 'Best First'.
Re: Increase script speed
by CountZero (Bishop) on Jun 06, 2015 at 08:28 UTC
    So you have a 10+ million lines long file test.txt with mailbox IDs that need to be mapped into your directory structure in order to delete an equally (or even larger) huge number of files.

    Assuming your test.txt contains no duplicates (removing duplicates would be a possible first optimization step), it also means that the number of files in your mail file directory must be even bigger and walking these directories multiple times will indeed take forever as those are timewise "expensive" operations.

    I'd suggest walking this directory structure once only and putting the information found into a database, with the full path as the key and the ID and size as non-key fields, with the ID field indexed.

    Then building your "files_to_delete" list becomes a simple SQL exercise and as databases are optimized to handle large datasets, you will probably see a significant speed-up.
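    A minimal sketch of that indexing pass, assuming DBD::SQLite is available; the database filename, table name, and the assumption that mailbox directories sit at volume/NN/NN/id under /home/folder are all illustrative, not part of the original post:

        #!/usr/local/bin/perl
        use strict;
        use warnings;
        use DBI;
        use File::Find;

        my $dbh = DBI->connect('dbi:SQLite:dbname=mboxes.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS mbox (id TEXT, path TEXT PRIMARY KEY, size INTEGER)');
        $dbh->do('CREATE INDEX IF NOT EXISTS idx_mbox_id ON mbox (id)');

        # One full walk of the tree: total up file sizes per mailbox directory.
        my %size_of;    # mailbox path => total bytes
        find({ no_chdir => 1,
               wanted   => sub {
                   return unless -f;
                   # Paths assumed to look like /home/folder/volumeN/98/76/123456789/...
                   if ($File::Find::name =~ m{^(.*/\d{2}/\d{2}/(\d+))/}) {
                       $size_of{$1} += -s _;
                   }
               } }, '/home/folder');

        # Bulk-load the results; a single commit keeps SQLite fast here.
        my $ins = $dbh->prepare('INSERT OR REPLACE INTO mbox (id, path, size) VALUES (?, ?, ?)');
        while (my ($path, $size) = each %size_of) {
            my ($id) = $path =~ m{/(\d+)$};
            $ins->execute($id, $path, $size);
        }
        $dbh->commit;

    After this one traversal, each line of test.txt costs only an indexed SELECT ... WHERE id = ? instead of a File::Find run.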

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Increase script speed
by Athanasius (Archbishop) on Jun 06, 2015 at 04:07 UTC

    Hello ctrevgo_learn_perl, and welcome to the Monastery!

    As graff says, you need to supply more information. For example, what is the format of a typical line read from the file “test.txt”? There seems to be some confusion:

    # Strip the mbox and id, assuming the file contains mbox:id
    my ($mbox, $id) = split(',', $_);

    Aside from the fact that there is no stripping happening here, the comment specifies that the initial two items will be separated by a colon, but the code assumes a comma. If the comment is correct but the data line happens to contain a comma somewhere, this will “work” without warnings but will produce spurious results.

    I note also that the script has:

    use File::Slurp qw(read_dir);
    ...
    use Data::Dumper;

    but neither module appears to be actually used?

    And an observation on style. The use constant declarations can be written more clearly, like this:

    use constant {
        DRY          => 1,                    # Dry run. 0 will execute
        FILE         => 'test.txt',           # Hardcoding the file path as of now
        STATS_FILE   => 'mbox_stats.txt',     # Successful operation - stats file
        MBOX_UNKNOWN => 'mbox_unknown.txt',   # Failed mailbox operations
        VOL          => '/home/folder/',      # TOP Level volume directory
        DEBUG        => 1,
    };

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Increase script speed
by graff (Chancellor) on Jun 06, 2015 at 01:06 UTC
    It's not clear what you're trying to accomplish - a brief explanation of "initial state" and "desired result state" would be helpful (e.g. "start with a directory containing, say, 20000 mbox files" and "when done, there should be a (set of) file(s) with 20000 lines with the following info about the mboxes").

    You seem to be running "find" on each iteration over the lines of your one input file. Are you sure you aren't repeating work? When directory trees are really large, full traversals using File::Find can be really time-consuming. Get a clear idea of what tree(s) you need to traverse, and make sure you traverse each one only once (storing stuff in a hash or array as needed).
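    One way to sketch that single traversal, assuming (as elsewhere in the thread) that mailbox directories sit at volume/NN/NN/id under /home/folder; the %found hash and file names are illustrative:

        use strict;
        use warnings;
        use File::Find;

        # One pass over the whole tree: record each mailbox's path and total size.
        my %found;    # mailbox id => [ path, size ]
        find({ no_chdir => 1,
               wanted   => sub {
                   return unless -f;
                   if ($File::Find::name =~ m{^(.*/\d{2}/\d{2}/(\d+))/}) {
                       $found{$2}[0] = $1;
                       $found{$2}[1] += -s _;
                   }
               } }, '/home/folder');

        # Each line of test.txt is now a constant-time hash lookup, not a find().
        open my $in, '<', 'test.txt' or die "Cannot open test.txt: $!";
        while (<$in>) {
            chomp;
            my ($mbox, $id) = split /,/;
            if (my $rec = $found{$mbox}) {
                print "$mbox:$id:$rec->[0]:$rec->[1]\n";
            }
        }

    With 10+ million mailboxes the hash will be large; if it doesn't fit in memory, that is where the database approach mentioned elsewhere in the thread comes in.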

    Apart from that, the timing issue would seem to depend mainly on the size of your input file, and the size of the directory trees you're traversing. Can you give us some stats on that?

      Purpose: remove old account directories to save space.

      Initial State

      Here is what test.txt contains (mbox,userid only); the file is about 2 GB and contains 10+ million lines.

      Expected results.

      read test.txt into script line by line

      check each volume directory on the server looking for the mailbox (mailboxes are hashed in reverse: an id with mailbox #123456789 should be found in a directory like /home/folder/volume#/98/76/123456789). There are multiple volumes.

      once the mailbox is found build a path to mailbox for later removal

      if the mbox is found, move the mailbox directory from /home/folder/volume#/98/76/123456789 to /home/folder/volume#/98/76/123456789-trash (for later deletion by another script, if the space saving is worth it)

      write the actions to a file called mbox_stats.txt with the following information

      id,full mailbox path ( for example /home/folder/volume#/98/76/123456789),directory size

      add up the size of each directory recorded in mbox_stats.txt to determine the total savings if removed
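      Since the hashing scheme above is deterministic, the two path components can be computed directly from the mailbox number rather than searched for; a small sketch (the helper name mbox_subpath is made up for illustration):

          use strict;
          use warnings;

          # Derive the reverse-hashed path components from the last four digits
          # of the mailbox number: 123456789 -> 98/76/123456789
          sub mbox_subpath {
              my ($mbox) = @_;
              my ($u, $v) = $mbox =~ /(\d{2})(\d{2})$/ or return;
              return join '/', scalar reverse($v), scalar reverse($u), $mbox;
          }

          print mbox_subpath('123456789'), "\n";   # prints 98/76/123456789

      Only the volume# part then remains to be determined, by checking -d on each candidate volume.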

        As mentioned by Count Zero below, the best bet for the quantities involved will be some sort of indexed database for storing the info you want about each mbox path. Something like sqlite should do reasonably well, and will be easy to put in place.

        As for traversing the directory tree to get information, you might want to have a look at a script that I posted here a while back: Get useful info about a directory tree. It was designed to do the fastest possible traversal of a directory, and produce a one-line summary for every directory in the tree. You could use it as-is to get summaries for (particular portions or volumes of) your system (the man page is included in the script), or you can adapt the approach used there to your own needs.