msensay has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a horribly slow solution to a terribly easy problem. Then I wrote a different solution which ended up even slower. Yes, I'm a neophyte. And yes, this is Windows, but I think I can do better with the help of some mad monks, and I don't want to let Perl down by living with my original poor attempts. Not looking for anyone to code this for me, just to suggest the best approach. Anyway...

Problem:
Given a file containing a list of potential files, one per line: some on the same server, some in the same directory, some that really exist, some that may not. Example input:

\\serverA\directoryA\FileA
\\serverA\directoryA\FileB
\\serverA\directoryB\subdirectoryA\FileA
\\serverA\directoryB\subdirectoryA\FileB
\\serverB\directoryA\FileA
\\doesntexist\doesntexist\doesntexist


The Perl script should validate whether or not each file exists. If it does, it should write the size and date-modified attributes for the file, along with the file name; the three pieces of information should be pipe-delimited. If it does not exist, list the file name anyway, but put "notfound" for the two attributes. So for the input above, the output should look something like this:

\\serverA\directoryA\FileA|25521|10/19/2011 01:32 PM
\\serverA\directoryA\FileB|14288|07/28/2011 09:31 AM
\\serverA\directoryB\subdirectoryA\FileA|13384|07/28/2011 09:30 AM
\\serverA\directoryB\subdirectoryA\FileB|15667|08/05/2011 08:53 AM
\\serverB\directoryA\FileA|15274|08/11/2011 03:15 PM
\\doesntexist\doesntexist\doesntexist|notfound|notfound


Here is my first attempt. Running it on an input file of 20,000 entries takes just over 3 minutes, which I think is piss-poor slow:

use POSIX;

my $timestamp; # eg 10/24/2011 5:01:22 PM
my $size;      # size of file will be bytes
my @inp;       # array for input files
my $outfile;   # second argument the output
my $statfile;  # third argument the status

# open status and report in progress. Will abort if status file already here
if (-e $ARGV[2]) { die "status file already exists!"; }
open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
print STAT "inprogress";
close(STAT);

# quit unless we have the correct number of command-line args
my $num_args = $#ARGV + 1;
if ($num_args != 3) {
    print "\nUsage: perl sizeDateValidator.pl <listOfFilesToValidate> <outputFile> <statusFile>\n";
    open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
    print STAT "error1";
    close(STAT);
    exit;
}

# grab files to validate from first argument which should be a readable file
if (-e $ARGV[0]) {
    open(FILE, "<", $ARGV[0]) || die "Can't open file $ARGV[0]";
    @inp = <FILE>;
    chomp @inp;
    close FILE;
}
else {
    open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
    print STAT "error2";
    close(STAT);
    exit;
}

# check to make sure output file doesn't already exist like from previous run.
# don't want to overwrite anything I shouldn't
$outfile = $ARGV[1];
if (-e $outfile) {
    open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
    print STAT "error3";
    close(STAT);
    exit;
}

# for each file in list, get size and timestamp
# if the element is not a file, report this
# write this to output file passed in as argument.
open(FILE, ">", $outfile) || die "Can't open file $outfile";
foreach my $files (@inp) {
    $files =~ s/"//g;
    if (-f $files) {
        $timestamp = POSIX::strftime("%m/%d/%Y %I:%M %p", localtime((stat $files)[9])); # or change localtime to gmtime
        $size = (stat $files)[7];
        print FILE "$files|$size|$timestamp\n";
    }
    else {
        print FILE "$files|notfound|notfound\n";
    }
}
close(FILE);

open(STAT, ">$ARGV[2]") || die "can't create statfile $ARGV[2]";
print STAT "success";
close(STAT);


Here is the second way I tried, which took over 10 minutes to run, and that's just embarrassing:
my $num_args;    # for inp arg validation
my $statfile;    # file for reporting progress of execution; third arg
my $inp;         # input file; first arg
my %tempdir;     # hash for storing unique file paths
my $foo;         # %tempdir keys
my $dircommand2; # dir /-C
my @rawinp2;     # array to hold results of dir command
my $direntry2;   # for individual @rawinp2 processing
my $stripfile;   # stripped down entry
my $timestamp2;  # catch regex results
my $bytes2;      # catch regex results
my $file_name2;  # catch regex results
my @array_of_filestats; # holding output of stats e.g. \\swstgsqldb01\D$\mktemp\AnaWeekly_20110817.docx|13506|08/17/2011 02:42 PM
my @inpobjects;  # holding input objects to compare to
my $outfile;     # output file; second arg
my $trimmed;     # elements from @array_of_filestats
my $count;       # counter to check for non-existing targets
my @pieces;      # split results file entries

# Will abort if status file already here
if (-e $ARGV[2]) { die "status file already exists!"; }

# open status and report in progress
$statfile = $ARGV[2];
open(STAT, ">", $statfile) || die "can't create statfile $statfile";
print STAT "inprogress";
close(STAT);

# quit unless we have the correct number of command-line args
$num_args = $#ARGV + 1;
if ($num_args != 3) {
    print "\nUsage: perl sizeDateValidator.pl <listOfFilesToValidate> <outputFile> <statusFile>\n";
    open(STAT, ">", $statfile) || die "can't create statfile $statfile";
    print STAT "error1";
    close(STAT);
    exit;
}

# Get every unique path from input file
$inp = $ARGV[0];
open(INP, $inp) || die "Can't open: $inp";
while (<INP>) {
    chomp;
    ($stripfile) = /(.*\\).*$/;
    $tempdir{$stripfile}++;
}
close(INP);

# run our dir command on each of the individual keys and store in array
foreach $foo (keys %tempdir) {
    $dircommand2 = "dir /-C ";
    if (-e $foo) {
        $dircommand2 = $dircommand2 . $foo;
        @rawinp2 = `$dircommand2`;
        foreach $direntry2 (@rawinp2) {
            next if $direntry2 =~ /<DIR>/;
            # if it doesn't start MM/DD/YYYY then skip
            next if !(($timestamp2, $bytes2, $file_name2) =
                (($direntry2 =~ /^(\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2}\s[AP][M])\s+(\d*)\s+(.*)$/)[0, 1, 2]));
            $file_name2 = $foo . $file_name2;
            $timestamp2 =~ s/\s{2}/ /;
            push(@array_of_filestats, "$file_name2|$bytes2|$timestamp2\n");
        }
    }
}

# get array of input objects
open(ARGINP, $inp) || die "Can't open: $inp";
while (<ARGINP>) {
    chomp;
    push(@inpobjects, $_);
}
close(ARGINP);

# check to make sure output file doesn't already exist like from previous run.
# don't want to overwrite anything I shouldn't
$outfile = $ARGV[1];
if (-e $outfile) {
    open(STAT, ">", $statfile) || die "can't create statfile $statfile";
    print STAT "error3";
    close(STAT);
    exit;
}

# if matches are made, print stats to output file; if not record "notfound"
open(FILE, ">", $outfile) || die "Can't open file $outfile";
for (my $i = 0; $i < @inpobjects; $i++) {
    chomp $inpobjects[$i];
    $count = 0;
    foreach $trimmed (@array_of_filestats) {
        $count++;
        chomp $trimmed;
        @pieces = split(/\|/, $trimmed);
        if ($inpobjects[$i] eq $pieces[0]) {
            print FILE "$trimmed\n";
            last;
        }
        else {
            if (($#array_of_filestats + 1) == $count) {
                print FILE "$inpobjects[$i]|notfound|notfound\n";
            }
        }
    }
}
close(FILE);

open(STAT, ">", $statfile) || die "can't create statfile $statfile";
print STAT "success";
close(STAT);

Are there any monks out there who can suggest a better, faster approach?
Many thanks in advance!

Re: sizeDateValidator.pl is horribly slow
by graff (Chancellor) on Nov 06, 2011 at 03:06 UTC
    In the first try, you are calling stat numerous times on each file, and that's wasting some amount of time. Call stat once per file, and save all its information for your various actions.

    As for how long it should take to scan 20,000 files, what sort of time span are you expecting, and what sort of evidence (what sorts of processes) leads you to expect that?

    There are some other trivial oddities in your first script -- I expect they don't affect the timing much (if at all), but they detract from the overall coherence of the code. Oh, and consistent indenting is useful...

    Here's how I would do it:

    use POSIX;

    # Get argv handling out of the way first...
    if ( @ARGV != 3 or ! -f $ARGV[0] ) {
        die "Usage: perl $0 FileListToValidate OutFile StatusFile\n";
    }

    # Next take care of all the i/o file handling...
    if ( -e $ARGV[2] ) {
        die "$ARGV[2] already exists -- I will not overwrite it\n";
    }
    open( STAT, '>', $ARGV[2] ) or die "Can't write status info to $ARGV[2]: $!\n";
    if ( ! open( OUT, '>', $ARGV[1] )) {
        print STAT "error: can't write output to $ARGV[1]: $!\n";
        exit;
    }
    if ( ! open( IN, '<', $ARGV[0] )) {
        print STAT "error: can't open $ARGV[0] for input: $!\n";
        exit;
    }

    # Now get to work...
    my @inpList = <IN>;
    chomp @inpList;
    for ( @inpList ) {        # let $_ hold the file name
        tr/"//d;              # get rid of double-quotes
        my @stats = stat;     # do this just once (works on $_ by default)
        if ( ! @stats ) {     # empty list means stat failed
            print OUT join( '|', $_, ( 'notfound' ) x 2 ), "\n";
        }
        else {
            print OUT join( '|', $_, $stats[7],
                POSIX::strftime( "%m/%d/%Y %I:%M %p", localtime( $stats[9] ))), "\n";
        }
    }
    print STAT "success\n";
    That eliminates a lot of useless variable creations and value assignments, but I think reducing the multiple stat calls per file to just one will be the thing that has a noticeable effect.

    Personally, I'd go with just two command line args -- printing error messages (and even a "success" message) to stderr should suffice, so you just need the input list and the name to use for the output list (and you eliminate two possible causes of failure).
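
    A rough, untested sketch of that two-argument variant (same logic as the script above, just with errors and the final status going to stderr; the variable names are mine):

    # untested sketch of the two-arg version suggested above
    use strict;
    use warnings;
    use POSIX;

    die "Usage: perl $0 FileListToValidate OutFile\n"
        if @ARGV != 2 or ! -f $ARGV[0];

    my ( $listfile, $outfile ) = @ARGV;
    die "$outfile already exists -- I will not overwrite it\n" if -e $outfile;

    open( my $in,  '<', $listfile ) or die "Can't open $listfile: $!\n";
    open( my $out, '>', $outfile )  or die "Can't write $outfile: $!\n";

    while ( my $name = <$in> ) {
        chomp $name;
        $name =~ tr/"//d;                 # strip double-quotes
        my @stats = stat $name;           # one stat per file
        print $out join( '|',
            $name,
            @stats ? ( $stats[7],
                       POSIX::strftime( '%m/%d/%Y %I:%M %p', localtime $stats[9] ) )
                   : ( 'notfound' ) x 2
        ), "\n";
    }
    warn "success\n";   # status goes to stderr instead of a third file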

    As for the second try, processing the output of some other command is bound to take longer (and can cause more trouble). Don't do that when a perl internal function can do the same thing.
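
    For instance, a single built-in stat call returns the same metadata the dir parsing tries to dig out (the UNC path here is just a made-up example):

    my $path = '\\\\serverA\\directoryA\\FileA';   # hypothetical path from the input list
    if ( my @st = stat $path ) {                   # one call; no subprocess, no regex parsing
        print "$path|$st[7]|$st[9]\n";             # size in bytes, raw mtime in epoch seconds
    }
    else {
        print "$path|notfound|notfound\n";
    }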

Re: sizeDateValidator.pl is horribly slow
by cavac (Prior) on Nov 06, 2011 at 03:03 UTC

    First of all, you should use strict; and use warnings;.
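
    Those two pragmas turn silent mistakes into compile-time errors. A contrived example of the kind of typo they catch:

    use strict;
    use warnings;

    my $size = 42;
    print $sise;   # dies at compile time: Global symbol "$sise" requires explicit package name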

    On your first attempt, you do something like this:

    $timestamp = POSIX::strftime("%m/%d/%Y %I:%M %p", localtime((stat $files)[9])); # or change localtime to gmtime
    $size = (stat $files)[7];

    You run stat twice, forcing the operating system to grab the same data twice. Why?

    Something like this should do the trick in less time:

    # Get both pieces of information we need
    my ($size, $timestamp) = ((stat $files)[7, 9]);

    # Convert timestamp
    $timestamp = POSIX::strftime("%m/%d/%Y %I:%M %p", localtime($timestamp));

    Don't use '#ff0000':
    use Acme::AutoColor; my $redcolor = RED();
    All colors subject to change without notice.
      Thanks cavac for bestowing your wisdom upon me. The script runs almost a minute faster on the 20K recordset I am using to test. Prior to your suggestion the script was taking 3 minutes 15 seconds, and now it runs in 2 minutes 20 seconds. A very significant improvement.
Re: sizeDateValidator.pl is horribly slow
by CountZero (Bishop) on Nov 06, 2011 at 08:53 UTC
    Checking 20,000 files in 3 minutes means you are doing more than 100 stat checks per second (actually double that number, because you stat twice for each file), and that over a server link. I think that is pretty fast.
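
    (As a quick back-of-the-envelope check: 20,000 files / 180 seconds ≈ 111 files per second, and at two stat calls per file that is over 200 metadata requests per second across the network.)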

    What kind of performance were you expecting?

    Update:

    I ran a little test myself.

    use Modern::Perl;
    use Time::HiRes qw/time/;

    open my $LIST, '<', 'M:/YP/list.txt' or die $!;
    my @stats;
    my $start = time;
    while (<$LIST>) {
        chomp;
        push @stats, stat "M:/YP/$_";
    }
    my $end = time;
    say 'stat checks took ', $end - $start, ' seconds.';
    say 'That is ', 540 / ($end - $start), ' checks per second.';
    I checked 540 existing files myself over my local network (the "M" drive is on a NAS) and got about 54 stat checks per second. So your performance is actually quite good!

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Thanks Monks! I'm going to go with the suggestion for the single stat call, and let you know how much faster it runs. I won't be back in this environment until tomorrow but should get time to do it then. BTW, my goal was to have it run in less than a minute, but it sounds like that may have been wishful thinking. If I can get it down to < 2 min I'll be happy.