
I wrote a horribly slow solution to a terribly easy problem, then wrote a different solution that ended up even slower. Yes, I'm a neophyte, and yes, this is Windows, but I think I can do better with the help of some mad monks, and I don't want to let Perl down by living with my original poor attempts. I'm not looking for anyone to code this for me, just to suggest the best approach. Anyway...

Problem:
Given a file that lists potential files, one per line. Some are on the same server, some in the same directory; some really exist, some may not. Example input:

\\serverA\directoryA\FileA
\\serverA\directoryA\FileB
\\serverA\directoryB\subdirectoryA\FileA
\\serverA\directoryB\subdirectoryA\FileB
\\serverB\directoryA\FileA
\\doesntexist\doesntexist\doesntexist

The Perl script should check whether each file exists. If it does, it should write the file name along with the file's size and date-modified attributes, with the three pieces of information pipe-delimited. If it does not exist, write the file name anyway, but put "notfound" for the two attributes. So for the input above, the output should look something like this:

\\serverA\directoryA\FileA|25521|10/19/2011 01:32 PM
\\serverA\directoryA\FileB|14288|07/28/2011 09:31 AM
\\serverA\directoryB\subdirectoryA\FileA|13384|07/28/2011 09:30 AM
\\serverA\directoryB\subdirectoryA\FileB|15667|08/05/2011 08:53 AM
\\serverB\directoryA\FileA|15274|08/11/2011 03:15 PM
\\doesntexist\doesntexist\doesntexist|notfound|notfound
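
To pin down the record format, here is a minimal sketch of the per-path step. The sub name format_record is made up for illustration. It assumes one stat call per path, with -f _ reusing the cached stat buffer, and it has not been benchmarked against real network shares:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: build one "name|size|timestamp" record for a path.
sub format_record {
    my ($path) = @_;

    # One filesystem lookup per path; stat returns an empty list on failure.
    my @st = stat $path;

    # Missing paths and non-files (e.g. directories) are reported as notfound.
    # "-f _" tests the buffer left by the stat above instead of hitting the disk again.
    return "$path|notfound|notfound" unless @st && -f _;

    # $st[7] is the size in bytes, $st[9] the last-modified time.
    my $stamp = strftime('%m/%d/%Y %I:%M %p', localtime $st[9]);
    return "$path|$st[7]|$stamp";
}
```

Calling format_record on each chomped input line and printing the result with a trailing newline would give output in the shape shown above.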

Here is my first attempt. Running this on an input file of 20,000 entries took just over 3 minutes, which I think is piss-poor slow:

use POSIX;


my $timestamp; # eg 10/24/2011 5:01:22 PM
my $size; # size of file will be bytes
my @inp; #array for input files
my $outfile; #second argument the output
my $statfile; #third argument the status

# open status and report in progress. Will abort if status file already here
if (-e $ARGV[2]) {
die "status file already exists!";
}

open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
print STAT "inprogress";
close(STAT); 

#quit unless we have the correct number of command-line args
my $num_args = $#ARGV + 1;
if ($num_args != 3) {
print "\nUsage: perl sizeDateValidator.pl <listOfFilesToValidate> <outputFile> <statusFile>\n";
open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
print STAT "error1";
close(STAT); 
exit;
}

#grab files to validate from first argument which should be a readable file
if (-e $ARGV[0]) {
open(FILE, "<", $ARGV[0]) || die "Can't open file $ARGV[0]";
@inp = <FILE>;
chomp @inp;
close FILE;
} else { 
open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
print STAT "error2";
close(STAT); 
exit;
}

#check to make sure output file doesn't already exist like from previous run. 
#don't want to overwrite anything I shouldn't
$outfile = $ARGV[1];
if (-e $outfile) {
open(STAT, ">", $ARGV[2]) || die "can't create statfile $ARGV[2]";
print STAT "error3";
close(STAT); 
exit;
}

#for each file in list, get size and timestamp
#if the element is not a file, report this
#write this to output file passed in as argument.
open(FILE, ">", $outfile) || die "Can't open file $outfile";
foreach my $files(@inp){
$files =~ s/"//g;
if (-f $files) {
$timestamp = POSIX::strftime("%m/%d/%Y %I:%M %p", localtime(( stat $files)[9])); # or change localtime to gmtime
$size = (stat $files)[7];
print FILE "$files|$size|$timestamp\n";
}
else
{
print FILE "$files|notfound|notfound\n";
}
}
close(FILE);
open(STAT, ">$ARGV[2]") || die "can't create statfile $ARGV[2]";
print STAT "success";
close(STAT);

Here is the second way I tried, which took over 10 minutes to run, and that's just embarrassing:

my $num_args; # for inp arg validation
my $statfile; #file for reporting progress of execution; third arg
my $inp; #input file; first arg
my %tempdir; #hash for storing unique file paths
my $foo; #%tempdir keys
my $dircommand2; #dir /-C
my @rawinp2; #array to hold results of dir command
my $direntry2; # for individual @rawinp2 processing
my $stripfile; # stripped down entry 
my $timestamp2; #catch regex results
my $bytes2; #catch regex results
my $file_name2; #catch regex results
my @array_of_filestats; # holding output of stats e.g. \\swstgsqldb01\D$\mktemp\AnaWeekly_20110817.docx|13506|08/17/2011 02:42 PM
my @inpobjects; # holding input objects to compare to
my $outfile; #output file; second arg
my $trimmed; # elements from @array_of_filestats
my $count; # counter to check for non-existing targets
my @pieces; # split results file entries


#Will abort if status file already here
if (-e $ARGV[2]) {
die "status file already exists!";
}

# open status and report in progress
$statfile = $ARGV[2];
open(STAT, ">", $statfile) || die "can't create statfile $statfile";
print STAT "inprogress";
close(STAT); 

#quit unless we have the correct number of command-line args
$num_args = $#ARGV + 1;
if ($num_args != 3) {
print "\nUsage: perl sizeDateValidator.pl <listOfFilesToValidate> <outputFile> <statusFile>\n";
open(STAT, ">", $statfile) || die "can't create statfile $statfile";
print STAT "error1";
close(STAT); 
exit;
}

#Get every unique path from input file     
%tempdir;
$inp = $ARGV[0];
open(INP, $inp)||die "Can't open: $inp";
    while(<INP>){
        chomp;
        ($stripfile) = /(.*\\).*$/;
        $tempdir{$stripfile}++;
    }
close(INP);


#run our dir command on each of the individual keys and store in array
foreach $foo (keys %tempdir) {
            $dircommand2 = "dir /-C ";  
            if (-e $foo) {
            $dircommand2 = $dircommand2 . $foo;
            @rawinp2 = `$dircommand2`;
                foreach $direntry2(@rawinp2) {
                    next if $direntry2 =~ /<DIR>/;
                    #next if !($timestamp = ($direntry =~ /^\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2}\s[AP][M]/g)[0]); #if it doesn't start MM/DD/YYYY then skip
                    next if !(($timestamp2, $bytes2 , $file_name2 ) = (($direntry2 =~ /^(\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2}\s[AP][M])\s+(\d*)\s+(.*)$/)[0,1,2]));
                    $file_name2 = $foo . $file_name2;
                    $timestamp2 =~ s/\s{2}/ /;
                    push (@array_of_filestats, "$file_name2|$bytes2|$timestamp2\n");
                }
            }
}

# get array of input objects
open(ARGINP, $inp)||die "Can't open: $inp";
    while(<ARGINP>){
    chomp;
    push(@inpobjects,$_);
    }    
close(ARGINP);


#check to make sure output file doesn't already exist like from previous run. 
#don't want to overwrite anything I shouldn't
$outfile = $ARGV[1];
if (-e $outfile) {
open(STAT, ">", $statfile) || die "can't create statfile $statfile";
print STAT "error3";
close(STAT); 
exit;
}

# if matches are made, print stats to output file; if not record "notfound"
open(FILE, ">", $outfile) || die "Can't open file $outfile";
for ($i = 0; $i < @inpobjects; $i++) {
    chomp $inpobjects[$i];
    $count = 0;
    foreach $trimmed(@array_of_filestats) {
        $count++;
        chomp $trimmed;
        @pieces = split(/\|/, $trimmed);
        #print "inpobject: $inpobjects[$i]\npieces: $pieces[0]\n\n";
        if ($inpobjects[$i] eq $pieces[0]){
        print FILE "$trimmed\n";
        last;
        } else {
        if (($#array_of_filestats + 1) == $count) {
        print FILE "$inpobjects[$i]|notfound|notfound\n";
        }
        }
    }   
}
close(FILE);
open(STAT, ">", $statfile) || die "can't create statfile $statfile";
print STAT "success";
close(STAT);
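
One thing that stands out in the second attempt is the final matching step: every input line is compared against every entry in @array_of_filestats, so the work grows with (input lines x dir entries). A minimal sketch of the same exact-match lookup done through a hash instead, assuming the records are the same "name|size|timestamp" strings built above. The sub names index_records and lookup_record are made up for illustration, and this is untested against real dir output:

```perl
use strict;
use warnings;

# Hypothetical helper: index "name|size|timestamp" records by file name.
sub index_records {
    my (@records) = @_;
    my %by_name;
    for my $record (@records) {
        chomp(my $line = $record);           # copy, so the caller's data is untouched
        my ($name) = split /\|/, $line, 2;   # everything before the first pipe
        $by_name{$name} = $line;
    }
    return \%by_name;
}

# Hypothetical helper: return the stored record, or the notfound record.
# Uses an exact string match on the name, like the "eq" test in the loop above.
sub lookup_record {
    my ($by_name, $path) = @_;
    return exists $by_name->{$path}
        ? $by_name->{$path}
        : "$path|notfound|notfound";
}
```

With this, the output loop would call index_records(@array_of_filestats) once and then lookup_record for each input line. It also emits a notfound line when the stats array is empty, which the nested loop does not. Whether this helps more than cutting down the number of dir calls is something only timing on the real input can answer.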

Any monks out there who can suggest a better, faster approach?
Many thanks in advance.

In reply to sizeDateValidator.pl is horribly slow by msensay
