PerlScholar has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I have a script that performs some checks on files. The problem is that it's taking hours to run when I was hoping it would run in minutes. The first check uses the List::Compare module to see whether the same files exist in two directories. The second check compares the files' checksums with Digest::MD5 to make sure they are identical, and some of you suggested this is the likely cause of my problem. I would appreciate any advice on this. Many thanks!

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

#2nd part of script

my ($dir1, $dir2) = @ARGV;   # assumed here; set in the omitted 1st part of the script

my %dir1 = getChksums($dir1);
#print Dumper \%dir1;
my %dir2 = getChksums($dir2);
#print Dumper \%dir2;

my $num_errors = 0;
foreach my $file (keys %dir1) {
    if (!exists $dir2{$file}) {
        #print "file: $file doesn't exist in 2nd directory\n";
    }
    elsif ($dir1{$file} ne $dir2{$file}) {
        print "MD5 did not match for: $file\n";
        $num_errors++;
    }
    #print "$file\n";
}
#print "total errors = $num_errors\n";

sub getChksums {
    my $path = shift;
    my %file2chksum;
    opendir(my $indir, $path) or die "Error opening: $path";
    my @files = grep { -f "$path/$_" } readdir $indir;
    closedir $indir;    # closedir, not close, for a directory handle
    foreach my $file (@files) {
        open(my $in, '<', "$path/$file") or die "Error opening: $path/$file";
        $file2chksum{$file} = md5_hex(<$in>);
        #print "$file $file2chksum{$file}\n";
        close $in;
    }
    return %file2chksum;
}

Re: Digest::MD5 seems to slow down script
by Ratazong (Monsignor) on Sep 15, 2010 at 10:54 UTC
    One observation: you calculate the checksums over both directories in advance. You could save much time by only calculating them when required. Meaning:
    • only if the same filename is in dir1 and dir2, calculate and compare both checksums
    • you may perform additional (fast) checks beforehand, e.g. only calculate the checksums if the file sizes of both files are the same (see the sketch below)
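    For example, a rough sketch of that lazy approach (md5_of() is a hypothetical helper that slurps and hashes one file):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);

        # Hash a single file on demand, only when we actually need it.
        sub md5_of {
            my $path = shift;
            open my $fh, '<', $path or die "Error opening: $path";
            binmode $fh;
            local $/;    # slurp mode: read the whole file in one go
            return md5_hex(<$fh>);
        }

        my ($dir1, $dir2) = @ARGV;
        opendir my $dh, $dir1 or die "Error opening: $dir1";
        my @files = grep { -f "$dir1/$_" } readdir $dh;
        closedir $dh;

        foreach my $file (@files) {
            my ($f1, $f2) = ("$dir1/$file", "$dir2/$file");
            next unless -f $f2;          # only compare names present in both dirs
            if (-s $f1 != -s $f2) {      # cheap size check before any hashing
                print "size differs: $file\n";
                next;
            }
            print "MD5 did not match for: $file\n" if md5_of($f1) ne md5_of($f2);
        }
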
    HTH, Rata
Re: Digest::MD5 seems to slow down script
by moritz (Cardinal) on Sep 15, 2010 at 11:37 UTC
    As pointed out in the chatterbox, you could gain a lot by storing the MD5 sums together with each file's size and modification date, and on subsequent runs only recalculating a sum when the size or modification time has changed.
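
    A minimal sketch of that caching idea, assuming a Storable cache file (the file name and the cache layout are my inventions):

        use strict;
        use warnings;
        use Digest::MD5;
        use Storable qw(retrieve nstore);

        # Hypothetical cache layout: { $path => [ $size, $mtime, $md5hex ] }
        my $cache_file = 'md5cache.stor';
        my $cache = -e $cache_file ? retrieve($cache_file) : {};

        sub cached_md5 {
            my $path = shift;
            my ($size, $mtime) = (stat $path)[7, 9];    # bytes, epoch seconds
            my $hit = $cache->{$path};
            return $hit->[2]
                if $hit && $hit->[0] == $size && $hit->[1] == $mtime;
            open my $fh, '<', $path or die "Error opening: $path";
            binmode $fh;
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            $cache->{$path} = [ $size, $mtime, $md5 ];
            return $md5;
        }

        END { nstore($cache, $cache_file) }    # persist the cache for the next run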

    Additionally, the construct md5_hex(<IN>) is not optimal: the angle brackets in list context read the file line by line (even though there is no reason to split it into lines at all), and the whole thing is loaded into memory before hashing starts. Using $ctx->addfile(*IN) (as described in the Digest::MD5 documentation) instead should be a lot faster.

    Finally, you should probably set binmode on the IN file handle.
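
    For reference, a minimal sketch of the addfile approach (the helper name md5_of is mine):

        use strict;
        use warnings;
        use Digest::MD5;

        sub md5_of {
            my $path = shift;
            open my $fh, '<', $path or die "Error opening: $path";
            binmode $fh;            # hash raw bytes, as suggested above
            my $ctx = Digest::MD5->new;
            $ctx->addfile($fh);     # streams the file in fixed-size chunks
            close $fh;
            return $ctx->hexdigest;
        }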

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Digest::MD5 seems to slow down script
by zwon (Abbot) on Sep 15, 2010 at 11:53 UTC

    I would recommend you take a look at the thread MD 5 hash comparison/checker, and especially this reply. There's no need to compare MD5 sums; just compare the files.
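
    For instance, using the core File::Compare module (the module choice is my assumption; the linked reply may take a different approach):

        use strict;
        use warnings;
        use File::Compare qw(compare);

        my ($file1, $file2) = @ARGV;

        # compare() returns 0 if the files are equal, 1 if they differ,
        # and -1 on error. It reads the files chunk by chunk and stops
        # at the first difference, so no checksum is ever computed.
        my $rc = compare($file1, $file2);
        die "Error comparing $file1 and $file2: $!" if $rc < 0;
        print $rc == 0 ? "files match\n" : "files differ\n";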

Re: Digest::MD5 seems to slow down script
by Marshall (Canon) on Sep 15, 2010 at 12:27 UTC
    I'm not sure exactly what you are doing, but often just comparing the -M times is enough if you are trying to ascertain whether a copy is still a "copy" of the original.

    The file test operators that determine modification time and size do not have to read the whole file, just the directory information, and are therefore fast compared to MD5 checksum calculations. Doing a bit-by-bit compare of two files may even be faster than calculating the MD5 (no math required, but it needs more I/O bandwidth).

    Anyway, I would start with the idea of comparing -M times and then answer the question "why wouldn't that work?"; that will lead to alternate algorithms if needed. There are all kinds of scenarios here: the -M times could differ while the contents of the files are the same (an app re-wrote the file with no changes). But I suspect vast speed enhancements can be made if you figure out how, why, and when the files could be different, and use that application-specific knowledge to tailor your comparison algorithm. In any case, the modification date is normally the most important thing, and file size the second most important, aside from the actual filename of course!
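
    A quick sketch of that metadata-only pre-check (no file contents are read at all):

        use strict;
        use warnings;

        my ($orig, $copy) = @ARGV;

        # -s returns the size in bytes; -M returns the file's age in days
        # relative to script start time. Both come from the directory
        # entry alone, so they are cheap.
        if (-s $orig != -s $copy) {
            print "sizes differ\n";
        }
        elsif (-M $orig != -M $copy) {
            print "mtimes differ (contents might still be identical)\n";
        }
        else {
            print "same size and mtime: likely still a copy\n";
        }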

Re: Digest::MD5 seems to slow down script
by zentara (Cardinal) on Sep 16, 2010 at 10:02 UTC
    As zwon's reply pointed out, the way to speed up file comparisons is to first directly sample the files themselves, say a 1k buffer of bytes, then run an MD5 comparison ONLY if the buffers match.
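
    Something along these lines (sample_1k() and md5_of() are hypothetical helpers):

        use strict;
        use warnings;
        use Digest::MD5;

        # Read just the first 1k bytes of a file as a cheap fingerprint.
        sub sample_1k {
            my $path = shift;
            open my $fh, '<', $path or die "Error opening: $path";
            binmode $fh;
            my $buf = '';
            read $fh, $buf, 1024;    # less if the file is shorter
            close $fh;
            return $buf;
        }

        sub md5_of {
            my $path = shift;
            open my $fh, '<', $path or die "Error opening: $path";
            binmode $fh;
            my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            return $md5;
        }

        # Only pay for a full MD5 pass when the cheap samples agree.
        sub same_contents {
            my ($f1, $f2) = @_;
            return 0 if sample_1k($f1) ne sample_1k($f2);
            return md5_of($f1) eq md5_of($f2);
        }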

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: Digest::MD5 seems to slow down script
by DrHyde (Prior) on Sep 16, 2010 at 09:17 UTC
    What you've discovered is that reading a whole bunch of files and then doing some calculations on their contents takes time.