Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a small script to process my web proxy log to replace the DOS bat file that does it now. The bat file outputs 5 different text files (per log) based on 5 successive findstr calls of the form:
for %%f in (htt*) do findstr "mail.yahoo aolmail hotmail" %%f > %%f.mail
for %%f in (htt*) do findstr "some other string" %%f > %%f.mail
etc. I figured that since the bat file is essentially reading each of these files in 5 times, I could write a Perl script that reads each file once and does all 5 comparisons on each line. Since I'd only be reading the files once, it should be more efficient. Imagine my surprise, after writing the script, to find that the Perl version takes twice as long to complete as the bat file. Can anyone shed some light on this? In particular, I'm interested in whether there's a more efficient way to do the pattern matching. Here's the script...
use Cwd;

@logs = <@ARGV>;
%reports = ();
%results = ();

#first, read in the input file to populate our data structure
#the structure is a hash of arrays where the hash key is the name for the output file
#the first element of the array is the string we're looking for, the second is the email
#address, and the third is the expiration date

#get the current date for use in the upcoming loop
($sec,$min,$hr,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
#apply the offsets to the year and month so we can do a straight comparison
$year = $year + 1900;
$mon  = $mon + 1;

$path = cwd;

open (INPUT_IN, "bat/FindSitesInput.txt") || die "Can't open input file! :$!";
while (defined($currentLine = <INPUT_IN>)) {
    $_ = $currentLine;
    if (! m/^\#/) {
        #split currentLine into tab-delimited tokens
        @tokens = split("\t", $currentLine);
        #only create an entry if the report is still valid
        if (!isExpired($tokens[5])) {
            #since we're using references, must declare temp as "my" to ensure we're
            #creating an object that is local to this block of code.
            my @temp = ($tokens[1], $tokens[2], $tokens[3], $tokens[4], $tokens[5],
                        $tokens[6], $tokens[7], $tokens[8], $tokens[9]);
            #can only store a reference to an array in the hash (not the array itself)
            $reports{$tokens[0]} = \@temp;
            #initialize the results structure here
            my @data = ();
            $results{$tokens[0]} = \@data;
        }
    }
}
close (INPUT_IN) || die "Can't close input file: $!";

#now run the reports on each log file
foreach $logfile (@logs) {
    print "\nSearching $logfile\n";
    open (LOG_IN, "$logfile") || die "Can't open $logfile! :$!";
    while (defined($currentLine = <LOG_IN>)) {
        #Only want to search the files once (obviously)
        #therefore must apply all the tests in the "reports" datastructure
        #to each line in the file.
        #logs are space delimited
        @tokens = split(" ", $currentLine);
        foreach $rep (keys %reports) {
            @comparisons = split(" ", $reports{$rep}[0]);
            $match = 0;
            foreach $item (@comparisons) {
                if ($reports{$rep}[1] eq 'site' && !$match) {
                    $_ = $tokens[6];
                    if (/$item/) {
                        $match = 1;
                    }
                }
                elsif ($reports{$rep}[1] eq 'ip' && !$match) {
                    $_ = $tokens[2];
                    if (/$item/) {
                        $match = 1;
                    }
                }
            }
            if ($match && $reports{$rep}[2] eq 'normal') {
                push @{$results{$rep}}, $currentLine;
            }
            elsif (!$match && $reports{$rep}[2] eq 'reverse') {
                push @{$results{$rep}}, $currentLine;
            }
        }
    }
    close (LOG_IN) || die "Can't close log file: $!";

    #now write the output files for each report for this log
    foreach $rep (keys %results) {
        if (! (-d $reports{$rep}[5])) {
            system("mkdir $reports{$rep}[5]");
        }
        open (OUTPUT, ">$reports{$rep}[5]/$logfile-$rep")
            || die "Can't open $reports{$rep}[5]/$logfile-$rep! :$!";
        print OUTPUT @{$results{$rep}};
        if (close OUTPUT) {
            if (!($reports{$rep}[3] eq 'none')) {
                system("c:/blat/blat \"$path/$reports{$rep}[5]/$logfile-$rep\" -t \"$reports{$rep}[3]\"");
            }
            if (($reports{$rep}[6] eq 'true')) {
                #call IPandBytes.pl
                system("perl bat/ipandbytes.pl $reports{$rep}[5]/$logfile-$rep");
                #remove file if desired
                if (($reports{$rep}[7] eq 'false')) {
                    system("del $reports{$rep}[5]\\$logfile-$rep");
                }
            }
            if (($reports{$rep}[8] eq 'true')) {
                system("pkzip25 -add -max $reports{$rep}[5]/$logfile-$rep");
                system("move $reports{$rep}[5]/$logfile-$rep.zip /zip");
            }
        }
        else {
            die "Can't close output file: $!";
        }
    }
}

#this function takes 1 argument (a date string of the form "MM/DD/YYYY")
#$year, $mday and $mon must be initialized prior to calling this function
#isExpired returns true if the argument is chronologically after the date
#represented by $mon, $year and $mday otherwise it returns false
sub isExpired {
    $_ = pop @_;
    chomp;
    $temp = "none";
    if (/$temp/) {
        return 0;
    }
    /([0-9]+)\/([0-9]+)\/([0-9][0-9][0-9][0-9])/;
    $repMon  = $1;
    $repDay  = $2;
    $repYear = $3;
    if ($repYear > $year) {
        return 0;
    }
    elsif ($repMon > $mon && $repYear == $year) {
        return 0;
    }
    elsif ($repDay > $mday && $repMon == $mon && repYear == $year) {
        return 0;
    }
    else {
        return 1;
    }
}

Re: Pattern matching speed
by chromatic (Archbishop) on Jun 10, 2003 at 19:25 UTC

    It looks to me like you've shifted the burden of your processing away from looking at each log file line more than once to looking at each comparison of each report more than once. I'm not surprised this is a lot slower; you've probably tripled the number of comparisons.

    In particular, when you have an operation such as this one in a loop:

    @comparisons = split(" ", $reports{$rep}[0]);

    you should ask yourself, "Self, why not just stick this information in the data structure as an array when I generate it?"
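    For instance, here is a minimal sketch of doing that split once, when the input file is read (the tab-delimited layout and the FindSitesInput.txt path are assumed from the posted script; the expiry filtering is omitted for brevity, so treat this as an illustration rather than a drop-in replacement):

    use strict;
    use warnings;

    open my $input, '<', 'bat/FindSitesInput.txt'
        or die "Can't open input file: $!";

    my %reports;
    while ( my $line = <$input> ) {
        next if $line =~ /^#/;
        chomp $line;
        my @tokens = split /\t/, $line;

        # Pre-split the search terms here, once, instead of splitting
        # $reports{$rep}[0] again for every line of every log file.
        my @fields = @tokens[ 1 .. 9 ];
        $fields[0] = [ split ' ', $fields[0] ];
        $reports{ $tokens[0] } = \@fields;
    }
    close $input;

    # The matching loop then iterates the ready-made list directly:
    #   foreach my $item ( @{ $reports{$rep}[0] } ) { ... }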

    Without seeing where %reports comes from, I can't give you any more specific ideas. If you could classify each line with only one comparison, though, I bet you'd see a better speed gain.

Re: Pattern matching speed
by BrowserUk (Patriarch) on Jun 10, 2003 at 23:30 UTC

    There are several things that would probably help to improve the performance of your script.

    Rather than using system to start a shell to do things like creating directories and deleting files, you should use the builtin functions: mkdir and unlink. You could also (probably) save shelling out twice when you zip the data and then move it, by combining the two commands into one using the '&&' conditional command separator. I.e.

    system('pkzip25 ..... && move .... \zip');
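    As a rough sketch of the non-shell alternatives (File::Copy's move stands in for the shell move; the paths below are placeholders, not the real values from your input file):

    use strict;
    use warnings;
    use File::Copy qw(move);

    my $dir  = 'reports';            # placeholder for $reports{$rep}[5]
    my $file = "$dir/httlog-mail";   # placeholder for "$reports{$rep}[5]/$logfile-$rep"

    # mkdir/unlink builtins and File::Copy::move in place of
    # system("mkdir ..."), system("del ...") and system("move ...")
    if ( !-d $dir ) {
        mkdir $dir or die "Can't create $dir: $!";
    }
    unlink $file or die "Can't delete $file: $!";
    move( "$file.zip", 'zip' ) or die "Can't move $file.zip: $!";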

    If the data being zipped is produced by the other perl script you are running, then you might look at using one of the zip modules from cpan to write directly into the zip.
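    For example, a sketch using Archive::Zip (naming that particular module is my assumption, and the filenames are again placeholders):

    use strict;
    use warnings;
    use Archive::Zip qw( :ERROR_CODES );

    my $report = 'reports/httlog-mail';    # placeholder report file

    my $zip = Archive::Zip->new();
    $zip->addFile( $report );              # store the report in the archive
    $zip->writeToFileNamed( "$report.zip" ) == AZ_OK
        or die "Failed to write $report.zip";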

    In the first while loop, you split $currentLine into the global array @tokens. You then decide whether to store this data into your hash, then re-assign @tokens[1..9] (one element at a time) into a lexical (my) array @temp, so that you can assign a reference to that array in your hash. That whole process is much more easily achieved using an array slice and an anonymous array:

    $reports{$tokens[0]} = [ @tokens[1..9] ];

    Probably another micro-optimisation, but easier to type and read if nothing else.

    It generally makes sense to use lexical my variables rather than globals. Firstly, these tend to be somewhat quicker to access than globals. It's a micro-optimisation, but worth having if you need better performance.

    The other benefit of doing so is that it would allow you to use strict, which would have caught your typo on line 173.

    elsif ($repDay > $mday && $repMon == $mon && repYear == $year)

    Did you spot it? Perl did, once strict was in force. You omitted the sigil ($) on repYear. It may not break the code, but it would be better if it were corrected.

    In your inner-most (and therefore most repeated) loop, you are setting a flag if you get a match. You also test that variable on each subsequent iteration and don't attempt further matches if you've already found one, but you still iterate through all of @comparisons. You can easily short-circuit that by using last to exit the loop as soon as you find a match.
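    Sketching that inner loop with the early exit (and with the site/ip choice hoisted out of the loop, assuming every report is one or the other; variable names follow your script):

    # Pick the log field this report matches against, once per report.
    my $field = $reports{$rep}[1] eq 'site' ? $tokens[6] : $tokens[2];

    my $match = 0;
    foreach my $item ( @comparisons ) {
        if ( $field =~ /$item/ ) {
            $match = 1;
            last;    # stop as soon as one term has matched
        }
    }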

    It's quite likely (though there isn't enough information about what is in your @comparisons array to be sure) that you could join your search terms together and perform the match with a single regex (used twice), rather than looping over them.

    Something like

    @comparisons = map { quotemeta } @comparisons;
    my $re_comps = '(?:' . join( '|', @comparisons ) . ')';
    $re_comps = qr[$re_comps];

    I haven't done this, as the best way of doing it will depend on the nature of the contents of that array. E.g. if they can be regexes themselves, then you would need to treat them differently.
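    If it does apply, the per-line test for each report then collapses to a single match, e.g. (using $tokens[6] for a 'site' report, as in your script):

    my $match = ( $tokens[6] =~ $re_comps ) ? 1 : 0;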

    You (and several other people:) will probably hate what I've done to your script. In particular, sorry that the comments got 'lost' during refactoring. It should at least do the same as your original, and with a bit of luck might run a little quicker. It might give you a starting place to go forward from.

    It would very much be worth using Devel::SmallProf or similar to profile the program to find out where the time is being spent, before making further optimisations, or even accepting those I've offered.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller