
I am writing a small script to process my web proxy log to replace the DOS bat file that does it now. The bat file outputs 5 different text files (per log) based on 5 successive findstr calls of the form:

for %%f in (htt*) do findstr "mail.yahoo aolmail hotmail" %%f > %%f.mail
for %%f in (htt*) do findstr "some other string" %%f > %%f.mail

etc. I figured that since the bat file essentially reads each log 5 times, I could write a Perl script that reads the file once and does all 5 comparisons on each line. Since I'd only be reading each file once, it should be more efficient. Imagine my surprise, after writing the script, to find that the Perl version takes about twice as long to complete as the bat file. Can anyone shed some light on this? In particular, I'm interested in whether there's a more efficient way to do the pattern matching. Here's the script...

use Cwd;

@logs = <@ARGV>;
%reports=();
%results=();

#first, read in the input file to populate our data structure
#the structure is a hash of arrays where the hash key is the name for the output file
#the first element of the array is the string we're looking for, the second is the email
#address, and the third is the expiration date

#get the current date for use in the upcoming loop
($sec,$min,$hr,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
#apply the offsets to the year and month so we can do a straight comparison
$year = $year +1900;
$mon = $mon +1;

$path = cwd;


open (INPUT_IN, "bat/FindSitesInput.txt")|| die "Can't open input file! :$!";

while (defined($currentLine = <INPUT_IN>)) 
{
    $_=$currentLine;
    if(! m/^\#/)
    {
        #split currentLine into tab-delimited tokens
        @tokens = split("\t",$currentLine);
        #only create an entry if the report is still valid
        
        
        if(!isExpired($tokens[5]))
        {
            #since we're using references, must declare temp as "my" to ensure we're creating
            #an object that is local to this block of code.
            my @temp = ($tokens[1], $tokens[2], $tokens[3], $tokens[4], $tokens[5], $tokens[6], $tokens[7], $tokens[8], $tokens[9]);
            #can only store a reference to an array in the hash (not the array itself)
            $reports{$tokens[0]}=\@temp;
            #initialize the results structure here
            my @data=();
            $results{$tokens[0]}=\@data;
        }
    }
}


close (INPUT_IN) || die "Can't close input file: $!";



#now run the reports on each log file
foreach $logfile(@logs)
{
    print "\nSearching $logfile\n";
    open (LOG_IN, "$logfile")|| die "Can't open $logfile! :$!";
    while (defined($currentLine = <LOG_IN>)) 
    {    
    
        #Only want to search the files once (obviously)
        #therefore must apply all the tests in the "reports" datastructure
        #to each line in the file.
        
        #logs are space delimited
        @tokens = split(" ",$currentLine);
                
        foreach $rep (keys %reports)
        {
            
            @comparisons = split(" ", $reports{$rep}[0]);
            $match = 0;
            foreach $item (@comparisons)
            {
                if($reports{$rep}[1] eq 'site' && !$match)
                {        
                    $_=$tokens[6];
                    if(/$item/)
                    {
                        $match = 1;
                    }
                }
                elsif ($reports{$rep}[1] eq 'ip' && !$match)
                {
                    $_=$tokens[2];
                    if(/$item/)
                    {
                        $match = 1;
                    }
                }
            }
            if($match && $reports{$rep}[2] eq 'normal')
            {
                push @{$results{$rep}}, $currentLine;
            }
            elsif(!$match && $reports{$rep}[2] eq 'reverse')
            {
                push @{$results{$rep}}, $currentLine;
            }
        }
        
    }
    close (LOG_IN) || die "Can't close log file: $!";
    
    #now write the output files for each report for this log
    foreach $rep (keys %results)
    {
        if(! (-d $reports{$rep}[5]))
        {
            system("mkdir $reports{$rep}[5]");
        }
        open (OUTPUT, ">$reports{$rep}[5]/$logfile-$rep")|| die "Can't open $reports{$rep}[5]/$logfile-$rep! :$!";
        print OUTPUT @{$results{$rep}};
        if(close OUTPUT) 
        {    
            if(!($reports{$rep}[3] eq 'none'))
            {
                system("c:/blat/blat \"$path/$reports{$rep}[5]/$logfile-$rep\" -t \"$reports{$rep}[3]\"");
            }
            if(($reports{$rep}[6] eq 'true'))
            {
                #call IPandBytes.pl
                system("perl bat/ipandbytes.pl $reports{$rep}[5]/$logfile-$rep");
                #remove file if desired
                if(($reports{$rep}[7] eq 'false'))
                {
                    system("del $reports{$rep}[5]\\$logfile-$rep");
                }
            }
            if(($reports{$rep}[8] eq 'true'))
            {
                
                system("pkzip25 -add -max $reports{$rep}[5]/$logfile-$rep");
                system("move $reports{$rep}[5]/$logfile-$rep.zip /zip");
                
            }
        }
        else
        {
            die "Can't close output file: $!";
        }
    }
    
}


#this function takes 1 argument (a date string of the form "MM/DD/YYYY")
#$year, $mday and $mon must be initialized prior to calling this function
#isExpired returns false if the argument is "none" or is chronologically after the date
#represented by $mon, $year and $mday; otherwise it returns true
sub isExpired
{
    $_= pop @_;
    chomp;
    $temp = "none";
    if(/$temp/)
    {
        return 0;
    }
    
    /([0-9]+)\/([0-9]+)\/([0-9][0-9][0-9][0-9])/;
    $repMon = $1;
    $repDay = $2;
    $repYear = $3;
    
    if($repYear > $year)
    {
        return 0;
    }
    elsif ($repMon > $mon && $repYear == $year)
    {
        return 0;
    }
    elsif ($repDay > $mday && $repMon == $mon && $repYear == $year)
    {
        return 0;
    }
    else
    {
        return 1;
    }
}
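For comparison, here is a minimal sketch of one possible restructuring, which I have not benchmarked. The script above re-splits each report's search string and interpolates /$item/ inside the innermost loop for every log line, so one idea is to build a single precompiled regex per report with qr// before any log is read. The %reports contents, the sample patterns, and the names %compiled, %field_index and reports_for_line are all hypothetical; only the token positions (6 for 'site', 2 for 'ip') and the 'normal'/'reverse' modes come from the script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %reports: search strings, field type, mode.
my %reports = (
    mail  => [ 'mail.yahoo aolmail hotmail', 'site', 'normal'  ],
    local => [ '^10\.',                      'ip',   'reverse' ],
);

# Token positions used by the script above for each field type.
my %field_index = ( site => 6, ip => 2 );

# Compile each report's space-separated strings into one alternation, once.
my %compiled;
for my $rep (keys %reports) {
    my ($strings, $type, $mode) = @{ $reports{$rep} };
    my $alternation = join '|', map { "(?:$_)" } split ' ', $strings;
    $compiled{$rep} = {
        re      => qr/$alternation/,
        field   => $field_index{$type},    # undef for an unknown type
        reverse => $mode eq 'reverse',
    };
}

# Return the names of the reports that would keep this log line.
sub reports_for_line {
    my ($line) = @_;
    my @tokens = split ' ', $line;          # logs are space delimited
    my @hits;
    for my $rep (sort keys %compiled) {
        my $c     = $compiled{$rep};
        my $value = defined $c->{field} ? $tokens[ $c->{field} ] : undef;
        my $match = defined $value && $value =~ $c->{re};
        # 'normal' keeps matching lines, 'reverse' keeps non-matching ones
        push @hits, $rep if $match xor $c->{reverse};
    }
    return @hits;
}
```

Inside the main loop this would replace the nested foreach over @comparisons: call reports_for_line($currentLine) and push the line onto @{$results{$_}} for each name returned. Whether this closes the gap with findstr is an assumption that would need timing on real logs.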

In reply to Pattern matching speed by Anonymous Monk
