vatheck has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find a way to make this script faster; it just takes way too long to run. It grabs its data from flat files of CSV data and calculates the statistic it's looking at for each hour of each day, across the roughly 30 days of logfiles we keep. Feel free to ask any questions and I'll try to answer when I get the chance.

#!/usr/bin/perl
#pragmas
use strict;
use warnings;
#modules
use File::Find::Rule;
#declaring global variables
use vars qw/$reportfile $lastday @datadir @sites @dates %devicecount @workingdata/;
#set up the datadir array
@datadir = qw[/dir1 /dir2 /dir3 /dir4 /dir5 /dir6 /dir7];
#set up the last day variable for File::Find::Rule
my $lastday = time() - 86400;

#get a list of sites
{
    my %uniquehash;
    my @devicefilelist = File::Find::Rule->file()->name('*SomeText*')->mtime("<$lastday")->in(@datadir);
    foreach (@devicefilelist) {
        if (open FILE, $_) {
            while (<FILE>) {
                my $firstsplit = (split /,/, $_)[2];
                my $site       = (split /M/, $firstsplit)[0];
                $uniquehash{$site} = 1;
            }
            close FILE;
        }
    }
    @sites = sort keys %uniquehash;
    chomp @sites;
}

#get a list of dates
{
    my %uniquehash;
    my @datefilelist = File::Find::Rule->file()->name('*SomeText*')->in(@datadir);
    foreach (@datefilelist) {
        my $date = (split /_/, $_)[1];
        $uniquehash{$date} = 1;
    }
    @dates = sort keys %uniquehash;
}

#get a device count for each site
{
    my %uniquehash;
    my @devicecountlist = File::Find::Rule->file()->name('*Texthere*')->mtime("<$lastday")->in(@datadir);
    foreach my $file (@devicecountlist) {
        if (open FILE, $file) {
            while (<FILE>) {
                if (!/SomethingIDontWant/ && !/SecondThingIdontWant/ && !/ThirdThingIdontWant/) {
                    my $device = (split /,/, $_)[2];
                    $uniquehash{$device} = 1;
                }
            }
            close FILE;
        }
    }
    foreach my $site (@sites) {
        foreach (keys %uniquehash) {
            if (/${site}\b/) {
                $devicecount{$site}++;
            }
        }
    }
}

#the main statistics routine
{
    my ($site, $devicecount, $value, $counter, @workingdata, @workingday);
    my @allfilenames = File::Find::Rule->file()->name( qr/Site-Calls/ )->in(@datadir);
    open REPORT, ">>", &gen_reportfile_name();
    foreach my $date (@dates) {
        @workingday  = grep /$date/, @allfilenames;
        @workingdata = ();
        #slurp all of this day's files into memory
        foreach my $file (@workingday) {
            if (open CURRENT, $file) {
                while (<CURRENT>) {
                    push @workingdata, $_;
                }
                close CURRENT;
            }
        }
        foreach (0 .. 23) {
            my $hour = sprintf "%02d", $_;
            while (($site, $devicecount) = each %devicecount) {
                $value   = 0;
                $counter = 0;
                foreach (@workingdata) {
                    if (/${date}-${hour}/ && /$site/) {
                        my @Fld = split /,/;
                        $value += $Fld[11];
                        $counter++;
                    }
                }
                print REPORT "$site\t$devicecount\t${date}-${hour}\t";
                if ($counter == 0) { print REPORT "0\n" }
                &dividebyzero($counter, $value);
            }
        }
    }
    close REPORT;
}

#divide by 0 subroutine
sub dividebyzero {
    if    (($_[0] > 0) && ($_[1] > 0))  { print REPORT $_[1] / $_[0] . "\n" }
    elsif (($_[0] > 0) && ($_[1] == 0)) { print REPORT "0\n" }
}

#generate the reportfile name
sub gen_reportfile_name {
    my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst);
    ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime(time);
    $reportfile = sprintf "/dirhere/%4d%02d%02dtalkpath", $year + 1900, $mon + 1, $mday;
}

Replies are listed 'Best First'.
Re: speed issues
by samtregar (Abbot) on May 16, 2008 at 03:42 UTC
    If you want to make your code faster, you need to know where it's spending the most time. To do that, use a profiler. Normally I would suggest you start with Devel::DProf, but since you don't have a lot of subroutines it won't be much help. Instead, something like Devel::SmallProf might be better, since it's line-oriented.

    Once you've identified the slow parts of your code you can work on a solution. You'll be able to post more focused questions which are more likely to get answers here.
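
    For example — assuming your script is saved as report.pl (a placeholder; substitute the real name) — a quick line-by-line profile is just:

        perl -d:SmallProf report.pl    # per-line counts and times are written to smallprof.out
        perl -d:DProf report.pl        # per-sub profile in tmon.out; summarize it with dprofpp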

    -sam

      That's a new concept to me; thanks for the tip. I am just about off work for the night, but I will try that out tomorrow when I get in.
Re: speed issues
by 5mi11er (Deacon) on May 16, 2008 at 04:01 UTC
    I'm not yet completely grokking your pseudo-code, but it appears that you are doing way more work "finding" files than you need to. Why do you need to find things 3 separate times? Oops, make that 4. If the directories are very large (many, many files), each find is going to be very expensive. So, suggestion number one is to figure out a better way of getting at your data without having to search through several large directories multiple times.

    It also appears that you may be iterating over the same information multiple times. You're getting a list of sites first, then counting how many there were? Count them as you find them, if that's actually what's happening. Thus, suggestion two is to figure out how to iterate over the data once, gathering the info you need and processing it as you go.
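
    For instance, a single walk over the directories could feed all four of the later steps. A rough sketch, reusing the OP's placeholder patterns and variables and substituting a plain stat() test for File::Find::Rule's mtime rule (so treat it as illustrative rather than drop-in), might look like:

        # One walk over the directory trees, keeping the list in memory...
        my @allfiles = File::Find::Rule->file()->in(@datadir);

        # ...then partition that single list instead of hitting the disk again.
        my @sitefiles  = grep { /SomeText/ } @allfiles;                   # feeds the date list
        my @recent     = grep { (stat $_)[9] < $lastday } @sitefiles;     # feeds the site list
        my @countfiles = grep { /Texthere/ && (stat $_)[9] < $lastday } @allfiles;
        my @callfiles  = grep { /Site-Calls/ } @allfiles;                 # feeds the hourly loop

    The site list, date list, device counts and the hourly statistics can then all work from these in-memory lists without another find.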

    -Scott

Re: speed issues
by CountZero (Bishop) on May 16, 2008 at 05:16 UTC
    How slow is too slow? If you only have to run this program once a month (you are looking at the data for the last 30 days), is it really necessary that it finish in a few seconds? I would have it start once a month as a cron job during the night; even if it runs for a few hours, I will still find its results waiting when I start work in the morning.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: speed issues
by apl (Monsignor) on May 16, 2008 at 09:48 UTC
    You should consider using your system's sort utility (keyed on the time) on your log files at the end of each business day.

    You could then revise your code to go through each file's @workingdata once, rather than the current 24 times.
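
    A rough sketch of that single pass, reusing the OP's variable names and assuming the same record layout (site in field 2 before the 'M', value in field 11), might look like:

        # Bucket the day's records by site and hour in one pass, instead of
        # rescanning @workingdata for every (site, hour) combination.
        my (%sum, %count);
        foreach my $line (@workingdata) {
            next unless $line =~ /${date}-(\d\d)/;     # hour of this record
            my $hour = $1;
            my @Fld  = split /,/, $line;
            my $site = (split /M/, $Fld[2])[0];        # same extraction as the site list
            $sum{$site}{$hour}   += $Fld[11];
            $count{$site}{$hour}++;
        }
        foreach my $site (sort keys %sum) {
            foreach my $hour (sort keys %{ $sum{$site} }) {
                my $avg = $sum{$site}{$hour} / $count{$site}{$hour};
                print REPORT "$site\t$devicecount{$site}\t${date}-${hour}\t$avg\n";
            }
        }

    (One behavioural difference: site/hour slots with no records simply don't appear, where the original printed a 0 line for them.)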

Re: speed issues
by toolic (Bishop) on May 16, 2008 at 13:08 UTC
    This has nothing to do with your speed issue, but I thought I'd offer a less cluttered version of your sub gen_reportfile_name:
    sub gen_reportfile_name {
        my ($mday, $mon, $year) = (localtime)[3, 4, 5];
        $reportfile = sprintf '/dirhere/%4d%02d%02dtalkpath', $year + 1900, $mon + 1, $mday;
    }

    This uses a list slice to extract only the elements of interest from the list returned by localtime.
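
    Run on 16 May 2008, for example, either version would set $reportfile to /dirhere/20080516talkpath (and return that string, since the sprintf assignment is the sub's last expression).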