tekkie has asked for the wisdom of the Perl Monks concerning the following question:

I have a block of code that looks like this:
my ($cap, %dest);

# Open a file and slurp in the entire contents to one scalar
open(CAP, "$file");
while(<CAP>) {
    $cap .= $_;
}
close CAP;

# Scan through the contents for source/dest pairs and keep track of
# how many times the dest gets hit:
while($cap =~ /\s+(?:(?:\d{1,3}\.){3}\d{1,3})\s+->(\d{1,3}\.){3}\d{1,3})\s+/g) {
    unless(exists($dest{$1})) {
        $dest{$1} = 1;
    } else {
        $dest{$1}++;
    }
}

# Print out the destinations and the number of times they were hit
# in descending order
foreach my $dest_addr (sort { $dest{$b} <=> $dest{$a} } keys %dest) {
    print "$dest_addr\t$dest{$dest_addr}\n";
}
For those interested, the code is used to extract source/destination IP pairs from logs we use at the office...

The problem is that the code is called from a CGI script, and when running over 20,000 lines of data (the SAME data on each run, mind you), the process sometimes takes only 5 seconds to run, while other times it takes up to 5 minutes.

I realize that the actual execution times from run to run will be different based on how much activity is on the server at the time, but I really can't explain how parsing exactly the same file with exactly the same code can result in such drastically different execution times.

The remaining code surrounding this block is solid; I've done numerous Benchmark::Timer tests to check each piece of my code granularly, and this is the only place where I see a jump in execution time. It only happens sometimes, yet as far as I can tell, NOTHING changes from run to run.

Based on what I've got above, can anyone see where I may be going wrong? Are there glaringly obvious Perl traps that I'm missing? This one really has me stumped....

Thanks in advance,
Tekkie

Replies are listed 'Best First'.
Re: Same code, same data, different execution times?
by dws (Chancellor) on Mar 19, 2003 at 23:37 UTC
    When running over 20,000 lines of data (the SAME data on each run, mind you), the process will sometimes take only 5 seconds to run, and other times the process will take up to 5 minutes to run.

    That's a 60x difference, which seems way more than can be accounted for by considering things like whether the file is in the OS's disk cache.

    Depending on system load and contention for memory, your approach of reading the entire file into memory may be working against you. Given the pattern you're looking for, you could just as easily scan a line-at-a-time, which would keep the CGI's footprint small.

Re: Same code, same data, different execution times?
by Abigail-II (Bishop) on Mar 19, 2003 at 23:09 UTC
    5 minutes for 20k lines? What is the process doing? Is it really running on the CPU then? Is it waiting for I/O? Do you have a tool like HP's glance available to inspect the running process? What's the breakdown in wallclock, system time, and user time for the 5-second and 5-minute runs?

    Obviously it's not a well-known feature that Perl sometimes runs 60 times slower than other times on the same input. So, before getting a useful answer, you need to supply more data.
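    Something along these lines would show that breakdown from inside the script itself (a minimal sketch; the variable names are just illustrative):

    use Time::HiRes qw(time);

    my $wall0 = time();                      # sub-second wallclock
    my ($user0, $sys0) = (times())[0, 1];    # CPU seconds so far

    # ... run the parsing block here ...

    my $wall1 = time();
    my ($user1, $sys1) = (times())[0, 1];

    printf "wallclock: %.2fs  user: %.2fs  system: %.2fs\n",
        $wall1 - $wall0, $user1 - $user0, $sys1 - $sys0;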

    Abigail

Re: Same code, same data, different execution times?
by graff (Chancellor) on Mar 20, 2003 at 02:42 UTC
    This would be the simpler, less memory-intensive way:
    # Scan for source/dest pairs and count hits on each dest:
    $ipregex = qr/(?:\d{1,3}\.){3}\d{1,3}/;

    open(CAP, "$file");
    while(<CAP>) {
        if ( /\s+($ipregex)\s+->($ipregex)\s+/ ) {
            $dest{$2}++;
        }
    }
    close CAP;
    (I think your original post had a problem in the second part of your regex match. I took the easier way on that too.)
      You need the "/o" option on the end of that regex or it is going to recompile it every time it is executed because of the variable inside of it.
        Ooops my bad. Missed you using the qr// construct. Apologies
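        For anyone following along, a rough illustration of the difference (the sample lines are made up):

        # qr// compiles the pattern object once, up front, so
        # interpolating $ipregex into a match later costs no recompile
        my $ipregex  = qr/(?:\d{1,3}\.){3}\d{1,3}/;

        # an ordinary string pattern may be recompiled on each match
        # unless /o pins it to the first compilation
        my $ipstring = '(?:\d{1,3}\.){3}\d{1,3}';

        for my $line ('10.0.0.1 -> 10.0.0.2', '172.16.0.9 -> 10.0.0.2') {
            print "qr match: $1\n" if $line =~ /($ipregex)/;
            print "/o match: $1\n" if $line =~ /($ipstring)/o;
        }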
Re: Same code, same data, different execution times?
by tekkie (Beadle) on Mar 20, 2003 at 12:52 UTC
    First off, let me say many thanks to everyone who gave me feedback on this problem. With just six replies I've gained not only a solution to my problem, but also jumped a couple of steps forward in my benchmarking, debugging, and coding in general.

    Now the specifics:

    1. graff was absolutely correct: the second half of the source/dest regexp should have been ((?:\d{1,3}\.){3}\d{1,3}). All things considered, though, I'll be storing smaller chunks of regexps in variables the way graff shows (read as: I should have been doing that all along); a sketch of the corrected scan is at the end of this reply.

    2. It seems the root of all my troubles was the memory-intensive file slurp. The process was originally designed to run on LWP::UserAgent->content returns, but was later modified to work on the local filesystem instead, and I never changed the structure to read line by line. I was uselessly throwing away resources, and that turned out to be the cause of the erratic run times.
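    For the record, a rough sketch of point 1: the corrected scan over the slurped scalar, with the IP pattern stored in a qr// variable as graff suggested (the final code went line by line instead):

    my $ip = qr/(?:\d{1,3}\.){3}\d{1,3}/;
    my %dest;

    # source IP is matched but not captured; dest IP lands in $1
    while ($cap =~ /\s+$ip\s+->($ip)\s+/g) {
        $dest{$1}++;
    }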
Re: Same code, same data, different execution times?
by dga (Hermit) on Mar 19, 2003 at 22:17 UTC

    One thing I did notice: instead of using a while loop to build the scalar, you could do this:

    {
        local $/ = undef;
        $cap = <CAP>;
    }

    The declarations, open, and close remain the same. The {} scopes the alteration of $/ so that it only affects the assignment to $cap.
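    In context it would look something like this (a minimal sketch; the die on a failed open is just defensive and wasn't in the original):

    my $cap;
    open(CAP, "$file") or die "can't open $file: $!";
    {
        local $/ = undef;    # slurp mode, only inside this block
        $cap = <CAP>;        # whole file in one read
    }
    close CAP;
    # $/ is back to "\n" here, so later reads are line-by-line again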

Re: Same code, same data, different execution times?
by dga (Hermit) on Mar 19, 2003 at 22:20 UTC

    On the raw speed and temporal reliability point: could you save the output into a summary file, then stat the summary file's modification time against the raw data file's? If the summary is newer, display it and skip the calculations entirely; only recompute when the data has been updated.
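    A rough sketch of that idea (the summary file name and error messages are made up for illustration, and the parsing loop itself is elided):

    my %dest;
    my $summary_file = "$file.summary";

    if (-e $summary_file && (stat($summary_file))[9] > (stat($file))[9]) {
        # summary is newer than the raw data: just replay it
        open(SUM, $summary_file) or die "can't read $summary_file: $!";
        print while <SUM>;
        close SUM;
    }
    else {
        # data has changed: recompute, display, and cache a fresh summary
        # ... run the parsing loop here to fill %dest ...
        open(SUM, ">$summary_file") or die "can't write $summary_file: $!";
        foreach my $dest_addr (sort { $dest{$b} <=> $dest{$a} } keys %dest) {
            print SUM "$dest_addr\t$dest{$dest_addr}\n";
            print     "$dest_addr\t$dest{$dest_addr}\n";
        }
        close SUM;
    }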

Re: Same code, same data, different execution times?
by traveler (Parson) on Mar 19, 2003 at 23:29 UTC
    It sounds as though you are comparing wall clock run times. Check the load at the time the program is run. I have seen similar behavior lots of times. Is it possible that the data is in cache sometimes and other times has to be read from disk?

    Benchmarking is tricky. Your example does a good job of showing how factors other than code structure can impact performance.

    HTH, --traveler