tekniko has asked for the wisdom of the Perl Monks concerning the following question:

This program reads IIS logs and converts them to NCSA format so that they can be processed by a statistics program. Again, I would like this done in the most efficient manner possible, so any suggestions to improve efficiency are more than welcome.

#!/usr/bin/perl
# Microsoft IIS Log Format
# fields are separated by commas
# a hyphen '-' serves as placeholder if no valid data present
#
my $user_ip_addr;
my $username;              # username of user RFC931 ?
my $date;                  # MM/DD/YY
my $time;                  # H:MM:SS
my $ms_service;            # such as W3SVC1
my $server_hostname;       # such as NTPUB1
my $server_ip_addr;
my $time_elapsed;          # in seconds
my $bytes_received;
my $bytes_sent;
my $service_status_code;   # HTTP code
my $mswin_status_code;     # MS Windows NT status code
my $op_name;               # such as GET, POST, HEAD, PUT
my $op_target;             # such as index.html
my $junk;
my $date_year;
my $date_mon;
my $date_dd;
my @months = qw( NUL Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
my $line;
my $utc_offset = '-0500';
my $century = '20';

while (defined ($line = <STDIN>)) {
    chomp $line;
    $line =~ tr/\000//d;
    $line =~ s# "HTTP/1\.# HTTP/1.#;
    next if $line eq '';
    (
        $user_ip_addr, $username, $date, $time, $ms_service,
        $server_hostname, $server_ip_addr, $time_elapsed,
        $bytes_received, $bytes_sent, $service_status_code,
        $mswin_status_code, $op_name, $op_target, $junk
    ) = split(/, /, $line, 15);
    next if $op_target eq '' or ! defined $op_target;
    ($date_mon, $date_dd, $date_year) = split(m#/#, $date, 3);
    $century = ( $date_year < 90 ) ? '20' : '19';
    $date_mon = sprintf("%02s",$date_mon);
    $date_dd = sprintf("%02s",$date_dd);
    $date_mon = $months[$date_mon];
    # $time =~ s/([0-9]:[0-9][0-9]:[0-9][0-9])/0$1/;
    $time = sprintf "%08s", $time;
    #NCSA combined log format:
    #$remote_host $remote_logname $remote_user $time_commonlog "$request" $status $bytes_sent "$http_referer" "$http_user_agent"
    print "$user_ip_addr $username - [$date_dd/$date_mon/$century$date_year:$time $utc_offset] \"$op_name $op_target\" $service_status_code $bytes_sent \"-\" \"-\"\n";
    # print join(' ',
    #     $user_ip_addr,
    #     $username,
    #     '-',
    #     join('', '[', $date_dd, '/', $date_mon, '/', $century,
    #          $date_year, ':', $time, ' ', $utc_offset, ']'),
    #     join('', '"', $op_name, ' ', $op_target, '"'),
    #     $service_status_code,
    #     $bytes_sent,
    #     '"-"', # HTTP_REFERER
    #     '"-"'  # HTTP_USER_AGENT
    # );
    # print "\n";
}

Replies are listed 'Best First'.
Re: Efficiency revisited (a caveat)
by dws (Chancellor) on Dec 26, 2000 at 23:15 UTC
    This approach has one large pitfall, which you are only guaranteed to sidestep if you control the web server. I mention it here as a caveat to whoever might pick up your code and then wonder why it fails at inopportune times.

    The set of fields in an IIS logfile is adjustable via the IIS control panel. When IIS starts a new logfile, it prints the names of the fields that it will log in a #Fields: comment. IIS lets you change this set of fields in mid-log, and then emits a new #Fields: comment into the log in mid-stream. Boom. There goes your script.

    Does this ever happen in practice? It has to me. I've seen our network or support folks change the set of fields logged while they're diagnosing connectivity problems. Even if they're aware that it's going to cause analysis problems, that's easy to overlook in the heat of battle, or write off as an SEP (Someone Else's Problem).

    The code snippet that demonstrates how to cope with this is here, posted by yours truly as an early contribution before I registered at the Monastery. In brief, it introduces a level of indirection by way of a hash. It'll probably introduce a small-but-measurable performance hit, which you'll need to weigh against the risk of your script failing at the worst possible moment.
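
A minimal sketch of the hash indirection described above. This is not the snippet dws links to. It assumes W3C extended-format lines with space-separated values, and the sub name parse_log_lines and the sample log lines are made up for illustration.

```perl
use strict;
use warnings;

# Parse log lines, remapping columns whenever a "#Fields:" directive appears.
# Returns a list of hash refs, one per data line, keyed by field name.
# (Hypothetical helper; space-separated W3C extended format is assumed.)
sub parse_log_lines {
    my @lines = @_;
    my @fields;     # current column names, replaced at each #Fields: line
    my @records;

    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^#Fields:\s*(.*)/) {
            @fields = split ' ', $1;    # new column layout from here on
            next;
        }
        next if $line =~ /^#/ or $line eq '';   # other directives, blank lines
        next unless @fields;                    # no layout seen yet

        my %rec;
        @rec{@fields} = split ' ', $line;       # hash slice: name => value
        push @records, \%rec;
    }
    return @records;
}

# Made-up sample: the field order changes in mid-log.
my @records = parse_log_lines(
    "#Fields: c-ip cs-method cs-uri-stem sc-status",
    "10.0.0.1 GET /index.html 200",
    "#Fields: c-ip sc-status cs-method cs-uri-stem",
    "10.0.0.2 404 GET /missing.html",
);
print "$_->{'c-ip'} $_->{'cs-method'} $_->{'cs-uri-stem'} $_->{'sc-status'}\n"
    for @records;
```

Because every value is looked up by name, a new #Fields: line only changes the mapping, not the code that builds the output line. The cost is one hash-slice assignment per line, which is the performance hit mentioned above.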

      Ah...Yeah, I guess that I should have checked the archives first. Point taken.
Re: Efficiency revisited
by chipmunk (Parson) on Dec 26, 2000 at 23:23 UTC
    Two coding comments...

    The inclusion of $junk on the LHS of the split assignment is unnecessary. Take it out, remove the third argument to split on the RHS, and split will automatically split the string into no more than 15 pieces (one more than the number of lvalues on the left) and discard the extra piece.
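
A small illustration of that implicit limit, using a made-up five-field string rather than a real log line:

```perl
use strict;
use warnings;

my $line = 'a, b, c, d, e';   # made-up sample data

# Two lvalues, no LIMIT: Perl splits as if LIMIT were 3 and drops the rest.
my ($first, $second) = split /, /, $line;

# Explicit LIMIT of 3 with a throwaway variable, as in the original script.
my ($x, $y, $junk) = split /, /, $line, 3;

print "$first $second\n";    # prints "a b"
print "$x $y [$junk]\n";     # prints "a b [c, d, e]"
```

Both forms stop splitting after the fields that are actually wanted; the first simply does not need the extra variable.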

    next if $op_target eq '' or ! defined $op_target;
    That condition is backwards; if $op_target is not defined, it will always be equal to ''. To avoid a warning, the defined test should come first.
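
A sketch of the reordered test, wrapped in a hypothetical helper named is_blank so it can be tried on its own:

```perl
use strict;
use warnings;

# Hypothetical helper: true when a field is missing or empty.
sub is_blank {
    my ($value) = @_;
    # Check defined-ness first so that eq never sees undef,
    # which would trigger an "uninitialized value" warning.
    return !defined $value || $value eq '';
}
```

In the original loop this would read `next if is_blank($op_target);`, or, without the helper, `next if ! defined $op_target or $op_target eq '';`.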


    And one efficiency comment...

    You may get a small measure of improved efficiency by combining the calls to sprintf with the call to print. For example:

    printf "$user_ip_addr $username - " .
           "[%02d/$months[$date_mon]/$century%02d:%08s $utc_offset] " .
           qq{"$op_name $op_target" $service_status_code $bytes_sent "-" "-"\n},
           $date_dd, $date_year, $time;
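
One variation on this idea, offered as a sketch rather than as chipmunk's code: pass every value as an argument instead of interpolating it into the format string. Anything interpolated into a printf format is itself scanned for % conversions, and request targets often contain sequences such as %20. The sub name format_ncsa, its argument order, and the sample values are assumptions made for illustration.

```perl
use strict;
use warnings;

my @months = qw( NUL Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );

# Hypothetical helper: build one NCSA combined-format line from IIS fields.
sub format_ncsa {
    my ($ip, $user, $date, $time, $utc_offset,
        $op, $target, $status, $bytes) = @_;

    my ($mon, $dd, $yy) = split m{/}, $date, 3;    # MM/DD/YY
    my $century = $yy < 90 ? '20' : '19';          # same rule as the original

    # All data goes in as arguments, so a '%' in $target stays literal.
    return sprintf
        qq{%s %s - [%02d/%s/%s%02d:%08s %s] "%s %s" %s %s "-" "-"\n},
        $ip, $user, $dd, $months[$mon], $century, $yy, $time, $utc_offset,
        $op, $target, $status, $bytes;
}

# Made-up sample values.
print format_ncsa('10.0.0.1', '-', '12/26/00', '9:15:02', '-0500',
                  'GET', '/a%20b.html', '200', '1024');
```

The time is formatted with %08s, as in the original script, because it is a string such as 9:15:02 rather than a number.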
Re: Efficiency revisited
by ichimunki (Priest) on Dec 26, 2000 at 23:02 UTC
    If all the handling of these variables is going to stay within the scope of the while loop, why not just declare them as part of the split list assignment (my ($x, $y, $z) = split(/foo/, $bar);)? If they are going to be external to the loop, you might look at using a list of hashes, so that you can refer to any element by using $log[line#]{'element name'}. Just my two cents.
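
A rough sketch of the list-of-hashes idea. The field names are taken from the original script's variable names; the sub name lines_to_records and the sample line are made up for illustration.

```perl
use strict;
use warnings;

# Field names in the order the original script expects them.
my @field_names = qw(
    user_ip_addr username date time ms_service server_hostname
    server_ip_addr time_elapsed bytes_received bytes_sent
    service_status_code mswin_status_code op_name op_target
);

# Hypothetical helper: turn log lines into a list of hashes, one per line.
sub lines_to_records {
    my @log;
    for my $line (@_) {
        my %entry;
        # The limit leaves room for one trailing piece, which the
        # hash-slice assignment silently discards.
        @entry{@field_names} = split /, /, $line, @field_names + 1;
        push @log, \%entry;
    }
    return @log;
}

# Made-up sample line in the comma-separated IIS layout.
my @log = lines_to_records(
    '10.0.0.1, -, 12/26/00, 9:15:02, W3SVC1, NTPUB1, 10.0.0.9, '
  . '0, 120, 1024, 200, 0, GET, /index.html, -,',
);
print "$log[0]{op_name} $log[0]{op_target}\n";
```

Holding every line in memory is only sensible if the data really is needed outside the loop; for a straight line-by-line conversion, lexicals declared in the split assignment avoid that overhead.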
Re: Efficiency revisited
by ChOas (Curate) on Dec 27, 2000 at 12:39 UTC
    Please, PLEASE don't get pissed off at me, but about efficiency: I think IIS has an option to output logs in NCSA format...
      It does, indeed, but changing the configuration on all of our production IIS web servers is not an option at this point. Until I have completed converting all existing log files for several thousand domains, we will have to continue this method. New domains log in NCSA format.