in reply to splitting a string that appears inconsistently in structure

Based on looking at Apache access log files for several years, I believe that we can rely on the following to be true (assuming we are using the default log format):
1. The method is always present.
2. The request URI is always present, and may or may not contain query params, but will never contain spaces.
3. The version may not be present.
So, I propose that you split off the method+uri, treat the remainder as version and use the URI::Split module to break apart the URI:
use URI::Split qw(uri_split);    # uri_split is only exported on request
sub split_request
{
    my @parts=split(/ /,$_[0]);
    scalar(@parts)>=2 or die "Bad request '$_[0]'";
    my $method=shift @parts;
    my $uri=shift @parts;
    my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
    my $protover=join(' ',@parts);
    return ($method,$scheme,$auth,$path,$query,$frag,$protover);
}

Re^2: splitting a string that appears inconsistently in structure
by TheGorf (Novice) on Jan 02, 2009 at 06:41 UTC
    Unfortunately, I don't find all of those to be true. For reference, here are some examples of entries that I have:

    62.88.40.141 - - [09/Jan/2008:03:45:10 -0800] "GET /core_level.cgi?core=1 HTTP/1.1" 302 83 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)" "62.88.40.141"
    (call this a normal-ish request)

    10.16.0.2 - - [09/Jan/2008:02:20:39 -0800] "GET /home/eval_load.cgi?50" 200 2 "-" "-" "10.16.0.2"
    (no version)

    10.16.1.3 - - [10/Jan/2008:02:18:58 -0800] "GET /" 200 752 "-" "-" "10.16.1.3"
    (no version, no ?, and nothing after the ?)

    10.16.0.2 - - [19/Jan/2008:03:45:06 -0800] "GGG99994" 200 752 "-" "-" "10.16.0.2"
    (here we have no method, no discernible request, and no version)

    Hence the need to figure out how to detect what is there.
      10.16.0.2 - - [19/Jan/2008:03:45:06 -0800] "GGG99994" 200 752 "-" "-" "10.16.0.2"
      (here we have no method, no discernible request, and no version)

      Not exactly true. You have a method. It's just a really weird (and probably invalid) one. I'm not sure why your server would 200 it; I can only presume some slightly odd config.

      Take it in individual steps. First try splitting out into the 3 main pieces:

      my ($method, $uri, $proto, @extra) = split /\s+/, $request;
      die "Unexpected extra bits in request: @extra" if @extra > 0;
      die "No method" unless defined $method;
      # Or whatever other error-handling mechanism you want

      You shouldn't have any extra bits, because if you do, your $method, $uri, and $proto may not hold what you expect them to, so that needs error-checking.

      As well, you should have a method. The minimal possible HTTP request AFAIK would be a method of " ", with nothing else. That would leave all the vars undefined, and probably isn't something you care about anyway, so another error there.

      The protocol may not be there. But expect that in higher level code, or defined-or it to an empty string here if you prefer.

      That leaves the URI. Using URI::Split as suggested above in Re: splitting a string that appears inconsistently in structure would be better than trying to split it up manually. Imagine, for instance, the case of having a '?' in the password; a simple regexp would give you a wrong answer then.

      Note that the $uri can be undefined. A request of just "GET " is interpreted as "GET /" (similarly with POST), and would leave $uri undefined after that split. You probably want to make sure it's defined (as an empty string in this case) before you pass it to uri_split(). The URI::Split docs say:

      The $path part is always present (but can be the empty string) and is thus never returned as "undef".

      So take care not to blow up if it's empty.
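
Putting the steps above together, a minimal sketch might look like this. The name parse_request and the keys of the returned hash are made up for illustration. It assumes the URI distribution (which ships URI::Split) is installed and that you have Perl 5.10 or later for the // operator. It also uses split ' ' rather than /\s+/ so that leading whitespace is ignored.

```perl
use strict;
use warnings;
use 5.010;                       # for the defined-or operator (//)
use URI::Split qw(uri_split);    # uri_split is exported only on request

# Hypothetical helper: takes the quoted request field from a log line,
# e.g. 'GET /core_level.cgi?core=1 HTTP/1.1'.
sub parse_request {
    my ($request) = @_;

    # split ' ' behaves like /\s+/ but also skips leading whitespace.
    my ($method, $uri, $proto, @extra) = split ' ', $request;
    die "Unexpected extra bits in request: @extra\n" if @extra;
    die "No method in request '$request'\n" unless defined $method;

    $uri   //= '';    # a bare "GET " leaves $uri undefined
    $proto //= '';    # the protocol version is optional

    my ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
    return {
        method => $method,
        path   => $path,     # never undef, but may be ''
        query  => $query,    # undef when there is no '?'
        proto  => $proto,
    };
}

my $r = parse_request('GET /home/eval_load.cgi?50');
printf "%s %s query=%s proto='%s'\n",
    $r->{method}, $r->{path}, $r->{query} // '(none)', $r->{proto};
```

Callers still have to decide what an empty path or an empty protocol means for them; the sketch only guarantees that neither is undef.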

        That leaves the URI.

        For the sake of precision, by the by, it's not really a URI we've got here, it's just the path/query bit of it. But uri_split() does the right thing.
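
To see what that means in practice, here is a small sketch that feeds uri_split() both a path/query string as it appears in the log and a full URI. The host example.com and the helper name show are only placeholders for the comparison. With the path-only input, scheme and authority should come back as undef while path and query are filled in as usual.

```perl
use strict;
use warnings;
use URI::Split qw(uri_split);

# Print each component, making undef values visible.
sub show {
    my ($label, @parts) = @_;
    print "$label: ",
        join(', ', map { defined $_ ? "'$_'" : 'undef' } @parts), "\n";
}

# uri_split returns: scheme, authority, path, query, fragment.
show('path only', uri_split('/core_level.cgi?core=1'));
show('full URI ', uri_split('http://example.com/core_level.cgi?core=1'));
```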

      10.16.0.2 - - [19/Jan/2008:03:45:06 -0800] "GGG99994" 200 752 "-" "-" "10.16.0.2"

      Is that a real line from the log? The status code is 200 OK, so it looks like your Apache handled this request successfully, though it shouldn't have.

      I think it may be a good idea to handle malformed requests separately.
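
One way to do that, as a rough sketch: classify each request string first and only pass the well-formed ones on to the real parser. The function name classify_request, the list of accepted methods, and the protocol pattern are all assumptions here; adjust them to whatever your server actually sees.

```perl
use strict;
use warnings;

# Assumed list of acceptable methods; extend it as needed.
my %known_method = map { $_ => 1 }
    qw(GET HEAD POST PUT DELETE OPTIONS TRACE CONNECT);

# Hypothetical classifier: returns 'ok' or a short reason for rejection.
sub classify_request {
    my ($request) = @_;
    my ($method, $uri, $proto, @extra) = split ' ', $request;
    return 'empty'          unless defined $method;
    return 'unknown method' unless $known_method{$method};
    return 'extra fields'   if @extra;
    # A missing protocol is allowed; a present one must look like HTTP/x.y.
    return 'bad protocol'   if defined $proto && $proto !~ m{^HTTP/\d+\.\d+$};
    return 'ok';
}

# Sort the sample requests from the thread into good and bad buckets.
my (@good, %bad);
for my $request ('GET /core_level.cgi?core=1 HTTP/1.1', 'GET /', 'GGG99994') {
    my $verdict = classify_request($request);
    if ($verdict eq 'ok') { push @good, $request }
    else                  { push @{ $bad{$verdict} }, $request }
}

print "good: $_\n" for @good;
for my $reason (sort keys %bad) {
    print "bad ($reason): $_\n" for @{ $bad{$reason} };
}
```

Keeping the rejects grouped by reason makes it easy to eyeball them later and decide whether a category deserves its own handling.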

      What's the format string for that log?