comment on

By this point, you'd think the topic of web server logfile parsing would be completely mined out, with all of the rough edges filed off, if not pounded completely flat. Something that showed up recently in my logfiles, coupled with what I've seen recently in print suggests that the topic still has some unmapped pitfalls.

Here's the familiar drill: To parse a "common log format" file (assuming you're doing it yourself), the conventional wisdom says to write:

  while (<>) {
    my ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) = 
    /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+\] "(\S+) (.+?)
+ (\S+)" (\S+) (\S+)$/;
    ...
[download]

Yawn. Been there, done that, right?

Maybe not. Let's take a second look at $auth_user. Unless you're using basic authentication to password protect pages, you'll see this in your logs as '-'. No problem there. And if you are using basic authentication, you'll see a username. No problem there... unless the username cannot contain whitespace, at which point the regexp fails to match. And since there's no check to see if it fails...

But can a username contain whitespace? Let's see.

  % htpasswd .htpasswd 'd w s'
  New password:
  Re-type new password:
  Adding password for user d w s
  %
[download]

D'oh! RFC1945 says you aren't supposed be able to do this! (Update: RFC1945 is obsolete. RPF2617 suggests that embedded spaces are OK. Hm...)

The simple solution would seem to be "So, don't do that!", but here's where things get stranger. I've recently seen a case where somebody apparently presented an Authentication: header to a non-protected resource on my site, resulting in a bogus name appearing in the logs. I say "apparently" because I've been able to duplicate the behavior, and I can't think of any other way for the bogus username to have appeared. A minor annoyance, or the basis for a crude denial-of-accurate-service attack against log analysis software.

Fortunately, the solution is straightforward. All you have to do is change /^(\S+) (\S+) (\S+) \[ ... to /^(\S+) (\S+) (.+) \[ ... This makes the regex much less efficient, since it's going to backtrack to match '[', but it will resolve the problem, even if someone forges the username "dws [".

In reply to A rare, insidious logfile parsing pitfall by dws

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.