
I just finished writing a regular expression that parses an Apache log file in the "Transfer Log" format. This format has 9 fields per line. If one of the fields is empty, Apache puts a hyphen (- or "-") in its place. Here are some sample Apache log file entries:

208.168.76.195 - - [27/Jun/2001:08:04:53 -0400] "GET /core.css HTTP/1.1" 304 - "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
217.50.206.61 - - [27/Jun/2001:08:19:54 -0400] "GET / HTTP/1.1" 200 2180 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
217.50.206.61 - - [27/Jun/2001:08:19:55 -0400] "GET /core.css HTTP/1.1" 200 146 "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
217.50.206.61 - - [27/Jun/2001:08:20:00 -0400] "GET /images/morningalttext.gif HTTP/1.1" 200 1424 "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
217.50.206.61 - - [27/Jun/2001:08:20:00 -0400] "GET /images/logosun.jpg HTTP/1.1" 200 32232 "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"

This should be enough info for you to write an RE to parse each of the 9 fields. The speed of the RE is very important because the file can be very large. If you are a beginner, try your hand and get some practice. If you are an expert, please feel free to show me and others how you would approach this problem. I'm curious to see if/how you do it differently. Here's how I did it:

    m/^([^\s]+)     # 1st field
    \s
    ([^\s]+)        # 2nd
    \s
    ([^\s]+)        # 3rd
    \s
    \[([^\]]+)\]    # 4th
    \s
    "([^"]+)"        # 5th
    \s
    ([^\s]+)        # etc.
    \s
    ([^\s]+)
    \s
    ([^\s]+)
    \s
    "([^"]+)"
    /x;

or, on one line without the /x modifier:

m/^([^\s]+)\s([^\s]+)\s([^\s]+)\s\[([^\]]+)\]\s"([^"]+)"\s([^\s]+)\s([^\s]+)\s([^\s]+)\s"([^"]+)"/;

Note: if you have any doubts, the above RE was tested on an 18725-line file and successfully parsed all lines.
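For anyone who wants to try the RE on the sample lines, here is a minimal sketch of one way it might be used. The field names in `@FIELD_NAMES` and the `parse_line` helper are my own illustrative choices, not part of the original RE, and turning a bare `-` into `undef` is an assumption about how you would want empty fields represented. Note that with the pattern as written, the 8th capture keeps its surrounding double quotes, so a referer of `"-"` is not treated as empty here.

```perl
use strict;
use warnings;

# The pattern from the post, compiled once with qr// and reused for every line.
my $LOG_RE = qr/^([^\s]+)     # 1st field
    \s
    ([^\s]+)        # 2nd
    \s
    ([^\s]+)        # 3rd
    \s
    \[([^\]]+)\]    # 4th
    \s
    "([^"]+)"       # 5th
    \s
    ([^\s]+)        # 6th
    \s
    ([^\s]+)        # 7th
    \s
    ([^\s]+)        # 8th (captured with its quotes)
    \s
    "([^"]+)"       # 9th
    /x;

# Hypothetical names for the 9 captures, chosen for illustration only.
my @FIELD_NAMES = qw(host ident user time request status bytes referer agent);

# Returns a hash reference of the 9 fields, or nothing if the line does not match.
sub parse_line {
    my ($line) = @_;
    my @values = $line =~ $LOG_RE or return;
    my %rec;
    # Assumption: a bare hyphen means "empty", so store undef instead.
    @rec{@FIELD_NAMES} = map { $_ eq '-' ? undef : $_ } @values;
    return \%rec;
}

my $sample = '217.50.206.61 - - [27/Jun/2001:08:19:54 -0400] "GET / HTTP/1.1" 200 2180 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"';
my $rec = parse_line($sample);
print "$rec->{host} requested '$rec->{request}' with status $rec->{status}\n" if $rec;
```

For a large file you would call `parse_line` inside a `while (my $line = <$fh>)` loop, so only one line is held in memory at a time.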

I know there is a module that parses Apache files. I also know this is a somewhat basic RE, but it still proved fairly difficult for someone who has only written a few long REs, so I think this can be pretty instructive for me and other new guys to Perl. Perhaps the one considered the best should be added to the Snippets section if it's not already there.
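Since speed matters here, one way to judge competing REs is the core Benchmark module. The sketch below is only an illustration: candidate B (`\S+` in place of `[^\s]+`) is a hypothetical alternative I made up for comparison, and I make no claim about which one comes out faster on your Perl. It first checks that both candidates capture the same fields, then times them on a single sample line.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = '217.50.206.61 - - [27/Jun/2001:08:20:00 -0400] "GET /images/logosun.jpg HTTP/1.1" 200 32232 "http://www.progressivevalley.com/" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"';

# Candidate A: the pattern from the post.
my $re_post = qr/^([^\s]+)\s([^\s]+)\s([^\s]+)\s\[([^\]]+)\]\s"([^"]+)"\s([^\s]+)\s([^\s]+)\s([^\s]+)\s"([^"]+)"/;

# Candidate B (hypothetical alternative): same structure, \S+ instead of [^\s]+.
my $re_alt = qr/^(\S+)\s(\S+)\s(\S+)\s\[([^\]]+)\]\s"([^"]+)"\s(\S+)\s(\S+)\s(\S+)\s"([^"]+)"/;

# Timing only means something if both candidates capture the same 9 fields.
my @fields_post = $line =~ $re_post;
my @fields_alt  = $line =~ $re_alt;
die "expected 9 fields\n"    unless @fields_post == 9;
die "candidates disagree\n" unless "@fields_post" eq "@fields_alt";

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    post => sub { my @f = $line =~ $re_post },
    alt  => sub { my @f = $line =~ $re_alt },
});
```

A single line is a crude test. Timing a full pass over a real log file would give a more honest picture.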

$PM = "Perl Monk's";
$MCF = "Most Clueless ~~Friar~~ Abbot";
$nysus = $PM . $MCF;

In reply to Somewhat basic but long, practical RE problem by nysus
