in reply to Re^2: fast greedy regex
in thread fast greedy regex

Ouch. Perhaps all of your log lines are perfectly formatted, but I would still recommend against doing that, just in case a line ever fails to match. At least anchor the expression.

The problem is that because you are using * everywhere, every piece of the pattern is allowed to match nothing, so there are an exponential number of ways for the engine to try, and fail, to match a bad line. Perhaps Perl is clever enough to avoid it, but it seems to me that if you hit a single malformed line, that expression could hang.

I'd recommend avoiding the issue by using + pretty much everywhere you have a *, and anchoring the ends with ^ and $. Also, as someone else mentioned, it would be better to get rid of all those printf's and replace them with:

print <<"END_FORMAT";
date time = $1
time taken = $2
c-ip = $3
. . .
END_FORMAT
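To make the anchoring advice concrete, here is a minimal sketch that combines an anchored pattern using + with the heredoc output. The three-field sample line and the format_line name are made up for illustration; the real pattern would carry on through the remaining fields.

```perl
use strict;
use warnings;

# Hypothetical shortened record; the real log lines have many more fields.
my $line = "2004-06-08 12:00:01 35 10.0.0.1";

# format_line is an illustrative helper name, not part of the original script.
sub format_line {
    my ($line) = @_;

    # Anchored with ^ and $, and + instead of *, so every field must be
    # non-empty and a malformed line fails quickly.
    if ( $line =~ /^(\S+\s+\S+)\s+(\S+)\s+(\S+)$/ ) {
        return <<"END_FORMAT";
date time = $1
time taken = $2
c-ip = $3
END_FORMAT
    }
    return;    # no match: nothing to print for this line
}

my $out = format_line($line);
print defined $out ? $out : "skipped malformed line\n";
```

Checking the result of the match before using $1 and friends also keeps captures left over from a previous line from being printed by mistake.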

It also appears that you'd be better off doing a little extra work so that you can use split instead of a regex:

my ($user_agent) = /\"(.*)\"/;
s/\".*\"//;
my @F = split(/\s+/);
print <<"END_FORMAT";
date time = $F[0] $F[1]
time taken = $F[2]
. . .
cs(Content-Type) = $F[15]
cs(User-Agent) = $user_agent
sc-filter-result = $F[16]
. . .
END_FORMAT
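For anyone who wants to try the split approach in isolation, here is a self-contained sketch. The sample line is invented (it only follows the field order discussed in this thread) and parse_line is a hypothetical helper name, so treat it as an illustration rather than a drop-in replacement.

```perl
use strict;
use warnings;

# Invented sample line in the field order used above; not from the real log.
my $sample = '2004-06-08 12:00:01 35 10.0.0.1 200 TCP_HIT 1024 512 GET http '
           . 'example.com /index.html - DIRECT example.com text/html '
           . '"Mozilla/4.0 (compatible; MSIE 6.0)" PROXIED none - 10.0.0.2 SG-HTTP-Service';

# parse_line is a hypothetical helper name.
sub parse_line {
    my ($line) = @_;

    # Pull the quoted User-Agent out first; undef if the line has no quotes.
    my ($user_agent) = $line =~ /"(.*)"/;

    # Remove it so that the remaining fields contain no embedded spaces.
    $line =~ s/".*"//;

    my @F = split /\s+/, $line;
    return ( $user_agent, @F );
}

my ( $user_agent, @F ) = parse_line($sample);
print <<"END_FORMAT";
date time = $F[0] $F[1]
time taken = $F[2]
cs(Content-Type) = $F[15]
cs(User-Agent) = $user_agent
sc-filter-result = $F[16]
END_FORMAT
```

Because the quoted field is removed before splitting, the indexes after it shift down by one, which is why sc-filter-result ends up in $F[16].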
Alternatively, you could try
my @F = /(\" .*? \" | \S+)/gx;
(but remember to strip the quotes off the relevant item.)
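A small sketch of that single-pattern version, including the clean-up of the quoted item. The tokenize name and the short sample string are made up for illustration.

```perl
use strict;
use warnings;

# tokenize is a hypothetical helper name.
sub tokenize {
    my ($line) = @_;

    # Each token is either a double-quoted string or a run of non-whitespace.
    # With /x the spaces inside the pattern are ignored.
    my @F = $line =~ /(" .*? " | \S+)/gx;

    # The quoted token still carries its quote characters, so remove them.
    s/^"(.*)"$/$1/ for @F;

    return @F;
}

my @F = tokenize('GET text/html "Mozilla/4.0 (compatible)" PROXIED');
print "$_\n" for @F;
```

With this version the quoted field stays in its original position in @F, so the later indexes are one higher than in the split version.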

Re^4: fast greedy regex
by js1 (Monk) on Jun 08, 2004 at 20:49 UTC

    Many thanks for all the interest and help here. All the replies were useful.

    I really liked these constructs:

    s/\".*\"//;
    my @F = split(/\s+/);

    and

    my @F = /(\" .*? \" | \S+)/gx

    But I found the quickest solution was this:

    while (<>) {
        # Cut everything before the quoted User-Agent out of $_ (4-argument substr).
        $front = substr( $_, 0, index( $_, '"' ) - 1, "" );
        # Everything after the closing quote and the space that follows it.
        $back = substr( $_, rindex( $_, '"' ) + 2 );
        # What sits in between: the User-Agent, quotes included.
        $user_agent = substr( $_, 1, rindex( $_, '"' ) );

        $front =~ /^([^#\s]+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)$/;
        print <<END_FORMAT;
    date time = $1
    time taken = $2
    c-ip = $3
    sc-status = $4
    ns-action = $5
    sc-bytes = $6
    cs-bytes = $7
    cs-method = $8
    cs-uri-scheme = $9
    cs-host = $10
    cs-uri-stem = $11
    cs-username = $12
    s-hierarchy = $13
    s-supplier-name = $14
    cs(Content-Type) = $15
    cs(User-Agent) = $user_agent
    END_FORMAT

        $back =~ /^(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*$/;
        print <<END_FORMAT2;
    sc-filter-result = $1
    sc-filter-category = $2
    x-virus-id = $3
    s-ip = $4
    s-sitename = $5
    END_FORMAT2
    }

    This processed the following gzip'd log:

    bash-2.05b$ ls -l SG*
    -rwxr-xr-x 1 js js 106236830 Mar 7 17:02 SG_CSGL02_main_470302220000.log.gz

    in 1 minute 32 sec

    bash-2.05b$ time gzip -dc SG* | ./test.pl >/dev/null
    6.96user 0.63system 1:32.95elapsed 8%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (94major+33minor)pagefaults 0swaps

    on a 2.6GHz AMD processor (500MB).