Re^2: fast greedy regex

Just to give an idea of what I'm doing, this is a line from the log:

2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service

And my regex is like this:

while (<>){
        /(\S*\s*\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\".*\")\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)/;
        #print "\n";
        #printf "\ndate time    = %s",$1;
        #printf "\ntime taken    = %s",$2;
        #printf "\nc-ip        = %s",$3;
        #printf "\nsc-status    = %s",$4;
        #printf "\ns-action    = %s",$5;
        #printf "\nsc-bytes    = %s",$6;
        #printf "\ncs-bytes    = %s",$7;
        #printf "\ncs-method    = %s",$8;
        #printf "\ncs-uri-scheme    = %s",$9;
        #printf "\ncs-host    = %s",$10;
        #printf "\ncs-uri-stem    = %s",$11;
        #printf "\ncs-username    = %s",$12;
        #printf "\ns-hierarchy    = %s",$13;
        #printf "\ns-supplier-name = %s",$14;
        #printf "\ncs(Content-Type)= %s",$15;
        #printf "\ncs(User-Agent) = %s",$16;
        #printf "\nsc-filter-result = %s",$17;
        #printf "\nsc-filter-category = %s",$18;
        #printf "\nx-virus-id     = %s",$19;
        #printf "\ns-ip        = %s",$20;
        #printf "\ns-sitename    = %s",$21;

}

Can you see whether a more explicit regex would speed the parse up?

Thanks,

js1.

Re^3: fast greedy regex by sfink (Deacon) on Jun 08, 2004 at 04:41 UTC
Ouch. Perhaps all of your log lines are perfectly formatted, but I would still recommend not doing that, just in case something fails. At least anchor the expression. The problem is that because you are using * everywhere, there are an exponential number of ways for that match to fail. Perhaps Perl is clever enough to avoid it, but it seems to me that if you hit a single malformed line, that expression could hang. I'd recommend avoiding the issue by using + pretty much everywhere you have a *, and anchoring the ends with ^ and $.

Also, as someone else mentioned, it would be better to get rid of all those printf's and replace them with:

    print <<"END_FORMAT";
    date time = $1
    time taken = $2
    c-ip = $3
    .
    .
    .
    END_FORMAT

It also appears that you'd be better off doing a little extra work so that you can use `split` instead of a regex:

    my ($user_agent) = /\"(.*)\"/;
    s/\".*\"//;
    my @F = split(/\s+/);
    print <<"END_FORMAT";
    date time = $F[0] $F[1]
    time taken = $F[2]
    .
    .
    .
    cs(Content-Type) = $F[15]
    cs(User-Agent) = $user_agent
    sc-filter-result = $F[16]
    .
    .
    .
    END_FORMAT

Alternatively, you could try

    my @F = /(\" .*? \" | \S+)/gx;

(but remember to cut the parens off the relevant item.)
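A minimal sketch of the `split` suggestion, run against the sample log line quoted at the top of the thread. The helper name `parse_line` and the use of `split ' '` (which also ignores leading whitespace) are choices made for this illustration, not part of the original reply; the field positions follow the numbering given in the reply.

```perl
use strict;
use warnings;

# parse_line is a hypothetical helper name used only for this sketch.
sub parse_line {
    my ($line) = @_;
    chomp $line;

    # Capture the quoted User-Agent (without its quotes), then delete it so
    # its embedded spaces cannot confuse the split below.
    my ($user_agent) = $line =~ /"(.*)"/;
    return unless defined $user_agent;    # malformed line: no quoted field
    $line =~ s/".*"//;

    # split ' ' splits on runs of whitespace and ignores leading whitespace.
    my @F = split ' ', $line;
    return ($user_agent, @F);
}

# Sample line copied from the original question.
my $sample = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service';

my ($user_agent, @F) = parse_line($sample);
print "date time        = $F[0] $F[1]\n";
print "cs(User-Agent)   = $user_agent\n";
print "sc-filter-result = $F[16]\n";
```

Because the quoted field is removed before splitting, a line without any quoted field is rejected up front instead of being silently mis-split.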
Re^4: fast greedy regex by js1 (Monk) on Jun 08, 2004 at 20:49 UTC
Many thanks for all the interest and help here. All the replies were useful. I really liked these constructs:

    s/\".*\"//;
    my @F = split(/\s+/);

and

    my @F = /(\" .*? \" | \S+)/gx

But I found the quickest solution was this:

    while (<>){
        # Cut everything before the quoted User-Agent out of $_ (4-arg substr).
        $front = substr( $_, 0, index($_, '"' )-1, "");
        # What follows the closing quote and its trailing space.
        $back = substr( $_, rindex( $_, '"' )+2);
        # The quoted User-Agent itself, quotes included.
        $user_agent = substr ($_, 1, rindex( $_, '"' ));
        $front=~/^([^#\s]+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)$/;
        print <<END_FORMAT;
    date time = $1
    time taken = $2
    c-ip = $3
    sc-status = $4
    s-action = $5
    sc-bytes = $6
    cs-bytes = $7
    cs-method = $8
    cs-uri-scheme = $9
    cs-host = $10
    cs-uri-stem = $11
    cs-username = $12
    s-hierarchy = $13
    s-supplier-name = $14
    cs(Content-Type) = $15
    cs(User-Agent) = $user_agent
    END_FORMAT
        $back=~/^(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s$/;
        print <<END_FORMAT2
    sc-filter-result = $1
    sc-filter-category = $2
    x-virus-id = $3
    s-ip = $4
    s-sitename = $5
    END_FORMAT2
    }

This processed the following gzip'd log:

    bash-2.05b$ ls -l SG
    -rwxr-xr-x 1 js js 106236830 Mar 7 17:02 SG_CSGL02_main_470302220000.log.gz

in 1 minute 32 sec

    bash-2.05b$ time gzip -dc SG* | ./test.pl >/dev/null
    6.96user 0.63system 1:32.95elapsed 8%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (94major+33minor)pagefaults 0swaps

on a 2.6GHz AMD processor (500MB).
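Since the thread turns on which variant is quickest, here is one way the comparison could be repeated on a single line with the core `Benchmark` module. This is a sketch under assumptions: `parse_regex` and `parse_substr` are hypothetical names, the regex variant is the anchored `+` form recommended earlier, and the `substr` variant uses `split` on the two halves where the post above uses regexes. Timings will differ by machine and Perl version, so no figures are claimed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample line copied from the original question, with its trailing newline.
my $line = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service' . "\n";

# Variant A: one anchored regex using + instead of *, built from pieces.
my $pattern = join '\s+',
    '(\S+\s+\S+)',     # date and time as one field
    ('(\S+)') x 14,    # time-taken through cs(Content-Type)
    '"(.*)"',          # quoted User-Agent, quotes dropped
    ('(\S+)') x 5;     # sc-filter-result through s-sitename
my $re = qr/^$pattern\s*$/;

sub parse_regex {
    my ($l) = @_;
    return $l =~ $re;    # list of 21 captures, or empty list on failure
}

# Variant B: locate the quotes with index/rindex, then split each half.
sub parse_substr {
    my ($l) = @_;
    my $open  = index($l, '"');
    my $close = rindex($l, '"');
    return if $open < 0 || $close <= $open;    # no quoted field
    my @f     = split ' ', substr($l, 0, $open);
    my $agent = substr($l, $open + 1, $close - $open - 1);
    my @b     = split ' ', substr($l, $close + 1);
    return ("$f[0] $f[1]", @f[2 .. $#f], $agent, @b);
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex  => sub { my @r = parse_regex($line) },
    substr => sub { my @r = parse_substr($line) },
});
```

A single-line micro-benchmark ignores I/O and decompression, which the `time` output above suggests dominate the wall-clock time (8% CPU), so it is only a rough guide.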
Re^3: fast greedy regex by Roy Johnson (Monsignor) on Jun 07, 2004 at 22:51 UTC
I don't see any reason that a more explicit regex would speed it up. The only obvious speed benefit from a more explicit match is that failure could happen sooner. For speed, I would expect that one call to print, rather than multiple calls to printf, would be something of a speedup.

The PerlMonk `tr///` Advocate
Re^3: fast greedy regex by CountZero (Bishop) on Jun 08, 2004 at 06:18 UTC
One unexpected space somewhere in this string and you are toast. So don't re-invent the (broken) wheel and use a module such as Regexp::Log or Logfile or their derived classes.

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law