in reply to Re: fast greedy regex
in thread fast greedy regex

Just to give an idea of what I'm doing, this is a line from the log:

2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service

And my regex is like this:

while (<>) {
    /(\S*\s*\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\".*\")\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)/;
    #print "\n";
    #printf "\ndate time = %s",$1;
    #printf "\ntime taken = %s",$2;
    #printf "\nc-ip = %s",$3;
    #printf "\nsc-status = %s",$4;
    #printf "\ns-action = %s",$5;
    #printf "\nsc-bytes = %s",$6;
    #printf "\ncs-bytes = %s",$7;
    #printf "\ncs-method = %s",$8;
    #printf "\ncs-uri-scheme = %s",$9;
    #printf "\ncs-host = %s",$10;
    #printf "\ncs-uri-stem = %s",$11;
    #printf "\ncs-username = %s",$12;
    #printf "\ns-hierarchy = %s",$13;
    #printf "\ns-supplier-name = %s",$14;
    #printf "\ncs(Content-Type)= %s",$15;
    #printf "\ncs(User-Agent) = %s",$16;
    #printf "\nsc-filter-result = %s",$17;
    #printf "\nsc-filter-category = %s",$18;
    #printf "\nx-virus-id = %s",$19;
    #printf "\ns-ip = %s",$20;
    #printf "\ns-sitename = %s",$21;
}

Can you see whether a more explicit regex would speed the parse up?

Thanks,

js1.

Re^3: fast greedy regex
by sfink (Deacon) on Jun 08, 2004 at 04:41 UTC
    Ouch. Perhaps all of your log lines are perfectly formatted, but I would still recommend against that pattern, just in case a line somehow fails to match. At least anchor the expression.

    The problem is that because you are using * everywhere, there are an exponential number of ways for that match to fail. Perhaps Perl is clever enough to avoid it, but it seems to me that if you hit a single malformed line, that expression could hang.

    I'd recommend avoiding the issue by using + pretty much everywhere you have a *, and anchoring the ends with ^ and $. Also, as someone else mentioned, it would be better to get rid of all those printf's and replace them with:

    print <<"END_FORMAT";
    date time = $1
    time taken = $2
    c-ip = $3
    .
    .
    .
    END_FORMAT
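To make the anchoring and + advice above concrete, here is a minimal sketch that builds both styles of pattern from pieces and runs them on a deliberately malformed line. The names `$loose_re`, `$strict_re` and `parse_line` are made up for illustration, and using `[^"]*` for the user agent assumes that field never contains a double quote.

```perl
use strict;
use warnings;

# Sample line copied from the original post.
my $line = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service';

# Loose pattern in the style of the original: * everywhere, no anchors.
my $loose_re = do {
    my $p = join '\s*', '(\S*\s*\S*)', ('(\S*)') x 14, '(".*")', ('(\S*)') x 5;
    qr/$p/;
};

# Stricter pattern: + instead of *, anchored with ^ and $.
# Assumption: the quoted user agent never contains a double quote.
my $strict_re = do {
    my $p = join '\s+', '(\S+\s+\S+)', ('(\S+)') x 14, '("[^"]*")', ('(\S+)') x 5;
    qr/^$p\s*$/;
};

# Returns the 21 captured fields, or an empty list if the line is malformed.
sub parse_line {
    my ($text) = @_;
    return $text =~ $strict_re;
}

my $malformed = 'oops "x"';
my @good = parse_line($line);
my @bad  = parse_line($malformed);

printf "strict pattern: %d fields from the sample, %d from the malformed line\n",
    scalar @good, scalar @bad;
printf "loose pattern accepts the malformed line: %s\n",
    ($malformed =~ $loose_re ? 'yes' : 'no');
```

The point of the comparison is that the all-* pattern can succeed with mostly empty captures, so a bad line goes unnoticed, while the anchored version fails outright and can be logged or skipped.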

    It also appears that you'd be better off doing a little extra work so that you can use split instead of a regex:

    my ($user_agent) = /\"(.*)\"/;
    s/\".*\"//;
    my @F = split(/\s+/);
    print <<"END_FORMAT";
    date time = $F[0] $F[1]
    time taken = $F[2]
    .
    .
    .
    cs(Content-Type) = $F[15]
    cs(User-Agent) = $user_agent
    sc-filter-result = $F[16]
    .
    .
    .
    END_FORMAT
    Alternatively, you could try
    my @F = /(\" .*? \" | \S+)/gx;
    (but remember to strip the quotes off the relevant item.)
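Both suggestions above can be tried side by side on the sample line. This is only a sketch: the sub names `parse_with_split` and `parse_with_tokens` are invented here, and the index positions assume the field layout of the line in the original post.

```perl
use strict;
use warnings;

# Sample line copied from the original post.
my $line = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service';

# Approach 1: pull out the quoted field, delete it, then split the rest.
sub parse_with_split {
    my ($text) = @_;
    my ($user_agent) = $text =~ /"(.*)"/;   # greedy: first quote to last quote
    $text =~ s/".*"//;
    my @F = split ' ', $text;               # ' ' splits on runs of whitespace
    return ($user_agent, @F);
}

# Approach 2: tokenize into quoted strings or runs of non-whitespace.
sub parse_with_tokens {
    my ($text) = @_;
    my @F = $text =~ /(" .*? " | \S+)/gx;
    s/^"(.*)"$/$1/ for @F;                  # strip the surrounding quotes
    return @F;
}

my ($ua, @split_fields) = parse_with_split($line);
my @token_fields        = parse_with_tokens($line);

print "user agent (split):  $ua\n";
print "user agent (tokens): $token_fields[16]\n";
```

Note that the date and time come back as two separate items in both versions, so the user agent lands at index 16 of the token list, while the split list skips it and continues with sc-filter-result at index 16.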

      Many thanks for all the interest and help here. All the replies were useful.

      I really liked these constructs:

      s/\".*\"//;
      my @F = split(/\s+/);

      and

      my @F = /(\" .*? \" | \S+)/gx

      But I found the quickest solution was this:

      while (<>) {
          # Cut everything before the quoted user agent out of $_ (4-arg substr).
          $front = substr( $_, 0, index( $_, '"' ) - 1, "" );
          # $_ now starts with a space followed by the opening quote.
          $back = substr( $_, rindex( $_, '"' ) + 2 );
          $user_agent = substr( $_, 1, rindex( $_, '"' ) );    # keeps the quotes

          $front =~ /^([^#\s]+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)$/;
          print <<END_FORMAT;
      date time = $1
      time taken = $2
      c-ip = $3
      sc-status = $4
      s-action = $5
      sc-bytes = $6
      cs-bytes = $7
      cs-method = $8
      cs-uri-scheme = $9
      cs-host = $10
      cs-uri-stem = $11
      cs-username = $12
      s-hierarchy = $13
      s-supplier-name = $14
      cs(Content-Type) = $15
      cs(User-Agent) = $user_agent
      END_FORMAT

          $back =~ /^(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*$/;
          print <<END_FORMAT2;
      sc-filter-result = $1
      sc-filter-category = $2
      x-virus-id = $3
      s-ip = $4
      s-sitename = $5
      END_FORMAT2
      }

      This processed the following gzip'd log:

      bash-2.05b$ ls -l SG*
      -rwxr-xr-x 1 js js 106236830 Mar 7 17:02 SG_CSGL02_main_470302220000.log.gz

      in 1 minute 32 sec

      bash-2.05b$ time gzip -dc SG* | ./test.pl >/dev/null
      6.96user 0.63system 1:32.95elapsed 8%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (94major+33minor)pagefaults 0swaps

      on a 2.6 GHz AMD processor (500MB).
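For comparison, the same index/rindex idea can be written without modifying `$_` and with an explicit check for malformed lines. This is a variation, not the code that was timed above: `parse_with_index` is a made-up name, it uses split on the two halves instead of regexes, and the expected counts (16 tokens before the user agent, 5 after) are read off the sample line.

```perl
use strict;
use warnings;

# Sample line copied from the original post.
my $line = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service';

# Returns 22 items (date and time separately, user agent unquoted),
# or an empty list if the line does not have the expected shape.
sub parse_with_index {
    my ($text) = @_;
    my $open  = index( $text, '"' );
    my $close = rindex( $text, '"' );
    return if $open < 0 || $close <= $open;    # no quoted field at all

    my $user_agent = substr( $text, $open + 1, $close - $open - 1 );
    my @front = split ' ', substr( $text, 0, $open );
    my @back  = split ' ', substr( $text, $close + 1 );
    return if @front != 16 || @back != 5;      # counts assumed from the sample

    return ( @front, $user_agent, @back );
}

my @fields = parse_with_index($line);
print "parsed ", scalar(@fields), " fields\n";
print "user agent: $fields[16]\n" if @fields;
```

Whether this is faster than the regex version on real logs would need measuring; the timing above suggests the run is dominated by something other than CPU (8%CPU), so the parsing strategy may matter less than it seems.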

Re^3: fast greedy regex
by Roy Johnson (Monsignor) on Jun 07, 2004 at 22:51 UTC
    I don't see any reason that a more explicit regex would speed it up. The only obvious speed benefit from a more explicit match is that failure could happen sooner.

    For speed, I would expect that one call to print, rather than multiple calls to printf, would be something of a speedup.
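As a sketch of the single-print idea, the labels can be kept in an array and the whole record formatted into one string before printing. The helper name `format_record` is invented here; the labels are the ones from the original post.

```perl
use strict;
use warnings;

# Field labels in capture order, taken from the original post.
my @labels = (
    'date time',        'time taken',         'c-ip',
    'sc-status',        's-action',           'sc-bytes',
    'cs-bytes',         'cs-method',          'cs-uri-scheme',
    'cs-host',          'cs-uri-stem',        'cs-username',
    's-hierarchy',      's-supplier-name',    'cs(Content-Type)',
    'cs(User-Agent)',   'sc-filter-result',   'sc-filter-category',
    'x-virus-id',       's-ip',               's-sitename',
);

# Build the whole record as one string so the caller needs a single print.
# Returns an empty string if the number of values does not match the labels.
sub format_record {
    my @values = @_;
    return '' unless @values == @labels;
    return join '', map { "$labels[$_] = $values[$_]\n" } 0 .. $#labels;
}

# Placeholder values 1..21 stand in for the captures $1..$21.
print format_record( 1 .. 21 );
```

In the parsing loop this would be called as `print format_record(@fields)` with the list returned by the match, which replaces the 21 printf calls with one print per line.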


    The PerlMonk tr/// Advocate
Re^3: fast greedy regex
by CountZero (Bishop) on Jun 08, 2004 at 06:18 UTC
    One unexpected space somewhere in this string and you are toast.

    So don't re-invent the (broken) wheel; use a module such as Regexp::Log or Logfile or their derived classes.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law