in reply to fast greedy regex

It depends on how finely parsed you need it to be. Your first example has basically three subsections. Your second example is much more broken-out. Neither of them captures anything.

If you just wanted the date and time, you could do ($date, $time) = split / /.
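As a minimal sketch of that split approach, assuming a shortened copy of the sample log line from this thread: the limit of 3 is an addition here, not part of the one-liner above, and keeps the remainder of the line unsplit in a hypothetical `$rest` variable.

```perl
use strict;
use warnings;

# Shortened sample line in the log format discussed in this thread.
my $line = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com';

# A limit of 3 stops splitting after the second field, so everything
# after the time stays together in $rest (a name assumed for illustration).
my ($date, $time, $rest) = split / /, $line, 3;

print "date = $date\n";
print "time = $time\n";
```

If the remaining fields are needed later, `$rest` can be split again or matched separately.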


The PerlMonk tr/// Advocate

Re^2: fast greedy regex
by js1 (Monk) on Jun 07, 2004 at 22:19 UTC

    Just to give an idea of what I'm doing this is a line from the log:

    2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service

    And my regex is like this:

    while (<>){
        /(\S*\s*\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\".*\")\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)\s*(\S*)/;
        #print "\n";
        #printf "\ndate time = %s",$1;
        #printf "\ntime taken = %s",$2;
        #printf "\nc-ip = %s",$3;
        #printf "\nsc-status = %s",$4;
        #printf "\ns-action = %s",$5;
        #printf "\nsc-bytes = %s",$6;
        #printf "\ncs-bytes = %s",$7;
        #printf "\ncs-method = %s",$8;
        #printf "\ncs-uri-scheme = %s",$9;
        #printf "\ncs-host = %s",$10;
        #printf "\ncs-uri-stem = %s",$11;
        #printf "\ncs-username = %s",$12;
        #printf "\ns-hierarchy = %s",$13;
        #printf "\ns-supplier-name = %s",$14;
        #printf "\ncs(Content-Type)= %s",$15;
        #printf "\ncs(User-Agent) = %s",$16;
        #printf "\nsc-filter-result = %s",$17;
        #printf "\nsc-filter-category = %s",$18;
        #printf "\nx-virus-id = %s",$19;
        #printf "\ns-ip = %s",$20;
        #printf "\ns-sitename = %s",$21;
    }

    Can you see whether a more explicit regex would speed the parse up?

    Thanks,

    js1.

      Ouch. Perhaps all of your log lines are perfectly formatted, but I would still recommend not doing that just in case you somehow have something fail. At least anchor the expression.

      The problem is that because you are using * everywhere, there are an exponential number of ways for that match to fail. Perhaps Perl is clever enough to avoid it, but it seems to me that if you hit a single malformed line, that expression could hang.

      I'd recommend avoiding the issue by using + pretty much everywhere you have a *, and anchoring the ends with ^ and $. Also, as someone else mentioned, it would be better to get rid of all those printf's and replace them with:

      print <<"END_FORMAT";
      date time = $1
      time taken = $2
      c-ip = $3
      .
      .
      .
      END_FORMAT
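      To illustrate the anchoring advice, here is a minimal sketch that assumes a hypothetical three-field pattern (date, time, time taken) rather than the full 21-field log line. With ^, $, and + in place of *, a line with a missing field simply fails to match instead of matching with empty captures. The helper name parse_line is made up for the example.

      ```perl
      use strict;
      use warnings;

      # Hypothetical shortened pattern: date, time, and time-taken only.
      # ^ and $ anchor both ends; \S+ and \s+ each need at least one character.
      my $re = qr/^(\S+)\s+(\S+)\s+(\S+)$/;

      # Returns an array ref of captures, or undef for a malformed line.
      sub parse_line {
          my ($line) = @_;
          my @fields = $line =~ $re;
          return @fields ? \@fields : undef;
      }

      my $good = parse_line('2004-03-01 22:00:12 2');
      my $bad  = parse_line('2004-03-01 22:00:12');    # one field missing

      print defined $good ? "good: @$good\n" : "good: no match\n";
      print defined $bad  ? "bad: @$bad\n"   : "bad: no match\n";
      ```

      Checking the return value lets the loop skip or report a bad line rather than printing stale or empty captures.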

      It also appears that you'd be better off doing a little extra work so that you can use split instead of a regex:

      my ($user_agent) = /\"(.*)\"/;
      s/\".*\"//;
      my @F = split(/\s+/);
      print <<"END_FORMAT";
      date time = $F[0] $F[1]
      time taken = $F[2]
      .
      .
      .
      cs(Content-Type) = $F[15]
      cs(User-Agent) = $user_agent
      sc-filter-result = $F[16]
      .
      .
      .
      END_FORMAT

      Alternatively, you could try
      my @F = /(\" .*? \" | \S+)/gx;
      (but remember to strip the quotes off the user-agent item.)
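      A minimal sketch of that tokenizing match, run against the sample line posted earlier in the thread. The substitution that strips the surrounding double quotes is an assumption about what the cleanup step should look like, not code from the posts above.

      ```perl
      use strict;
      use warnings;

      # Sample log line as posted earlier in this thread.
      my $line = '2004-03-01 22:00:12 2 15.32.17.34 200 TCP_HIT 3140 326 GET http www.wahm.com http://www.wahm.com/images/vote.gif u779479 DEFAULT_PARENT 61.2.249.106 - "Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)" OBSERVED none - 61.2.249.47 SG-HTTP-Service';

      # Try a double-quoted string first; otherwise take a run of non-whitespace.
      my @F = $line =~ /(\" .*? \" | \S+)/gx;

      # Remove the surrounding double quotes from any quoted token (assumed cleanup).
      s/^"(.*)"$/$1/ for @F;

      print scalar(@F), " fields\n";
      print "cs(User-Agent) = $F[16]\n";
      ```

      Note that date and time come back as separate items here, so the user agent lands at index 16 rather than 15.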

        Many thanks for all the interest and help here. All the replies were useful.

        I really liked these constructs:

        s/\".*\"//;
        my @F = split(/\s+/);

        and

        my @F = /(\" .*? \" | \S+)/gx

        But I found the quickest solution was this:

        while (<>){
            # 4-arg substr removes everything before the user agent from $_ and returns it
            $front = substr( $_, 0, index($_, '"' )-1, "");
            $back = substr( $_, rindex( $_, '"' )+2);
            # user agent, surrounding quotes included
            $user_agent = substr ($_, 1, rindex( $_, '"' ));
            $front=~/^([^#\s]+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)$/;
            print <<END_FORMAT;
        date time = $1
        time taken = $2
        c-ip = $3
        sc-status = $4
        s-action = $5
        sc-bytes = $6
        cs-bytes = $7
        cs-method = $8
        cs-uri-scheme = $9
        cs-host = $10
        cs-uri-stem = $11
        cs-username = $12
        s-hierarchy = $13
        s-supplier-name = $14
        cs(Content-Type) = $15
        cs(User-Agent) = $user_agent
        END_FORMAT
            $back=~/^(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*$/;
            print <<END_FORMAT2;
        sc-filter-result = $1
        sc-filter-category = $2
        x-virus-id = $3
        s-ip = $4
        s-sitename = $5
        END_FORMAT2
        }

        This processed the following gzip'd log:

        bash-2.05b$ ls -l SG*
        -rwxr-xr-x 1 js js 106236830 Mar 7 17:02 SG_CSGL02_main_470302220000.log.gz

        in 1 minute 32 sec

        bash-2.05b$ time gzip -dc SG* | ./test.pl >/dev/null
        6.96user 0.63system 1:32.95elapsed 8%CPU (0avgtext+0avgdata 0maxresident)k
        0inputs+0outputs (94major+33minor)pagefaults 0swaps

        on a 2.6 GHz AMD processor (500 MB).

      I don't see any reason that a more explicit regex would speed it up. The only obvious speed benefit from a more explicit match is that failure could happen sooner.

      For speed, I would expect that one call to print, rather than multiple calls to printf, would be something of a speedup.
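      A minimal sketch of the single-print idea, assuming hypothetical @labels and @values arrays standing in for the field names and captured values. It builds the whole record as one string and prints it once. Whether this is measurably faster would need benchmarking on the real log.

      ```perl
      use strict;
      use warnings;

      # Hypothetical labels and values standing in for the captured fields.
      my @labels = ('date time', 'time taken', 'c-ip');
      my @values = ('2004-03-01 22:00:12', '2', '15.32.17.34');

      # Build the whole record first, then hand it to print in one call.
      my $record = join '', map { "$labels[$_] = $values[$_]\n" } 0 .. $#labels;
      print $record;
      ```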


      The PerlMonk tr/// Advocate
      One unexpected space somewhere in this string and you are toast.

      So don't re-invent the (broken) wheel and use a module such as Regexp::Log or Logfile or their derived classes.

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law