theAcolyte has asked for the wisdom of the Perl Monks concerning the following question:

Bonjour! (feeling international today)

I've got a regEx which is working fine, but it seems, somehow, inelegant. I've been poring over the documentation to find a better way to write this, but I haven't come up with anything yet.

While I've wrapped my mind around the basics, I'm always striving to get better, so I thought I'd ask: is there a better way to do this?

Here is the (single line) I'm matching against:

[Tue Apr 20 04:51:19 2004] [error] [client 68.103.45.137] File does not exist: /home/virtual/site6/fst/var/www/html/robots.txt

And here is my regEx:

\[client[\w\s.]{12,16}\](.*?$)

$1 captures the actual error message: in this example, everything from "File does not exist ..." to the end of the line.
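
For reference, here is a minimal sketch of that pattern run against the sample line. The variable names `$line` and `$error` are assumptions made for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Sample log line from the question.
my $line = '[Tue Apr 20 04:51:19 2004] [error] [client 68.103.45.137] '
         . 'File does not exist: /home/virtual/site6/fst/var/www/html/robots.txt';

# Match the "[client ...]" field, then capture the rest of the line.
my ($error) = $line =~ /\[client[\w\s.]{12,16}\](.*?$)/;

# Brackets in the output make the leading space in the capture visible.
print "[$error]\n";
```

Note that the capture starts with the space that follows the closing bracket. Also, `{12,16}` counts the space plus the address, and a space plus a dotted-quad IPv4 address can be as short as 8 characters, so a short address such as 8.8.8.8 would not match this pattern.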

I had thought there must be a way to say "match the very last ] you find, until the end of the line", but I haven't been able to figure it out. It just seems wasteful to me to have to match all that client 102.102.102.102 stuff to grab the part I actually want.

So, anyone care to enlighten me on the more elegant solution? Eagerly awaiting your Monkish responses ...

theAcolyte

Replies are listed 'Best First'.
Re: Is there a better way to write this RegEx?
by japhy (Canon) on Apr 20, 2004 at 12:55 UTC
    You could do:

        my ($error) = $line =~ /.*](.*)/;

    That basically matches everything after the last closing bracket in the line. However, you could just use functions for that:

        my $error = substr($line, rindex($line, ']') + 1);

    Or consider using a log parser.
    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
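
A small sketch comparing the two suggestions above on the sample line from the question. The names `$by_regex`, `$by_substr`, and `$trimmed` are made up for this illustration, and the final trimming step is an addition, not part of the reply.

```perl
use strict;
use warnings;

# Sample log line from the question.
my $line = '[Tue Apr 20 04:51:19 2004] [error] [client 68.103.45.137] '
         . 'File does not exist: /home/virtual/site6/fst/var/www/html/robots.txt';

# Regex form: capture whatever follows the last ']' in the line.
my ($by_regex) = $line =~ /.*](.*)/;

# String-function form: locate the last ']' and take everything after it.
my $by_substr = substr($line, rindex($line, ']') + 1);

# Both results keep the space after the bracket; strip it if it is unwanted.
(my $trimmed = $by_substr) =~ s/^\s+//;

print "regex:   [$by_regex]\n";
print "substr:  [$by_substr]\n";
print "trimmed: [$trimmed]\n";
```

The two forms behave differently when the line contains no `]` at all: the regex fails and leaves `$by_regex` undefined, while `rindex` returns -1, so the `substr` call returns the whole line.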
      Of course the flaw in this whole technique is if the error message actually contains a right-bracket, you won't get the whole thing. If you know this will never happen, that won't be a problem, but the original code is somewhat more robust in the face of unknown input.
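
To make that caveat concrete, here is a sketch using a made-up log line whose message contains brackets. The message text "Invalid value [foo] in request" is hypothetical and not taken from any real log.

```perl
use strict;
use warnings;

# Hypothetical log line: the error message itself contains a ']'.
my $line = '[Tue Apr 20 04:51:19 2004] [error] [client 68.103.45.137] '
         . 'Invalid value [foo] in request';

# "Everything after the last ]" cuts the message short.
my ($last_bracket) = $line =~ /.*](.*)/;

# The original pattern anchors on the [client ...] field and keeps the whole message.
my ($after_client) = $line =~ /\[client[\w\s.]{12,16}\](.*?$)/;

print "last bracket: [$last_bracket]\n";   # prints "last bracket: [ in request]"
print "after client: [$after_client]\n";   # prints "after client: [ Invalid value [foo] in request]"
```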
      That seems to work, despite the fact I was under the impression that the first .* would gobble up the entire line, being greedy and all.

      Also, I guess I should have said I'm attempting to avoid .* ... learning experience, I guess. But, yes, this seems like a much more elegant solution than the one I had written, by far.

      I'm still trying to figure out why the .* doesn't grab the entire line ... because of the second .*, I guess?

      And, I'm going to see if I can find rindex in the perldocs because, frankly, I've never seen/heard of it before. :)

      - Erik
      theAcolyte

      P.S. I know there are log parsers on CPAN, but I'm just mucking around ... I have a full script doing what I want and working just fine ... but I noticed how ugly my regEx was and wanted to see how to improve on it :)

        The first .* doesn't end up matching the whole string because a regex is like a persistent ex: it doesn't want to lose. Regexes try hard to match. In /.*](.*)/, the first .* matches as much of the string as possible, and then the engine tries to match a bracket. When it finds it can't, it backs up to the last bracket it passed, and then tries matching the rest of the regex. This process is called "backtracking" and is an integral part of how Perl's regular expression engine works.

        I could tell you that it backtracks one character at a time until it finds a bracket, but that's not true. It's optimized in a case like this to jump backwards to the bracket all at once.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
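
One way to see the effect of backtracking is to capture what the first .* ends up holding. This sketch adds a capture group around it, and contrasts it with a non-greedy .*? for comparison; both additions are for illustration only and are not part of the reply above.

```perl
use strict;
use warnings;

# Sample log line from the question.
my $line = '[Tue Apr 20 04:51:19 2004] [error] [client 68.103.45.137] '
         . 'File does not exist: /home/virtual/site6/fst/var/www/html/robots.txt';

# Greedy: the first group grabs the whole line, then gives characters back
# until a ']' can match, so it ends just before the LAST ']'.
my ($greedy_head, $greedy_tail) = $line =~ /(.*)](.*)/;

# Non-greedy: the first group grows only as far as it must,
# so it ends just before the FIRST ']'.
my ($lazy_head, $lazy_tail) = $line =~ /(.*?)](.*)/;

print "greedy head: $greedy_head\n";  # [Tue Apr 20 04:51:19 2004] [error] [client 68.103.45.137
print "lazy head:   $lazy_head\n";    # [Tue Apr 20 04:51:19 2004
```

In both cases the second group takes whatever remains after the chosen bracket, which is why only the greedy form isolates the error message.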
Re: Is there a better way to write this RegEx?
by pbeckingham (Parson) on Apr 20, 2004 at 13:16 UTC

    This captures a greedy block of non-] characters, following a ] and optional whitespace, up to the end of the line:

    /\]\s*([^\]]+)$/
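
A minimal sketch of this pattern applied to the sample line from the question; the variable names `$line` and `$error` are assumptions for illustration.

```perl
use strict;
use warnings;

# Sample log line from the question.
my $line = '[Tue Apr 20 04:51:19 2004] [error] [client 68.103.45.137] '
         . 'File does not exist: /home/virtual/site6/fst/var/www/html/robots.txt';

# A ']', optional whitespace, then a run of non-']' characters up to the end.
# Only the last ']' can satisfy this, because the run must reach the end
# of the line without crossing another ']'.
my ($error) = $line =~ /\]\s*([^\]]+)$/;

print "[$error]\n";  # prints "[File does not exist: /home/virtual/site6/fst/var/www/html/robots.txt]"
```

Because the `\s*` sits outside the capture group, the result has no leading space. As with any approach keyed to the last bracket, a `]` inside the error message would shorten the capture.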