in reply to regex regexen

I'm not sure what you want to do, is it that you want to find a slash delimited string, with some trailing, optional chars, all of it surrounded by brackets? In that case, the hard part is finding the delimited string. Now, the easy thing is to use Regexp::Common and use $RE {delimited} {-delim => '/'}, but that doesn't help you much.

So, here I present Abigail's quick guide to delimited string matching.

Quick Guide To Matching Delimited Strings ========================================= Suppose you want to match a delimited string, that is, a string starts with a delimiter, and end with that delimiter, and doesn't have that delimiter in between, except when it's 'escaped' with a backslash. Let's try to match a string that is delimited with double quotes. A first try may be: /".*"/ that will indeed match a string that starts with a double quote, and ends with a double quote. But if you match it against the string: A "red" fish and a "blue" fish. it will match: "red" fish and a "blue" which is not what we wanted. The problem is the greedyness; the given regex will match from the first to the last double quote. Then we migh +t try: /".*?"/ with the idea that if greediness is our problem, we just match non-gre +edy. This will work well, until we try to find a string that has an escaped delimiter. If we match against: The book is called "A 5\" dwarf". we find: "A 5\" We haven't taken escaped delimiters into account. We first have to take a closer look at what a delimited string really +is. We start with a delimiter, then we have zero or more times characters that aren't delimiters *or* an escaped delimiter, and at the end we ha +ve a delimiter. Taking that into account, we'll try: /"([^"]|\\")*"/ # The double \ is needed to escape it in the reg +ex. But that still gives us problems when matching that against: The book is called "A 5\" dwarf". We still get: "A 5\" The problem is that [^"] matches the backslash as, and the right hand side of the |, the \\" part, is never tried. Putting \\" first appears to solve our problem, when matching that against The book is called "A 5\" dwarf". we do indeed get the wanted "A 5\" dwarf" Does that mean this is that /"(\\"|[^"])*"/ is the correct regexp? No, if we use that regexp against: A backslash ("\\") and a forwardslash ("/") it will match: "\\") and a forwardslash (" How did that happen? Well, we didn't take into account that the escape character can be escaped as well. So, the first backslash was matched by [^"], then the second backslash and the following double quote was matched by \\" and then we matched all the way till the next double qu +ote. The solution lies into not only special casing delimiters, but also backslashes. A delimited string consists of delimiters at both ends, and in between either characters that aren't delimiters or escape char +s, or pairs of chars where the first one is a delimiter. Or, in a regex: /"([^"\\]|\\.)*"/ This regex will match the "\\" from A backslash ("\\") and a forwardslash ("/") Speed ===== We can speed up the regex /"([^"\\]|\\.)*"/ a bit. The above regex has to take a branch decision at every characte +r between the delimiters. We can speed it up by realizing that between the delimiters we have zero or more times either a non-empty sequence +of characters that aren't double quotes or backslashes, or a backslash followed by another character. We then arrive at: /"([^"\\]+|\\.)*"/ This technique is called 'loop unrolling'. And we can even go one step further, by viewing the string between the delimiters as (possibly emp +ty) sequences of non double quote, non backslash characters, separated by pairs of characters, of which the first is a backslash. Pfeew. A mouth full. But it gives us: /"[^"\\]*(\\.[^"\\]*)*"/ Note that this regex doesn't contain a | anymore. So, the regexp machi +ne can't first take one branch of which it needs to backtrack. Final Notes =========== * Some of the given regexes use the . wildcard. The dot doesn't match a newline. In cases where you may a newline between your delimiters +, use the /s regexp modifier. * This technique can be extended to delimiters that are longer than o +ne character, and/or where the opening delimiter is different than the closing delimiter.

Abigail

Replies are listed 'Best First'.
Re: regex regexen
by Abigail-II (Bishop) on Oct 31, 2003 at 10:57 UTC
    I mentioned a possible speed-up when 'unrolling' the regexp. Here's a small benchmark backing up that claim:
    #!/usr/bin/perl use strict; use warnings; use Benchmark 'cmpthese'; our $re1 = qr /"([^"\\]|\\.)*"/; our $re2 = qr /"([^"\\]+|\\.)*"/; our $re3 = qr /"[^"\\]*(\\.[^"\\]*)*"/; chomp (our @data = <DATA>); cmpthese -5 => { plain => '/$re1/ for @data', unroll1 => '/$re2/ for @data', unroll2 => '/$re3/ for @data', }; __DATA__ A "red" fish A "red" fish and a "blue" fish The book is called "A 5\" dwarf" A backslash ("\\") and a forwardslash ("/") No delimiters here at all.

    The results:

    Rate plain unroll1 unroll2 plain 50621/s -- -20% -38% unroll1 63260/s 25% -- -22% unroll2 81507/s 61% 29% --

    Abigail