I'm not sure what you want to do, is it that you want to find
a slash delimited string, with some trailing, optional chars,
all of it surrounded by brackets? In that case, the hard part
is finding the delimited string. Now, the easy thing is to
use Regexp::Common and use $RE {delimited} {-delim => '/'}, but that doesn't help you much.
So, here I present Abigail's quick guide to delimited string
matching.
Quick Guide To Matching Delimited Strings
=========================================
Suppose you want to match a delimited string, that is, a string starts
with a delimiter, and end with that delimiter, and doesn't have that
delimiter in between, except when it's 'escaped' with a backslash.
Let's try to match a string that is delimited with double quotes.
A first try may be:
/".*"/
that will indeed match a string that starts with a double quote, and
ends with a double quote. But if you match it against the string:
A "red" fish and a "blue" fish.
it will match:
"red" fish and a "blue"
which is not what we wanted. The problem is the greedyness; the given
regex will match from the first to the last double quote. Then we migh
+t
try:
/".*?"/
with the idea that if greediness is our problem, we just match non-gre
+edy.
This will work well, until we try to find a string that has an escaped
delimiter. If we match against:
The book is called "A 5\" dwarf".
we find:
"A 5\"
We haven't taken escaped delimiters into account.
We first have to take a closer look at what a delimited string really
+is.
We start with a delimiter, then we have zero or more times characters
that aren't delimiters *or* an escaped delimiter, and at the end we ha
+ve
a delimiter. Taking that into account, we'll try:
/"([^"]|\\")*"/ # The double \ is needed to escape it in the reg
+ex.
But that still gives us problems when matching that against:
The book is called "A 5\" dwarf".
We still get:
"A 5\"
The problem is that [^"] matches the backslash as, and the right hand
side of the |, the \\" part, is never tried. Putting \\" first appears
to solve our problem, when matching that against
The book is called "A 5\" dwarf".
we do indeed get the wanted
"A 5\" dwarf"
Does that mean this is that
/"(\\"|[^"])*"/
is the correct regexp? No, if we use that regexp against:
A backslash ("\\") and a forwardslash ("/")
it will match:
"\\") and a forwardslash ("
How did that happen? Well, we didn't take into account that the escape
character can be escaped as well. So, the first backslash was matched
by [^"], then the second backslash and the following double quote was
matched by \\" and then we matched all the way till the next double qu
+ote.
The solution lies into not only special casing delimiters, but also
backslashes. A delimited string consists of delimiters at both ends,
and in between either characters that aren't delimiters or escape char
+s,
or pairs of chars where the first one is a delimiter. Or, in a regex:
/"([^"\\]|\\.)*"/
This regex will match the
"\\"
from
A backslash ("\\") and a forwardslash ("/")
Speed
=====
We can speed up the regex
/"([^"\\]|\\.)*"/
a bit. The above regex has to take a branch decision at every characte
+r
between the delimiters. We can speed it up by realizing that between
the delimiters we have zero or more times either a non-empty sequence
+of
characters that aren't double quotes or backslashes, or a backslash
followed by another character. We then arrive at:
/"([^"\\]+|\\.)*"/
This technique is called 'loop unrolling'. And we can even go one step
further, by viewing the string between the delimiters as (possibly emp
+ty)
sequences of non double quote, non backslash characters, separated by
pairs of characters, of which the first is a backslash. Pfeew. A mouth
full. But it gives us:
/"[^"\\]*(\\.[^"\\]*)*"/
Note that this regex doesn't contain a | anymore. So, the regexp machi
+ne
can't first take one branch of which it needs to backtrack.
Final Notes
===========
* Some of the given regexes use the . wildcard. The dot doesn't match
a newline. In cases where you may a newline between your delimiters
+,
use the /s regexp modifier.
* This technique can be extended to delimiters that are longer than o
+ne
character, and/or where the opening delimiter is different than the
closing delimiter.
Abigail | [reply] [d/l] [select] |
I mentioned a possible speed-up when 'unrolling' the regexp.
Here's a small benchmark backing up that claim:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark 'cmpthese';
our $re1 = qr /"([^"\\]|\\.)*"/;
our $re2 = qr /"([^"\\]+|\\.)*"/;
our $re3 = qr /"[^"\\]*(\\.[^"\\]*)*"/;
chomp (our @data = <DATA>);
cmpthese -5 => {
plain => '/$re1/ for @data',
unroll1 => '/$re2/ for @data',
unroll2 => '/$re3/ for @data',
};
__DATA__
A "red" fish
A "red" fish and a "blue" fish
The book is called "A 5\" dwarf"
A backslash ("\\") and a forwardslash ("/")
No delimiters here at all.
The results:
Rate plain unroll1 unroll2
plain 50621/s -- -20% -38%
unroll1 63260/s 25% -- -22%
unroll2 81507/s 61% 29% --
Abigail | [reply] [d/l] [select] |
Could it be that the "\]" in
/.*?[^\]/[$regex_option_list]* #slash delimited regex
is confusing the parser? You have the meta-delimiters set to escaped square brackets, so that would be the first place I'd look.
Just my $.02.
--
tbone1
Ain't enough 'O's in 'stoopid' to describe that guy.
- Dave "the King" Wilson
| [reply] [d/l] |
This was the problem, thank you...
I was having trouble thinking through some of the escaping. It does need to be [^\\]\/ at the end.
For the curious: I am implementing and "ed" replacement that will make use of perl6ish regexen and a number of other features. This is a piece from my grammar parser. the real regex part looks like:
m<(?::[$regex_options]*)?[?/].*?[^\\][?/]>
| [reply] [d/l] [select] |