comment on

I'm not sure what you want to do, is it that you want to find a slash delimited string, with some trailing, optional chars, all of it surrounded by brackets? In that case, the hard part is finding the delimited string. Now, the easy thing is to use Regexp::Common and use $RE {delimited} {-delim => '/'}, but that doesn't help you much.

So, here I present Abigail's quick guide to delimited string matching.

Quick Guide To Matching Delimited Strings
=========================================

Suppose you want to match a delimited string, that is, a string starts
with a delimiter, and end with that delimiter, and doesn't have that
delimiter in between, except when it's 'escaped' with a backslash.
    
Let's try to match a string that is delimited with double quotes.
A first try may be:

    /".*"/

that will indeed match a string that starts with a double quote, and
ends with a double quote. But if you match it against the string:

    A "red" fish and a "blue" fish.

it will match:

    "red" fish and a "blue"
    
which is not what we wanted. The problem is the greedyness; the given
regex will match from the first to the last double quote. Then we migh
+t
try:

    /".*?"/
    
with the idea that if greediness is our problem, we just match non-gre
+edy.
This will work well, until we try to find a string that has an escaped
delimiter. If we match against:
    
    The book is called "A 5\" dwarf".

we find:

    "A 5\"

We haven't taken escaped delimiters into account.

We first have to take a closer look at what a delimited string really 
+is.
We start with a delimiter, then we have zero or more times characters
that aren't delimiters *or* an escaped delimiter, and at the end we ha
+ve
a delimiter. Taking that into account, we'll try:

    /"([^"]|\\")*"/   # The double \ is needed to escape it in the reg
+ex.
    
But that still gives us problems when matching that against:

    The book is called "A 5\" dwarf".

We still get:
    
    "A 5\"

The problem is that [^"] matches the backslash as, and the right hand
side of the |, the \\" part, is never tried. Putting \\" first appears
to solve our problem, when matching that against

    The book is called "A 5\" dwarf".

we do indeed get the wanted

    "A 5\" dwarf"

Does that mean this is that
    
    /"(\\"|[^"])*"/

is the correct regexp? No, if we use that regexp against:

    A backslash ("\\") and a forwardslash ("/")
    
it will match:

    "\\") and a forwardslash ("

How did that happen? Well, we didn't take into account that the escape
character can be escaped as well. So, the first backslash was matched
by [^"], then the second backslash and the following double quote was
matched by \\" and then we matched all the way till the next double qu
+ote.

The solution lies into not only special casing delimiters, but also
backslashes. A delimited string consists of delimiters at both ends,
and in between either characters that aren't delimiters or escape char
+s,
or pairs of chars where the first one is a delimiter. Or, in a regex:
    
    /"([^"\\]|\\.)*"/

This regex will match the

    "\\"
    
from

    A backslash ("\\") and a forwardslash ("/")


Speed
=====

We can speed up the regex

    /"([^"\\]|\\.)*"/

a bit. The above regex has to take a branch decision at every characte
+r
between the delimiters. We can speed it up by realizing that between
the delimiters we have zero or more times either a non-empty sequence 
+of
characters that aren't double quotes or backslashes, or a backslash
followed by another character. We then arrive at:

    /"([^"\\]+|\\.)*"/
    
This technique is called 'loop unrolling'. And we can even go one step
further, by viewing the string between the delimiters as (possibly emp
+ty)
sequences of non double quote, non backslash characters, separated by
pairs of characters, of which the first is a backslash. Pfeew. A mouth
full. But it gives us:

    /"[^"\\]*(\\.[^"\\]*)*"/

Note that this regex doesn't contain a | anymore. So, the regexp machi
+ne
can't first take one branch of which it needs to backtrack.
    

Final Notes
===========

 * Some of the given regexes use the . wildcard. The dot doesn't match
   a newline. In cases where you may a newline between your delimiters
+,
   use the /s regexp modifier.
    
 * This technique can be extended to delimiters that are longer than o
+ne
   character, and/or where the opening delimiter is different than the
   closing delimiter.
[download]

Abigail

In reply to Re: regex regexen by Abigail-II
in thread regex regexen by mooresan

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.