in reply to really large regex misbehaving

I have constructed a regex which is far more readable, and does the job (on some simple test cases from your post).

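It breaks the regex into three parts: single-quoted strings, double-quoted strings, and all others. The single- and double-quoted string parts are very similar. The logic used is:

  1. Match the opening quote.
  2. Match as many ordinary characters as possible (anything except the quote, a backslash, or a question mark), without backtracking.
  3. Zero or more times, match one special item followed by another run of ordinary characters. A special item is an escape (a backslash or the trigraph ??/, followed by any character), the trigraph ??', or a lone ? that does not start ??' or ??/.
  4. Match the closing quote.

If neither quoted-string part can match, then we use the other part. This is a lengthy post, so...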

$REx = qr{
    '
    (?> [^'\\?]* )
    (?:
      (?: (?: \\ | \?\?/ ) . | \?\?' | \? (?! \? ['/] ) )
      (?> [^'\\?]* )
    )*
    '
  |
    "
    (?> [^"\\?]* )
    (?:
      (?: (?: \\ | \?\?/ ) . | \?\?' | \? (?! \? ['/] ) )
      (?> [^"\\?]* )
    )*
    "
  |
    (?:
      (?! / [/*] )
      (?: \?\?['/] | \? (?! \? ['/] ) | (?> [^?'"\s]+ ) )
    )+
}x;

And here's the explain output:

(?x-ims:              # group, but do not capture (disregarding whitespace and comments) (case-sensitive) (with ^ and $ matching normally) (with . not matching \n):
  '                   # '\''
  (?>                 # match (and do not backtrack afterwards):
    [^'\\?]*          # any character except: ''', '\\', '?' (0 or more times (matching the most amount possible))
  )                   # end of look-ahead
  (?x:                # group, but do not capture (0 or more times (matching the most amount possible)):
    (?x:              # group, but do not capture:
      (?x:            # group, but do not capture:
        \\            # '\'
       |              # OR
        \?            # '?'
        \?            # '?'
        /             # '/'
      )               # end of grouping
      .               # any character except \n
     |                # OR
      \?              # '?'
      \?              # '?'
      '               # '\''
     |                # OR
      \?              # '?'
      (?!             # look ahead to see if there is not:
        \?            # '?'
        ['/]          # any character of: ''', '/'
      )               # end of look-ahead
    )                 # end of grouping
    (?>               # match (and do not backtrack afterwards):
      [^'\\?]*        # any character except: ''', '\\', '?' (0 or more times (matching the most amount possible))
    )                 # end of look-ahead
  )*                  # end of grouping
  '                   # '\''
 |                    # OR
  "                   # '"'
  (?>                 # match (and do not backtrack afterwards):
    [^"\\?]*          # any character except: '"', '\\', '?' (0 or more times (matching the most amount possible))
  )                   # end of look-ahead
  (?x:                # group, but do not capture (0 or more times (matching the most amount possible)):
    (?x:              # group, but do not capture:
      (?x:            # group, but do not capture:
        \\            # '\'
       |              # OR
        \?            # '?'
        \?            # '?'
        /             # '/'
      )               # end of grouping
      .               # any character except \n
     |                # OR
      \?              # '?'
      \?              # '?'
      '               # '\''
     |                # OR
      \?              # '?'
      (?!             # look ahead to see if there is not:
        \?            # '?'
        ['/]          # any character of: ''', '/'
      )               # end of look-ahead
    )                 # end of grouping
    (?>               # match (and do not backtrack afterwards):
      [^"\\?]*        # any character except: '"', '\\', '?' (0 or more times (matching the most amount possible))
    )                 # end of look-ahead
  )*                  # end of grouping
  "                   # '"'
 |                    # OR
  (?x:                # group, but do not capture (1 or more times (matching the most amount possible)):
    (?!               # look ahead to see if there is not:
      /               # '/'
      [/*]            # any character of: '/', '*'
    )                 # end of look-ahead
    (?x:              # group, but do not capture:
      \?              # '?'
      \?              # '?'
      ['/]            # any character of: ''', '/'
     |                # OR
      \?              # '?'
      (?!             # look ahead to see if there is not:
        \?            # '?'
        ['/]          # any character of: ''', '/'
      )               # end of look-ahead
     |                # OR
      (?>             # match (and do not backtrack afterwards):
        [^?'"\s]+     # any character except: '?', ''', '"', whitespace (\n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible))
      )               # end of look-ahead
    )                 # end of grouping
  )+                  # end of grouping
)                     # end of grouping
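
For anyone who wants to try it, here is a minimal sketch of how the pattern might be used to pull tokens out of one line of C source. The sample line and the variable names $line and @tokens are made up for illustration and are not from the original question. The pattern is repeated so the block runs on its own; this copy excludes the backslash from every run of ordinary characters in both quoted branches.

```perl
use strict;
use warnings;

# Three alternatives: single-quoted string, double-quoted string, anything else.
my $REx = qr{
    '
    (?> [^'\\?]* )
    (?:
      (?: (?: \\ | \?\?/ ) . | \?\?' | \? (?! \? ['/] ) )
      (?> [^'\\?]* )
    )*
    '
  |
    "
    (?> [^"\\?]* )
    (?:
      (?: (?: \\ | \?\?/ ) . | \?\?' | \? (?! \? ['/] ) )
      (?> [^"\\?]* )
    )*
    "
  |
    (?:
      (?! / [/*] )
      (?: \?\?['/] | \? (?! \? ['/] ) | (?> [^?'"\s]+ ) )
    )+
}x;

# Hypothetical sample input: a C statement whose string literal contains
# two backslash escapes and one trigraph escape (??/ stands for a backslash).
my $line = q{printf("say \"hi\"??/n", 'x');};

# In list context with /g, the capture group returns every match in order.
my @tokens = $line =~ /($REx)/g;

print "[$_]\n" for @tokens;
```

On this sample the double-quoted literal should come back as one token, escapes and trigraph included, with the surrounding punctuation split around it. Whitespace is never matched by any branch, so it simply separates tokens.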


japhy -- Perl and Regex Hacker

Re: Re: really large regex misbehaving - WTF
by Anonymous Monk on May 22, 2001 at 23:41 UTC

    My tests agree that this works. Thank you so very much! I will have to learn these extended regex functions better.... Also thanks for the tip on the explain package.

    These lookahead functions appear to go beyond the computational power of traditional regular expressions. (At least, I can't think of a way to implement them fully using normal regexes.) I am starting to wonder whether I was literally trying to do the impossible, though I suspect there is a "pure" regex that could do the job.