
I have constructed a regex that is far more readable and does the job, at least on some simple test cases from your post.

It breaks the regex into three parts: single-quoted strings, double-quoted strings, and all others. The single- and double-quoted string parts are very similar. The logic used is:

match a quote
match as many non-quote, non-backslash, non-question-mark characters as possible
then, as many times as possible...
- match a backslash or its trigraph ??/ followed by any one character, OR the ??' sequence, OR a ? that doesn't begin ??' or ??/
- match as many non-quote, non-backslash, non-question-mark characters as possible
match the ending quote
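
As a minimal sketch of the steps above, here is the single-quoted part on its own, anchored and run against a few strings. The sample inputs and the variable name `$single` are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Single-quoted branch only, following the steps listed above.
my $single = qr{
  '
  (?> [^'\\?]* )              # run of ordinary characters
  (?:
    (?:
      (?: \\ | \?\?/ ) .      # backslash, or its trigraph ??/, plus one character
      |
      \?\?'                   # the ??' sequence
      |
      \? (?! \? ['/] )        # a lone ? that starts neither ??' nor ??/
    )
    (?> [^'\\?]* )            # another run of ordinary characters
  )*
  '
}x;

# Hypothetical sample inputs, chosen only for illustration.
for my $input (q{'abc'}, q{'a\'b'}, q{'a??/'b'}, q{'what?'}, q{'unterminated}) {
    my $result = $input =~ /\A$single\z/ ? 'matches' : 'does not match';
    print "$input $result\n";
}
```

The first four inputs should match in full; the last one has no closing quote, so the branch fails and the alternation would move on.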

If neither quoted-string part can match, we use the third part.

one or more times, match...
- as long as we aren't about to match a // or /*...
- a ??' or ??/, OR a ? that doesn't begin one of those, OR one or more characters that are not question marks, quotes, or whitespace

This is a lengthy post, so...

$REx = qr{
  '
  (?> [^'\\?]* )
  (?:
    (?:
      (?: \\ | \?\?/ ) .
      |
      \?\?'
      |
      \? (?! \? ['/] )
    )
    (?> [^'\\?]* )
  )*
  '
  
  |
  
  "
  (?> [^"\\?]* )
  (?:
    (?:
      (?: \\ | \?\?/ ) .
      |
      \?\?'
      |
      \? (?! \? ['/] )
    )
    (?> [^"\\?]* )
  )*
  "

  |
  
  (?:
    (?! / [/*] )
    (?:
      \?\?['/]
      |
      \? (?! \? ['/] )
      |
      (?> [^?'"\s]+ )
    )
  )+
  
}x;
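
Since the two quoted-string parts differ only in the quote character, one possible way to avoid keeping them in sync by hand is to generate both from a single template. This is only a sketch: `quoted_branch` is a hypothetical helper, not something from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the quoted-string branch for a given quote
# character, so both branches share one definition.
sub quoted_branch {
    my ($q) = @_;
    # Run of characters that are not the quote, a backslash, or a ?
    my $plain = qr{ (?> [^$q\\?]* ) }x;
    return qr{
      $q
      $plain
      (?:
        (?:
          (?: \\ | \?\?/ ) .
          |
          \?\?'
          |
          \? (?! \? ['/] )
        )
        $plain
      )*
      $q
    }x;
}

my ($single, $double) = map { quoted_branch($_) } q{'}, q{"};

# A double-quoted string with two escaped quotes should be taken as one token.
my $sample = q{"a\"b\"c"};
print "whole string matched\n" if $sample =~ /\A$double\z/;
```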

And here's the explain output:

(?x-ims:               # group, but do not capture (disregarding
                       # whitespace and comments) (case-sensitive)
                       # (with ^ and $ matching normally) (with . not
                       # matching \n):

  '                      # '\''

  (?>                    # match (and do not backtrack afterwards):

    [^'\\?]*               # any character except: ''', '\\', '?' (0
                           # or more times (matching the most amount
                           # possible))

  )                      # end of look-ahead

  (?x:                   # group, but do not capture (0 or more times
                         # (matching the most amount possible)):

    (?x:                   # group, but do not capture:

      (?x:                   # group, but do not capture:

        \\                     # '\'

       |                      # OR

        \?                     # '?'

        \?                     # '?'

        /                      # '/'

      )                      # end of grouping

      .                      # any character except \n

     |                      # OR

      \?                     # '?'

      \?                     # '?'

      '                      # '\''

     |                      # OR

      \?                     # '?'

      (?!                    # look ahead to see if there is not:

        \?                     # '?'

        ['/]                   # any character of: ''', '/'

      )                      # end of look-ahead

    )                      # end of grouping

    (?>                    # match (and do not backtrack afterwards):

      [^'\\?]*               # any character except: ''', '\\', '?'
                             # (0 or more times (matching the most
                             # amount possible))

    )                      # end of look-ahead

  )*                     # end of grouping

  '                      # '\''

 |                      # OR

  "                      # '"'

  (?>                    # match (and do not backtrack afterwards):

    [^"\\?]*               # any character except: '"', '\\', '?' (0
                           # or more times (matching the most amount
                           # possible))

  )                      # end of look-ahead

  (?x:                   # group, but do not capture (0 or more times
                         # (matching the most amount possible)):

    (?x:                   # group, but do not capture:

      (?x:                   # group, but do not capture:

        \\                     # '\'

       |                      # OR

        \?                     # '?'

        \?                     # '?'

        /                      # '/'

      )                      # end of grouping

      .                      # any character except \n

     |                      # OR

      \?                     # '?'

      \?                     # '?'

      '                      # '\''

     |                      # OR

      \?                     # '?'

      (?!                    # look ahead to see if there is not:

        \?                     # '?'

        ['/]                   # any character of: ''', '/'

      )                      # end of look-ahead

    )                      # end of grouping

    (?>                    # match (and do not backtrack afterwards):

      [^"\\?]*               # any character except: '"', '\\', '?'
                             # (0 or more times (matching the most
                             # amount possible))

    )                      # end of look-ahead

  )*                     # end of grouping

  "                      # '"'

 |                      # OR

  (?x:                   # group, but do not capture (1 or more times
                         # (matching the most amount possible)):

    (?!                    # look ahead to see if there is not:

      /                      # '/'

      [/*]                   # any character of: '/', '*'

    )                      # end of look-ahead

    (?x:                   # group, but do not capture:

      \?                     # '?'

      \?                     # '?'

      ['/]                   # any character of: ''', '/'

     |                      # OR

      \?                     # '?'

      (?!                    # look ahead to see if there is not:

        \?                     # '?'

        ['/]                   # any character of: ''', '/'

      )                      # end of look-ahead

     |                      # OR

      (?>                    # match (and do not backtrack
                             # afterwards):

        [^?'"\s]+              # any character except: '?', ''', '"',
                               # whitespace (\n, \r, \t, \f, and " ")
                               # (1 or more times (matching the most
                               # amount possible))

      )                      # end of look-ahead

    )                      # end of grouping

  )+                     # end of grouping

)                      # end of grouping

japhy -- Perl and Regex Hacker

In reply to Re: really large regex misbehaving - WTF by japhy
in thread really large regex misbehaving by Anonymous Monk
