in reply to Re: Text::ParseWords regex doesn't work when text is too long? (fixes)
in thread Text::ParseWords regex doesn't work when text is too long?

Unless they did something recently to radically break backward compatibility, [^\1\\] means "anything except a control-A or a backslash".

Or, in the words of Inigo Montoya in The Princess Bride, "I do not think it means what you think it means".

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.


Update: verified that:
"\1" =~ /[^\1]/
fails, while
"\1X" =~ /[^\1]/
succeeds in Perl 5.8, validating my original hypothesis at least for the latest public Perl release.
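
For anyone who wants to repeat that check, here is a minimal sketch as a standalone script. The list of test strings is my own choice for illustration. It assumes, as verified above, that \1 inside a bracketed character class is the octal escape for chr(1) rather than a backreference.

```perl
use strict;
use warnings;

# Inside a bracketed character class, \1 is read as the octal escape for
# chr(1) (control-A), not as a backreference to the first capture group.
# So /[^\1\\]/ matches any single character except control-A or backslash.
my @results;
for my $string ("\1", "\1X", "'", "\\") {
    # 1 if the string contains at least one allowed character, else 0.
    push @results, ($string =~ /[^\1\\]/ ? 1 : 0);
}
print "@results\n";
```

Note that the lone single quote is accepted by the class, which is the point: the class does nothing to exclude the opening quote character.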

regex bottom line?
by edan (Curate) on May 12, 2003 at 09:58 UTC

    So, assuming that I'll need to roll my own parse_line by modifying the regex... what regex will provide the same functionality but work for arbitrarily large strings?

    Since I still don't really understand what /(?!\1)[^\\]/ does, I am having trouble with this... I reason that it should match anything that's not a quote (whichever quote was opened at the start of the match), but I don't see how it does this...

    Should I use tye's first regex? I also don't get how /((?:\\.|[^'"\\]+|(?!\1)['"])*)/ works...
    Does
    /[^'"\\]+|(?!\1)['"]/
    do the same thing as
    /(?!\1)[^\\]/
    ?

    --
    3dan

      The only method that supports arbitrary strings is the last one, as I demonstrated.

      Does
      /[^'"\\]+|(?!\1)['"]/
      do the same thing as
      /(?!\1)[^\\]/
      ?

      No. But /[^'"\\]|(?!\1)['"]/ (note that I removed the "+") and /(?!\1)[^\\]/ are the same (provided \1 is either "'" or '"'). That is, they each match a single character that is not a backslash (\), nor the same as the quote character in \1.
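
A rough way to convince yourself of that equivalence is to try every single-byte character after each kind of opening quote and compare the two forms. This is only a sketch: the leading (['"]) group and the \A and \z anchors are scaffolding I added so that \1 has something to refer to, and they are not part of either regex as quoted.

```perl
use strict;
use warnings;

# For each opening quote and each character code 0..255, check whether the
# two single-character forms agree on accepting that character.
my $mismatches = 0;
for my $quote ("'", '"') {
    for my $code (0 .. 255) {
        my $string = $quote . chr($code);

        # Form 1: any non-backslash character that is not the opening quote.
        my $lookahead_form = $string =~ /\A(['"])(?:(?!\1)[^\\])\z/ ? 1 : 0;

        # Form 2: a non-quote, non-backslash character, or the other quote.
        my $alternation_form =
            $string =~ /\A(['"])(?:[^'"\\]|(?!\1)['"])\z/ ? 1 : 0;

        $mismatches++ if $lookahead_form != $alternation_form;
    }
}
print "mismatches: $mismatches\n";
```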

      Since the regex is matching zero or more occurrences of X or Y or Z, it also works to match zero or more occurrences of X or Y+ or Z.

      Replacing Y with Y+ means we can grab tons of "uninteresting" characters quickly so that we don't have to loop through the surrounding (?: ... )* so many times (since we've seen that we are only allowed to loop through it 32k times).
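
To illustrate the difference, here is a small sketch that runs both forms over one long quoted field. The input string, the leading (["']) capture, and the \A and \z anchors are assumptions made for the demonstration and not the exact regex from Text::ParseWords. Whether the per-character form copes with input this long depends on the iteration limit of the Perl you run it on, so the script only reports what happened.

```perl
use strict;
use warnings;

# Hypothetical input: a double-quoted field holding 100_000 plain characters.
my $body  = 'x' x 100_000;
my $input = qq{"$body"};

# Chunked form: [^'"\\]+ swallows a whole run of ordinary characters in one
# pass through the outer (?: ... )* loop, so this input needs very few passes.
my $chunked_ok = 0;
if ($input =~ /\A(["'])((?:\\.|[^'"\\]+|(?!\1)['"])*)\1\z/s) {
    $chunked_ok = length($2) == length($body) ? 1 : 0;
}

# Per-character form: one pass through the outer loop for every character.
# On a Perl with a low iteration limit this may warn and fail to match.
my $per_char_ok = $input =~ /\A(["'])((?:\\.|(?!\1)[^\\])*)\1\z/s ? 1 : 0;

print "chunked form matched whole field: $chunked_ok\n";
print "per-character form matched whole field: $per_char_ok\n";
```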

                      - tye