Re: Text::ParseWords regex doesn't work when text is too long? (fixes)

This isn't too hard to fix:

    my( $quote, $quoted, $end )=  $string =~
        /(['"])((?:\\.|[^'"\\]+|(?!\1)['"])*)(\1?)/;
    die "Unclosed quote: $quote$quoted\n"
        if  $quote  &&  ! $end;
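To see what that regex captures, here is a minimal sketch that wraps it in a sub and runs it on a small string. The name `first_quoted` and the sample input are made up for illustration; the regex itself is the one above.

```perl
use strict;
use warnings;

# Hypothetical helper around the regex above: returns the quote character
# and the text between the quotes, or dies if the quote is never closed.
sub first_quoted {
    my ($string) = @_;
    my ( $quote, $quoted, $end ) = $string =~
        /(['"])((?:\\.|[^'"\\]+|(?!\1)['"])*)(\1?)/;
    return unless defined $quote;                 # no quote in the string
    die "Unclosed quote: $quote$quoted\n" if $end eq '';
    return ( $quote, $quoted );
}

# The other kind of quote and backslash-escaped quotes stay inside the body.
my ( $q, $text ) = first_quoted(q{say "it's \"fine\"" now});
print "$q|$text\n";    # prints: "|it's \"fine\"
```

This only shows what gets captured; it does nothing about the iteration limit on the `(?: ... )*` group discussed in this thread.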

You cannot, however, use the simpler:

    /(['"])((?:\\.|[^\1\\]+)*)(\1?)/

because inside a character class \1 is not a backreference (see merlyn's reply below), which is presumably why the original regular expression goes out of its way to avoid it.

If you have a string that contains a huge run of backslash-escaped characters, then you might have to add a + to that part of the regex as well. Written with [^\1], which does not work, that would be:

    /(['"])((?:(?:\\.)+|[^\1\\]+)*)(\1?)/

so use this corrected one instead:

    /(['"])((?:(?:\\.)+|[^'"\\]+|(?!\1)['"])*)(\1?)/

Though that still breaks on

    "'" . '\vv'x35_000 . "z'"

which would force you to do something more like (updated):

    my( $quote, $quoted );
    if(  $str =~ /(['"])/g  ) {
        my $beg= pos($str);
        $quote= $1;
        if(  $str !~ /(?<!\\)((?:\\\\)*)\Q$quote/g  ) {
            die "Unclosed quote: ", substr($str,$beg), $/;
        }
        my $end= pos($str);
        $quoted= substr( $str, $beg, $end-$beg-1 );
    }
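As a rough sketch of how the pos()-based scan might be packaged, the same logic can go in a sub and be run against the string that broke the earlier regexes. The name `scan_quoted` is hypothetical; the two matches are the ones from the code above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pos()-based scan above.
sub scan_quoted {
    my ($str) = @_;
    return unless $str =~ /(['"])/g;    # scalar-context /g match sets pos($str)
    my $quote = $1;
    my $beg   = pos($str);              # offset just past the opening quote
    # The closing quote is the same character, preceded by an even
    # number (possibly zero) of backslashes.
    if ( $str !~ /(?<!\\)((?:\\\\)*)\Q$quote/g ) {
        die "Unclosed quote: ", substr( $str, $beg ), "\n";
    }
    my $end = pos($str);                # offset just past the closing quote
    return ( $quote, substr( $str, $beg, $end - $beg - 1 ) );
}

my $big = "'" . '\vv' x 35_000 . "z'";
my ( $quote, $quoted ) = scan_quoted($big);
print length($quoted), "\n";            # prints: 105001
```

Because the body is never matched by a repeated alternation, the length of the quoted text does not affect how many times any group has to loop.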

Update: Thanks, merlyn. I knew that [^\1] had failed in my previous testing, but I had run into people thinking it should work often enough that, when it "worked" in a test case that didn't exercise that part of the regex at all, I jumped to the wrong conclusion.

- tye

•Re: Re: Text::ParseWords regex doesn't work when text is too long? (fixes) by merlyn (Sage) on May 11, 2003 at 18:47 UTC
Unless they did something recently to radically break backward compatibility, `[^\1\\]` means "anything except a control-A or a backslash". In other words, in the words of Inigo Montoya in The Princess Bride, "I don't think that means what you think that means".

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Update: verified that `"\1" =~ /[^\1]/` fails, while `"\1X" =~ /[^\1]/` succeeds in Perl 5.8, validating my original hypothesis at least for the latest public Perl release.
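A small sketch of that check, plus what it means for the "simpler" regex from the parent node. The variable names and the sample string `'a'b` are invented for illustration.

```perl
use strict;
use warnings;

# Inside a bracketed class, \1 is the octal escape for chr(1) (control-A),
# not a backreference to the first capture group.
my $ctrl_a_rejected = ( "\x01"  =~ /[^\1]/ ) ? 0 : 1;
my $other_accepted  = ( "\x01X" =~ /[^\1]/ ) ? 1 : 0;

# So the class does not exclude the opening quote, and the body
# runs straight past what should have been the closing quote.
my ( $quote, $quoted ) = q{'a'b} =~ /(['"])((?:\\.|[^\1\\]+)*)/;

print "$ctrl_a_rejected $other_accepted [$quoted]\n";    # prints: 1 1 [a'b]
```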
regex bottom line? by edan (Curate) on May 12, 2003 at 09:58 UTC
So, assuming that I'll need to roll my own `parse_line` by modifying the regex... what regex will provide the same functionality but work for arbitrarily large strings?

Since I still don't really understand what `/(?!\1)[^\\]/` does, I am having trouble with this... I reason that it should match anything that's not a quote (whichever quote was opened at the start of the match), but I don't see how it does this...

Should I use tye's first regex? I also don't get how `/((?:\\.|[^'"\\]+|(?!\1)['"])*)/` works... Does `/[^'"\\]+|(?!\1)['"]/` do the same thing as `/(?!\1)[^\\]/`?

-- 3dan
Re: regex bottom line? (bottom method) by tye (Sage) on May 12, 2003 at 16:13 UTC
The only method that supports arbitrary strings is the last one, as I demonstrated.

"Does `/[^'"\\]+|(?!\1)['"]/` do the same thing as `/(?!\1)[^\\]/`?"

No. But `/[^'"\\]|(?!\1)['"]/` (note that I removed the "+") and `/(?!\1)[^\\]/` are the same (provided \1 is either `"'"` or `'"'`). That is, they each match a single character that is neither a backslash (\) nor the quote character in \1.

Since the regex is matching zero or more occurrences of X or Y or Z, it also works to match zero or more occurrences of X or Y+ or Z. Replacing Y with Y+ means we can grab tons of "uninteresting" characters quickly so that we don't have to loop through the surrounding `(?: ... )*` so many times (since we've seen that we are only allowed to loop through it 32k times).

- tye
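If you'd rather check the single-character equivalence than take it on faith, a brute-force sketch along these lines should do. It tries every character from chr(0) to chr(255) after each opening quote and counts disagreements; the variable names and the test harness are my own, not anything from Text::ParseWords.

```perl
use strict;
use warnings;

my $mismatches = 0;
for my $q ( "'", '"' ) {
    for my $code ( 0 .. 255 ) {
        my $str = $q . chr($code);    # opening quote plus one candidate character
        # Long form: a non-quote, non-backslash character, or the *other* quote.
        my $long  = $str =~ /\A(['"])(?:[^'"\\]|(?!\1)['"])\z/ ? 1 : 0;
        # Short form: not the opening quote, and not a backslash.
        my $short = $str =~ /\A(['"])(?!\1)[^\\]\z/ ? 1 : 0;
        $mismatches++ if $long != $short;
    }
}
print "mismatches: $mismatches\n";    # prints: mismatches: 0
```

The `(?!\1)` lookahead is what rules out the opening quote: it consumes nothing, it just refuses to go on if the next character is whatever group 1 captured.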