edan has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

After spending quite a while running and re-running my program with perl -d, I finally narrowed down this problem I'm having with Text::ParseWords, whereby the comma-separated, quoted string I'm passing to parse_line doesn't get parsed (I get back an empty list).

Can someone please tell me why the following code behaves this way?

my $string = "'" . 'v' x 35_000 . "z'";
print "length of the string is: ", length($string), "\n";

my ($quote, $quoted);
($quote, $quoted) = $string =~ m/^(["'])(.*)\1/;
print "dot-star works!\n" if $quoted;

# a copy of part of the Text::ParseWords::parse_line() regex
($quote, $quoted) = $string =~ m/^(["'])((?:\\.|(?!\1)[^\\])*)\1/;
print "Text::ParseWords fails!\n" unless $quoted;

I don't know if the results are system-dependent, but if I shorten the string to 30,000 characters, the second regex works, too. Is there some sort of size limit on zero-width negative look-ahead assertions or character classes that I didn't know about?

Any ideas?

--
3dan

Re: Text::ParseWords regex doesn't work when text is too long?
by PodMaster (Abbot) on May 11, 2003 at 17:08 UTC
    Turn on warnings and you'll get Complex regular subexpression recursion limit (32766) exceeded (Check perldiag for more info).
    It's best you find a simpler way of tokenizing (split on something like /(?<!\\)"/).
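
A minimal sketch of how the warning can be observed from inside a program. The helper name match_with_warnings is made up for illustration, and whether the limit is actually hit depends on the Perl version (newer perls may not complain at this length, in which case nothing is captured):

```perl
use strict;
use warnings;

# Hypothetical helper: run the parse_line-style pattern on $string and
# collect any warnings it emits instead of letting them go to STDERR.
sub match_with_warnings {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my ($quote, $quoted) = $string =~ m/^(["'])((?:\\.|(?!\1)[^\\])*)\1/;
    return ($quoted, \@warnings);
}

my $string = "'" . 'v' x 35_000 . "z'";
my ($quoted, $warnings) = match_with_warnings($string);
print defined $quoted ? "matched\n" : "no match\n";
print "warning: $_" for @$warnings;
```

Running the original script under perl -w, or with use warnings at the top, should surface the same message without any handler.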



Re: Text::ParseWords regex doesn't work when text is too long? (fixes)
by tye (Sage) on May 11, 2003 at 17:44 UTC

    This isn't too hard to fix:

    my( $quote, $quoted, $end )= $string =~
        /(['"])((?:\\.|[^'"\\]+|(?!\1)['"])*)(\1?)/;
    die "Unclosed quote: $quote$quoted\n"
        if  $quote  &&  ! $end;
    You can also use the simpler (update: no, you cannot, as merlyn's reply below explains):
    /(['"])((?:\\.|[^\1\\]+)*)(\1?)/
    but I suspect that [^\1] didn't work in some slightly older versions of Perl (especially since the original regular expression goes out of its way to avoid it).

    If you have a string that contains a huge sequence of backslash-escaped characters, then you might have to add a + to that part of the regex as well:

    /(['"])((?:(?:\\.)+|[^\1\\]+)*)(\1?)/
    Update: rather, use this corrected one:
    /(['"])((?:(?:\\.)+|[^'"\\]+|(?!\1)['"])*)(\1?)/
    Though that still breaks on
    "'" . '\vv'x35_000 . "z'"
    which would force you to do something more like (updated):
    my( $quote, $quoted );
    if(  $str =~ /(['"])/g  ) {
        my $beg= pos($str);
        $quote= $1;
        if(  $str !~ /(?<!\\)((?:\\\\)*)\Q$quote/g  ) {
            die "Unclosed quote: ", substr($str,$beg), $/;
        }
        my $end= pos($str);
        $quoted= substr( $str, $beg, $end-$beg-1 );
    }
    (:
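
Another way to sidestep the quantified group entirely is to walk the string one token at a time with \G and /gc, so that no single match ever has to repeat a complex subexpression thousands of times. This is only a sketch: extract_quoted is a hypothetical helper, not part of Text::ParseWords, and it assumes a backslash escapes any following character (including a newline, hence the /s):

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between the first quote in $str and
# its matching close, or undef if the quote is never closed.
sub extract_quoted {
    my ($str) = @_;
    return undef unless $str =~ /(['"])/g;    # find the opening quote, sets pos()
    my $quote = $1;
    my $beg   = pos($str);
    while (1) {
        $str =~ /\G[^'"\\]+/gc;               # skip a run of ordinary characters
        next if $str =~ /\G\\./gcs;           # skip one backslash-escaped character
        if ($str =~ /\G(['"])/gc) {
            next if $1 ne $quote;             # the other kind of quote is just text
            return substr($str, $beg, pos($str) - $beg - 1);
        }
        return undef;                         # end of input, or a lone trailing backslash
    }
}

my $big = "'" . '\vv' x 35_000 . "z'";
my $quoted = extract_quoted($big);
print "got ", length($quoted), " characters\n" if defined $quoted;
```

Each pass through the loop is a separate, small match anchored at the current pos(), so the length of the input only affects how many times the loop runs.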

    Update: Thanks, merlyn. I knew that had failed in my previous testing but had also run into people thinking it should work enough times that when it "worked" in my test case that didn't test that part of it at all, I jumped to the wrong conclusion.

                    - tye
      Unless they did something recently to radically break backward compatibility, [^\1\\] means "anything except a control-A or a backslash".

      In other words, in the words of Inigo Montoya in The Princess Bride, "I don't think that means what you think that means".

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.


      Update: verified that:
      "\1" =~ /[^\1]/
      fails, while
      "\1X" =~ /[^\1]/
      succeeds in Perl 5.8, validating my original hypothesis at least for the latest public Perl release.

        So, assuming that I'll need to roll my own parse_line by modifying the regex... what regex will provide the same functionality but work for arbitrarily large strings?

        Since I still don't really understand what /(?!\1)[^\\]/ does, I am having trouble with this... I reason that it should match anything that's not a quote (whichever quote was opened at the start of the match), but I don't see how it does this...

        Should I use tye's first regex? I also don't get how /((?:\\.|[^'"\\]+|(?!\1)['"])*)/ works...
        Does
        /[^'"\\]+|(?!\1)['"]/
        do the same thing as
        /(?!\1)[^\\]/
        ?

        --
        3dan
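
On that last question, one way to check is to compare the two alternatives on every possible single character. /(?!\1)[^\\]/ reads as "the next character is not the opening quote, and it is not a backslash"; the split form says "any run of characters that are neither quotes nor backslashes, or a quote that is not the opening one". A minimal sketch (count_differences is a made-up name, and it only covers characters 0 to 255):

```perl
use strict;
use warnings;

# Count the characters on which the two alternatives disagree. The opening
# quote is captured as group 1 so that \1 refers to it in both patterns.
sub count_differences {
    my $differences = 0;
    for my $open (q{'}, q{"}) {
        for my $code (0 .. 255) {
            my $input     = $open . chr($code);
            my $lookahead = $input =~ /^(['"])(?!\1)[^\\]\z/            ? 1 : 0;
            my $split     = $input =~ /^(['"])(?:[^'"\\]|(?!\1)['"])\z/ ? 1 : 0;
            $differences++ if $lookahead != $split;
        }
    }
    return $differences;
}

print "characters on which the two forms differ: ", count_differences(), "\n";
```

The two forms accept the same characters. The practical difference is that [^'"\\]+ swallows a whole run of ordinary characters in one pass through the outer (...)*, so the outer group repeats far fewer times.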
Re: Text::ParseWords regex doesn't work when text is too long?
by benn (Vicar) on May 11, 2003 at 17:16 UTC
    Indeed - happens for me too - that Text::ParseWords regex doesn't like strings longer than 32,768 - core dumps on a RedHat box (Perl 5.8) and just plain fails on a Windows ActiveState 5.8 build.

    :( Ben.

    PS - I notice search.cpan.org has Text::ParseWords 3.1, while my builds have 3.21 ...has it been 'rolled back'? or just in the CORE...doh!

    Update: or just listen to da PodMasta