Pic has asked for the wisdom of the Perl Monks concerning the following question:

OK, (mostly) as an intellectual exercise I'm working on writing a simple tokenizer using Perl and regexes. Never mind that this might be better done by reading the input in a linear fashion and using some RPN or something like that; I'm using regexes (mainly because I'm rubbish at using 'em). Now, the problem: when splitting the input line into quoted and unquoted data I use the following regex:
m/([^"']+|(?:"(?:[^"]|\\")*"))/g
Now, wanting to make this thing able to use both " and ' as quote characters I tried this:
m/([^"']+|(?:(["'])(?:[^\2]|\\\2)*\2))/g
This however has the rather unfortunate side effect of creating a lot of empty matches as well as (for some reason) matching the closing quote twice (the last time as a single character all by itself).
Can anyone offer some insight as to what on earth I'm doing wrong here?

Thanks in advance,
Arne
:wq

Replies are listed 'Best First'.
Re: Regex weirdness?
by hv (Prior) on Mar 15, 2005 at 01:31 UTC

    I think the problem is in [^\2] - you can't interpolate matches into character classes this way. One way to get around that is to use a negative lookahead instead.

    Also, this will have problems with 'Don\'t do this', since the backslash will be matched by the character class, which means the escaped quote ends up terminating the string. Check for backslashes first to avoid that:

    m{( [^"']+ | (["']) (?: \\ . | (?!\2) . )* \2 )}xgs
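A minimal sketch of how the pattern above behaves on an escaped quote. The sample input is made up for illustration, and the loop keeps only $1 so that the quote captured in $2 does not end up in the token list.

```perl
use strict;
use warnings;

# Hugo's pattern from above, unchanged.
my $token = qr{( [^"']+ | (["']) (?: \\ . | (?!\2) . )* \2 )}xs;

# Hypothetical sample input; q{} keeps the backslash literally.
my $input = q{say 'Don\'t do this' and "it's fine"};

my @tokens;
while ($input =~ /$token/g) {
    push @tokens, $1;    # whole token only; $2 is just the quote character
}

print "[$_]\n" for @tokens;
```

The backslash branch comes first in the alternation, so \' is consumed as a pair before the lookahead ever sees the quote.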

    Hugo

      Hmm. That regex looks far better than mine. I really should learn to use the /x modifier when dealing with regexes of a certain complexity. And yeah, the quote problem has occurred to me as being a problem (which is why the linear approach is looking more and more appealing TBH).
      Also, that regex seems to have the problem with recapturing the closing quote I had, as well as getting errors about using undefined values in a match (which I suppose is due to \2 being unset in the first branch).
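A likely explanation for both symptoms, sketched under the assumption that the match is being run in list context: m//g then returns every capture group of every match, so each token is followed by the contents of group 2 (the quote, or undef when the unquoted branch matched). The sample input is made up.

```perl
use strict;
use warnings;

my $token = qr{( [^"']+ | (["']) (?: \\ . | (?!\2) . )* \2 )}xs;
my $input = q{print "hi" ;};

# List context: every match contributes BOTH groups, ($1, $2).
my @raw = $input =~ /$token/g;
# @raw is ('print ', undef, '"hi"', '"', ' ;', undef)

# Scalar context: step through the matches and keep only $1.
my @tokens;
while ($input =~ /$token/g) {
    push @tokens, $1;
}
# @tokens is ('print ', '"hi"', ' ;')

print scalar(@raw), " raw elements, ", scalar(@tokens), " tokens\n";
```

The lone quote and the undefined values come from group 2, not from the regex matching the closing quote a second time.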
Re: Regex weirdness?
by Roy Johnson (Monsignor) on Mar 15, 2005 at 01:03 UTC
    If it takes the first alternative, $2 is an empty match. You might want to split this into two separate matches with \G anchors. Or make the first + a * and remove the bar. Depends on what you need to happen.

    Caution: Contents may have been coded under pressure.
      Of course. What I'm currently using is this:
      m/([^"']+|(?:"(?:[^"]|\\")*")|(?:'(?:[^']|\\')*'))/g
      Which does what I want. I'll look into the \G variant and see if that makes sense in my head (I have occasional problems with wrapping my head around regex stuff). My intent is to split a block of text into a list of elements, alternating between a quoted string (with the quotes) and a non-quoted string. For example the string print ( "some stuff", $more_stuff, "final stuff" ); should become this:
      @list = ( q/print ( /, q/"some stuff"/, q/, $more_stuff, /, q/"final stuff"/, q/ );/ )
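As a quick check, this sketch runs the pattern above on that example line. Because the pattern has a single capturing group, list context returns exactly one element per match.

```perl
use strict;
use warnings;

my $line = q{print ( "some stuff", $more_stuff, "final stuff" );};

# One capture group only, so list context yields one element per match.
my @list = $line =~ m/([^"']+|(?:"(?:[^"]|\\")*")|(?:'(?:[^']|\\')*'))/g;

print "q/$_/\n" for @list;
```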
        To get a backreference to a quote, you have to put the quote in parens, which means it is going to be returned as a separate group. So I think you're going to have to stay with the separate alternatives for each type of quote.

        The /x option is absolutely straightforward: any whitespace within your regex is ignored. So you can pretty it up as you like. You can also put comments in it. I recommend you jump right into using it.

        The \G anchor tells the pattern to resume looking from where it last left off with the string. I don't think it's going to help you with what you're trying to do here.
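For reference, a sketch of what a \G-anchored version might look like, with one small match per token type. It is not needed for the single-regex approach; the /c flag (keep the position when a match fails) and the sample input are choices made for this illustration.

```perl
use strict;
use warnings;

my $input = q{print "a\"b", 'c';};    # hypothetical sample input
my @tokens;

for ($input) {    # alias $_ so all the matches share one pos()
    while (1) {
        if (/\G ( [^"']+ ) /gcx) {
            push @tokens, $1;         # unquoted run
        }
        elsif (/\G ( " (?: \\. | [^"\\] )* " ) /gcxs) {
            push @tokens, $1;         # double-quoted string
        }
        elsif (/\G ( ' (?: \\. | [^'\\] )* ' ) /gcxs) {
            push @tokens, $1;         # single-quoted string
        }
        else {
            last;                     # end of input, or an unterminated quote
        }
    }
}

print "[$_]\n" for @tokens;
```

Each branch starts exactly where the previous match ended, which is the "linear" reading of the input expressed with regexes.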

        I notice that the backslash-protection of quotes doesn't work with your pattern. Consider that, within quotes, you will accept backslash followed by any character, and any run of non-backslash, non-quote characters. Or, you will accept a minimal match of any character leading up to a quote that is not preceded by a backslash. I illustrate both of these here (along with the use of /x):

        my @matches = m/([^"']+
            |(?: " (?:\\.|[^\\"]+)* " )    # Double quote
            |(?: ' .*? (?<!\\)' )
            )/gx;
        Update: note that the second version will not recognize that \\' does not protect the quote.
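A small sketch of that caveat. The variable names and the sample string are made up; the single-quoted string here ends in an escaped backslash, so its final quote really does close it.

```perl
use strict;
use warnings;

# The lookbehind variant and an escape-aware variant for single quotes.
my $sq_lookbehind = qr{ ' .*? (?<!\\) ' }x;
my $sq_escape     = qr{ ' (?: \\. | [^\\']+ )* ' }x;

# In q{}, each \\ is one backslash, so this is:  'a\\' + 'b'
my $s = q{'a\\\\' + 'b'};

my ($lookbehind) = $s =~ /($sq_lookbehind)/;
my ($escape)     = $s =~ /($sq_escape)/;

print "lookbehind: $lookbehind\n";
print "escape:     $escape\n";
```

The lookbehind variant sees a backslash before the closing quote, skips it, and runs on to the next quote; the escape-aware variant consumes the two backslashes as a pair and stops where the string actually ends.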

        Caution: Contents may have been coded under pressure.
Re: Regex weirdness?
by ihb (Deacon) on Mar 15, 2005 at 01:33 UTC

    [^\2] does not match everything except what was captured by the second capturing group. Inside a character class \2 is an octal escape rather than a backreference, so [^\2] matches everything except the character with code 2. What you need to use is a negative lookahead combined with an any-match: (?!\2)(?s:.).
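A minimal sketch of the difference, using two throwaway patterns (the names are made up) that each try to match an opening quote followed by one character that is not that quote.

```perl
use strict;
use warnings;

# Group 2 captures the quote; the next character must "not be the quote".
my $class     = qr/^((["'])[^\2])$/;           # \2 here means chr(2)
my $lookahead = qr/^((["'])(?!\2)(?s:.))$/;    # \2 here is the backreference

my $class_accepts_quote     = (q{''}   =~ $class)     ? 1 : 0;    # 1: the bug
my $class_rejects_chr2      = ("'\x02" =~ $class)     ? 0 : 1;    # 1
my $lookahead_accepts_quote = (q{''}   =~ $lookahead) ? 1 : 0;    # 0: fixed
my $lookahead_accepts_other = (q{'x}   =~ $lookahead) ? 1 : 0;    # 1

print "class accepts a second quote:     $class_accepts_quote\n";
print "lookahead accepts a second quote: $lookahead_accepts_quote\n";
```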

    Note that even with this fix your pattern doesn't handle escaped escape characters.

    ihb

    See perltoc if you don't know which perldoc to read!

Re: Regex weirdness? (Benchmark)
by Anonymous Monk on Mar 15, 2005 at 11:58 UTC
    I've benchmarked four regexes. First the one that you claimed to be your original (which actually isn't quite correct, as it doesn't deal with backslashes correctly). Then Hugo's suggestion. Third, the standard unrolling technique - but one that has different cases for double and single quotes, and finally, one that uses unrolling, and doesn't have different cases for single and double quotes.
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Benchmark qw /cmpthese/;

    our $orig   = qr {( [^"']+ | (?:"(?:[^"]|\\")*") | (?:'(?:[^']|\\')*') )}xs;
    our $hv     = qr {( [^"']+ | (["']) (?: \\ . | (?!\2) . )* \2 )}xs;
    our $unroll = qr {( [^"']+ | " [^"]* (?: \\. [^"]*)* " | ' [^']* (?: \\. [^']*)* ' )}xs;
    our $code   = qr {( [^"']+ | (["']) (??{ "[^$2]*(?:\\\\.[^$2]*)*" }) \2 )}xs;

    our $str = `cat /tmp/pp.c`;   # pp.c from the perl 5.8.5 sources.

    cmpthese(-10, {
        orig   => 'while ($str =~ /$orig/g) {1}',
        hv     => 'while ($str =~ /$hv/g) {1}',
        unroll => 'while ($str =~ /$unroll/g) {1}',
        code   => 'while ($str =~ /$code/g) {1}',
    });

    __END__
             Rate   orig     hv   code unroll
    orig   13.0/s     --    -3%   -92%   -96%
    hv     13.4/s     3%     --   -92%   -96%
    code    166/s  1179%  1139%     --   -52%
    unroll  344/s  2546%  2464%   107%     --
    Note the huge benefits of unrolling.
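A sketch of the unrolling shape, normal* (special normal*)*, for readers who have not met it. One assumption differs from the benchmarked $unroll: the "normal" class here also excludes the backslash, which is the usual form and is what lets \" and \' stay inside the string. This variant was not part of the timings above.

```perl
use strict;
use warnings;

# normal = any character except the quote and backslash; special = \\.
my $unrolled = qr{
    ( [^"']+
    | " [^"\\]* (?: \\. [^"\\]* )* "
    | ' [^'\\]* (?: \\. [^'\\]* )* '
    )
}xs;

my $input = q{x = "a\"b" . 'c\'d';};    # hypothetical sample input
my @tokens;
push @tokens, $1 while $input =~ /$unrolled/g;

print "[$_]\n" for @tokens;
```

Because normal and special can never match at the same position, the engine has no alternative ways to split the string and so has little to backtrack over.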
Re: Regex weirdness?
by chas (Priest) on Mar 15, 2005 at 01:13 UTC
    If I feed either aaa"bbb" or "aaa"bbb to your initial match, $1 is respectively aaa or "aaa" and $2 is empty. Can you give an example of what your regexp does?
    chas
    (Update: Now copying and pasting your first regexp again it appears that in the cases I mentioned $1 is "bbb" or bbb and $2 is empty, but $` is aaa or "aaa". Did your first regexp change? What is really confusing me is that if I set @results=m/([^"']+|(?:"(?:[^"]|\\")*"))/g for the cases I mentioned, I do get aaa and "bbb" (or "aaa" and bbb.) I don't see why the match operator doesn't seem to return $1 and $2. Perhaps I should give up for the day...)
    (Update2: To be explicit about what I mean - I ran the script:
    $_=<>;
    chomp;
    @results=m/([^"']+|(?:"(?:[^"]|\\")*"))/g;
    print "$results[0], $results[1]\n";
    print "$1, $2\n";
    print "$`, $1, $2\n";
    and entered aaa"bbb". The result was:
    aaa, "bbb"
    "bbb", 
    aaa, "bbb", 
    Seems bizarre to me, but perhaps I'm missing something obvious.
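A sketch of what is probably going on, with the input hard-coded instead of read from STDIN. The pattern has only one capturing group, since every (?:...) is non-capturing, so $2 is never set. In list context m//g returns $1 from every match, while $1 and $` afterwards describe only the last successful match.

```perl
use strict;
use warnings;

$_ = q{aaa"bbb"};
my @results = m/([^"']+|(?:"(?:[^"]|\\")*"))/g;

# @results collects $1 from each match: ('aaa', '"bbb"').
# $1 and $` now belong to the LAST successful match only.
my ($last, $before) = ($1, $`);

print "results: @results\n";
print "last \$1: $last, text before it: $before\n";
```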