Denis.Beurive has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am using « negative lookbehind » pattern matching, and I get a very strange result.

Here is the simple code:

    use strict;
    use warnings;

    my $s = '"toto"';

    if ($s =~ /^(?<!\\)"(((?<=\\)"|[^"])+)(?<!\\)"$/) {
        print "It matches!\n";
        print $1 . "\n";
        print $2 . "\n";
    }

And this is the result:

    It matches!
    toto
    o

????

Where does the « o » come from?

Any ideas?

Best regards

Denis

Re: negative lookbehind and VERY strange capture
by Corion (Patriarch) on Sep 18, 2016 at 11:17 UTC

    The $2 is filled repeatedly by ((?<=\\)"|[^"])+, and the last thing it matched was the o at the end of toto.

    Also, it looks as if you are trying to parse quoted constructs. Have you considered what should happen for the following strings:

        "Toto\"ro"
        "Toto\\Africa"
        "Toto\\"

    Personally, I prefer the following approach for quoted constructs with backslash escaping, instead of dealing with lookbehind:

        ^"((?:[^"\\]+|\\["\\])+)"$

    That is, repeatedly match either "a run of characters that are neither a quote nor a backslash" or "a backslash followed by another backslash or a quote".
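    As a rough sketch of how that pattern behaves on the three strings above, assuming it is meant to read ^"((?:[^"\\]+|\\["\\])+)"$ (the variable names and the fourth, unterminated input are illustrative additions, not part of the original question):

```perl
use strict;
use warnings;

# Assumed reading of the pattern: runs of ordinary characters, or a
# backslash followed by a quote or another backslash, repeated.
my $quoted = qr/^"((?:[^"\\]+|\\["\\])+)"$/;

# A single-quoted heredoc keeps every backslash literal.
my @inputs = split /\n/, <<'END';
"Toto\"ro"
"Toto\\Africa"
"Toto\\"
"Toto\"
END

my %result;
for my $input (@inputs) {
    # Store the captured payload, or undef when the string is rejected.
    $result{$input} = $input =~ $quoted ? $1 : undef;
    printf "%-16s => %s\n", $input,
        defined $result{$input} ? $result{$input} : 'no match';
}
```

    The first three inputs should match with the text between the outer quotes in $1. The last one should be rejected, because its only candidate closing quote is consumed as part of the escape \".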

      Hello Corion

      Thank you very much for your suggestion! It helps a lot!

      my @tests = (
          '"abcd\\\\efgh"',
          '"abcd\\""',
          '"abcd\\"efgh"',
          '"abcd\\\\\\"efgh"',
          '"abcd\\\\i\\"efgh"',
          '"abcd\\\\"',
      );

      foreach my $test (@tests) {
          print "Try for \"$test\":\n";
          if ($test =~ /^"((?:[^"\\]|\\["\\])+)"$/) {
              print "It matches!\n";
              print '$1: ' . $1 . "\n";
              print '$2: ' . (defined($2) ? "\$2 is defined\n" : "\$2 is NOT defined\n");
          } else {
              print "It does not match!\n";
          }
          print "\n";
      }

      Result:

      Try for ""abcd\\efgh"":
      It matches!
      $1: abcd\\efgh
      $2: $2 is NOT defined

      Try for ""abcd\""":
      It matches!
      $1: abcd\"
      $2: $2 is NOT defined

      Try for ""abcd\"efgh"":
      It matches!
      $1: abcd\"efgh
      $2: $2 is NOT defined

      Try for ""abcd\\\"efgh"":
      It matches!
      $1: abcd\\\"efgh
      $2: $2 is NOT defined

      Try for ""abcd\\i\"efgh"":
      It matches!
      $1: abcd\\i\"efgh
      $2: $2 is NOT defined

      Try for ""abcd\\"":
      It matches!
      $1: abcd\\
      $2: $2 is NOT defined

      Best regards

      Denis

        Glad to hear you have a working solution. Might I also suggest that for test scripts like this you consider using one of the Test::* frameworks? They will help to highlight where your matches fail. Here's an example using the ubiquitous Test::More to show how simple it would be to integrate.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Test::More;

        # Set up source strings (keys) and expected results (values)
        my %tests = (
            '"abcd\\\\efgh"'       => 'abcd\\\\efgh',
            '"abcd\\""'            => 'abcd\\"',
            '"abcd\\"efgh"'        => 'abcd\\"efgh',
            '"abcd\\\\\\"efgh"'    => 'abcd\\\\\\"efgh',
            '"abcd\\\\i\\"efgh"'   => 'abcd\\\\i\\"efgh',
            '"abcd\\\\"'           => 'abcd\\\\'
        );

        # Set the total number of tests to perform
        plan tests => 3 * keys %tests;

        while ( my ($test, $exp) = each %tests) {
            ok ($test =~ /^"((?:[^"\\]|\\["\\])+)"$/, "$test matches");
            is ($1, $exp, "\$1 is $exp");
            is ($2, undef, '$2 is undefined');
        }

        If any of the tests fail it is easier to spot than having to visually parse the script output. You can also run prove on the script to get just a summary which is even clearer.

        If you write a lot of scripts like this, it is well worth becoming familiar with the wealth of testing modules available.

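        FWIW, please note that /^"((?:[^"\\]|\\["\\])+)"$/ does not match an empty quoted string (""), because the + quantifier requires at least one character between the quotes. Using * instead allows it: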

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $emptystring = '\"\"';
         print '+ match' if $emptystring =~ /^\"((?:[^^\"\\]|\\[\"\\])+)\"$/;
         print '* match' if $emptystring =~ /^\"((?:[^^\"\\]|\\[\"\\])*)\"$/;
        "
        * match

        (Please forgive all the extra backslashes and ignore the extra ^ in [^^\"\\]. These are Windoze command line artifacts.)



Re: negative lookbehind and VERY strange capture
by BrowserUk (Patriarch) on Sep 18, 2016 at 11:22 UTC

    Expanding your regex with the /x modifier makes it obvious:

    use strict;
    use warnings;

    my $s = '"toto"';

    if ($s =~ m[^
            (?<!\\)"
            (
                (
                    (?<=\\)"|[^"]
                )+              # $2
            )                   # $1
            (?<!\\)"
        $]x
    ) {
        print "It matches!\n";
        print $1 . "\n";
        print $2 . "\n";
    }

    You have two nested pairs of capturing parens, the inner pair with a repeat. So the outer pair captures everything matched by the inner repetition, and the inner pair captures only the last thing it matched. Hence $1 is "toto" and $2 is its last character, "o".
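    A minimal sketch of that behaviour, using a deliberately simplified pattern (the pattern and variable names here are illustrative, not taken from the original code):

```perl
use strict;
use warnings;

# A quantified capture group is overwritten on every iteration,
# so after the match it holds only the last iteration's text.
my @captures;
if ('toto' =~ /^((\w)+)$/) {
    @captures = ($1, $2);
}
print "outer: $captures[0]\n";   # everything matched by the repetition
print "inner: $captures[1]\n";   # the last single character only

# To collect every iteration, match globally in list context instead.
my @chars = 'toto' =~ /(\w)/g;
print "all: @chars\n";
```

    If only the outer capture is wanted, making the inner group non-capturing with (?:...) keeps $2 undefined.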



      Hello BrowserUK,

      Thank you very much for your answer. I just learned something. I used « non-capturing parentheses » and it did the trick:

      my $s = '"abcd"';

      print "Try for \"$s\":\n";
      if ($s =~ /^(?<!\\)"((?:(?<=\\)"|[^"])*)(?<!\\)"$/) {
          print "It matches!\n";
          print '$1: ' . $1 . "\n";
          print '$2: ' . (defined($2) ? "\$2 is defined\n" : "\$2 is NOT defined\n");
      } else {
          print "It does not match!\n";
      }

      However, as Corion pointed out, my regular expression does not work for this string « abcd\\ ». I’ll try some new approaches.
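      A small sketch of why that input fails: a one-character lookbehind cannot tell whether the backslash before the final quote is itself escaped. The alternation pattern used for comparison is an assumption based on Corion's suggestion (with * so an empty payload is also allowed), and the variable names are illustrative.

```perl
use strict;
use warnings;

# Lookbehind version from the post above (non-capturing inner group).
my $lookbehind  = qr/^(?<!\\)"((?:(?<=\\)"|[^"])*)(?<!\\)"$/;

# Alternation version: consumes each backslash together with the
# character it escapes, so pairs of backslashes are handled correctly.
my $alternation = qr/^"((?:[^"\\]|\\["\\])*)"$/;

# The payload ends in an escaped backslash, so the final quote is a
# genuine closing quote (single-quoted heredoc: backslashes are literal).
chomp(my $input = <<'END');
"abcd\\"
END

my $lookbehind_ok  = $input =~ $lookbehind  ? 1 : 0;
my $alternation_ok = $input =~ $alternation ? 1 : 0;

print "lookbehind:  ", ($lookbehind_ok  ? "match" : "no match"), "\n";
print "alternation: ", ($alternation_ok ? "match" : "no match"), "\n";
```

      The lookbehind version should reject the string, because the closing quote is preceded by a backslash and (?<!\\) cannot see that this backslash is the second half of an escaped pair. The alternation version should accept it.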

      Best regards

      Denis

Re: negative lookbehind and VERY strange capture
by BillKSmith (Monsignor) on Sep 18, 2016 at 12:26 UTC
Re: negative lookbehind and VERY strange capture
by Denis.Beurive (Initiate) on Sep 18, 2016 at 11:31 UTC
    Thank you very much for your answers. I’ll take some time to fully understand. Best Regards, Denis