ig has asked for the wisdom of the Perl Monks concerning the following question:

I am seeking enlightenment to understand the behavior of the following program and to know what tools are available to understand/debug/display what is going on.

My test program:

use strict;
use warnings;
use Data::Dumper;

my @strings = qw(exception:tex exception:mex exception:mex asdf
                 tex:exception:mex asdf exception:mex asdf asdf
                 exception:mex);

foreach (1..4) {
    my @filtered = grep { /exception:/g && !/\Gtex/ } @strings;
    print Dumper(\@filtered);
}

produces the following output (perl 5.10.0 on CentOS 5.3):

$VAR1 = [
          'exception:mex',
          'exception:mex',
          'tex:exception:mex',
          'exception:mex',
          'exception:mex'
        ];
$VAR1 = [];
$VAR1 = [
          'exception:mex',
          'exception:mex',
          'tex:exception:mex',
          'exception:mex',
          'exception:mex'
        ];
$VAR1 = [];

I am expecting the same output from each iteration.

One way to get my expected output is to add "pos = 0;".

use strict;
use warnings;
use Data::Dumper;

my @strings = qw(exception:tex exception:mex exception:mex asdf
                 tex:exception:mex asdf exception:mex asdf asdf
                 exception:mex);

foreach (1..4) {
    my @filtered = grep { pos = 0; /exception:/g && !/\Gtex/ } @strings;
    print Dumper(\@filtered);
}

which produces

$VAR1 = [
          'exception:mex',
          'exception:mex',
          'tex:exception:mex',
          'exception:mex',
          'exception:mex'
        ];
$VAR1 = [
          'exception:mex',
          'exception:mex',
          'tex:exception:mex',
          'exception:mex',
          'exception:mex'
        ];
$VAR1 = [
          'exception:mex',
          'exception:mex',
          'tex:exception:mex',
          'exception:mex',
          'exception:mex'
        ];
$VAR1 = [
          'exception:mex',
          'exception:mex',
          'tex:exception:mex',
          'exception:mex',
          'exception:mex'
        ];

I don't understand the behavior and I don't know what tools are available to easily inspect the behavior. I have used gdb to debug RE code in the past, but it is a bit painful.

Replies are listed 'Best First'.
Re: strange behavior of grep with global match
by moritz (Cardinal) on Aug 07, 2009 at 09:25 UTC
    The problem is that /.../g stores the match position in pos($_), but it is not reset anywhere (because without the /g the regex engine doesn't touch pos at all, iirc). Since $_ is just an alias to a particular array element, each element has its own pos().

    If you add a print pos($strings[1]), "\n" in the loop (after the grep, in your first example) you'll see that it first prints 10, in the second iteration it's undef, then 10 again, then undef again.

    So to summarize, a /.../g attaches state to a string, and confuses you if it's not reset. In the most common cases such as while (/foo/g) { ... } you always exhaust the matches until there is a failed match, resetting pos at the end and not causing confusion.

    In Perl 6 this is avoided by storing the match position inside the match object, not associated with the string.
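
The per-element pos() described above can be watched directly. This is only a minimal sketch: it shrinks the original list to two strings, and the names @seen and $pass are mine, not from the original program.

```perl
use strict;
use warnings;

# Two of the original strings are enough to show the effect.
my @strings = qw(exception:tex exception:mex);
my @seen;    # pos($strings[1]) as observed after each pass

for my $pass (1 .. 4) {
    # Same filter as in the question: /g leaves pos() set on each element.
    my @filtered = grep { /exception:/g && !/\Gtex/ } @strings;

    my $p = pos($strings[1]);
    push @seen, defined $p ? $p : 'undef';

    printf "pass %d: kept %d string(s), pos(\$strings[1]) = %s\n",
        $pass, scalar @filtered, $seen[-1];
}
```

On odd passes the /g match succeeds and leaves pos at 10 (just past 'exception:'). On even passes the match starts from there, fails, and resets pos to undef, which is the 10, undef, 10, undef pattern mentioned above.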

      That reminds me of things I knew, vaguely, and ties them together into an explanation I understand. Enlightenment attained!! Eureka!! Thanks moritz.

Re: strange behavior of grep with global match
by Marshall (Canon) on Aug 07, 2009 at 09:05 UTC
    not sure what you want, but remove the /g on /exception:/g. Also not sure what you mean with !/\Gtex/.
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my @strings = qw(exception:tex exception:mex exception:mex asdf
                     tex:exception:mex asdf exception:mex asdf asdf
                     exception:mex);

    foreach (1..4) {
        my @filtered = grep { /exception:/ && !/mex/ } @strings;
        print Dumper(\@filtered);
    }
    __END__
    prints:
    $VAR1 = [
              'exception:tex'
            ];
    $VAR1 = [
              'exception:tex'
            ];
    $VAR1 = [
              'exception:tex'
            ];
    $VAR1 = [
              'exception:tex'
            ];
    if you just need the exception lines:
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my @strings = qw(exception:tex exception:mex exception:mex asdf
                     tex:exception:mex asdf exception:mex asdf asdf
                     exception:mex);

    foreach (1..4) {
        my @filtered = grep { /exception:/ } @strings;
        print Dumper(\@filtered);
    }
    __END__
    prints:
    $VAR1 = [
              'exception:tex',
              'exception:mex',
              'exception:mex',
              'tex:exception:mex',
              'exception:mex',
              'exception:mex'
            ];
    ...3 more of same

      Thanks Marshall.

      Sorry for not providing more context. I am trying to match 'exception:' not followed by 'tex' without using a negative lookahead assertion. See Match with line without a word for the motivation/context. I have alternatives that produce the desired result. At the moment I am trying to understand the behavior of this particular construct, which I don't understand.

      I am also interested to learn better/alternative ways of achieving the objective. I would be interested in solutions that exclude strings with 'tex' immediately after 'exception:' and those that exclude 'tex' anywhere after 'exceptions:'. In either case, strings with 'tex' before and not after 'exceptions:' should not be excluded. This is why /exception:/ && !/tex/, though simple, isn't a solution.

        and those that exclude 'tex' anywhere after 'exceptions:'.

        You can achieve that easily by adding .* (or maybe (?s:.)*) before the tex, so either (?!.*tex) or /exception:/g && !/\G.*tex/s (after which you have to reset pos, as explained in my other reply).
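
A small sketch comparing the two variants just mentioned, for the "tex anywhere after exception:" case. The test strings are made up for illustration, and I have written the lookahead as exception:(?!.*tex), assuming that is the intended full pattern.

```perl
use strict;
use warnings;

# Hypothetical test data, chosen to cover 'tex' before and after 'exception:'.
my @strings = qw(exception:tex exception:mex tex:exception:mex
                 exception:mex:tex asdf);

# Variant 1: negative lookahead. No /g, so pos() never comes into play.
my @lookahead = grep { /exception:(?!.*tex)/s } @strings;

# Variant 2: /g plus \G. pos is reset first so repeated runs behave the same.
my @anchored = grep { pos($_) = 0; /exception:/g && !/\G.*tex/s } @strings;

# Running variant 2 a second time should give the same list, thanks to the reset.
my @anchored_again = grep { pos($_) = 0; /exception:/g && !/\G.*tex/s } @strings;

print "lookahead: @lookahead\n";
print "anchored:  @anchored\n";
print "again:     @anchored_again\n";
```

All three lists should contain just exception:mex and tex:exception:mex. Dropping the pos($_) = 0 from variant 2 brings back the alternating behavior from the original question.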

        I would be interested in solutions that exclude strings with 'tex' immediately after 'exception:'
        /exception:(?:[^t]|t[^e]|te[^x])/
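
A quick check of the alternation above against a few strings. This is only a sketch: the test strings are my own, and the exception:(?!tex) lookahead is included purely as a reference point, even though the goal was to avoid lookahead.

```perl
use strict;
use warnings;

my $alternation = qr/exception:(?:[^t]|t[^e]|te[^x])/;
my $lookahead   = qr/exception:(?!tex)/;    # reference only

# Hypothetical inputs, including strings that end right after the colon.
my @tests = qw(exception:mex exception:tex exception:text exception:te exception:);

my %result;    # string => [alternation matched?, lookahead matched?]
for my $s (@tests) {
    $result{$s} = [ ($s =~ $alternation ? 1 : 0), ($s =~ $lookahead ? 1 : 0) ];
    printf "%-16s alternation=%d lookahead=%d\n", $s, @{ $result{$s} };
}
```

The two patterns agree on the first three strings and differ on the last two. The alternation has to consume at least one character after the colon, so it rejects a string that ends at 'exception:' or 'exception:te', whereas the lookahead accepts both. Whether that matters depends on the data.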
        The more detail we know about the problem, the greater the likelihood of success! I tried again with a couple of simple approaches, shown below.

        Approach #1: All solutions must have the sequence of "tex" followed by "exception" or "exceptions". So the first grep (read these grep "stacks" from the bottom up), takes care of that situation. Then the next grep says if "tex" occurs more than once, then this is a bad line. I don't know if "tex:tex:exception" occurs or not? If so then this approach would filter that out. But if 'tex' only can occur once then this works fine.

        Approach #2: Starts the same as Approach #1, but the second grep says that if "tex" follows exception(s) then this is a "bad line". Remember that the first grep{} assured that we are looking at a line that has a "tex..blah..exceptions"; here we are looking to see if some 'tex' follows that exception part, and if so we filter it out.

        I don't see the need for any fancy look ahead/behind voodoo. Yes, there is a place and a situation for that, but I would go with something simple to understand. If this doesn't do what you want, then modify the @data and the #desired is comment section to more accurately describe what you need.

        #!/usr/bin/perl -w
        use strict;

        my @data = ('exception:mex', 'qwerty', 'tex:exception:mex',
                    'exception:mex', 'exception:tex', 'tex:exception',
                    'tex:exception:tex', 'tex : exceptions:mex',
                    'tex:exception:mex:tex', 'asdf');

        #desired is: tex:exception:mex
        #            tex:exception
        #            tex : exceptions:mex

        my @matches = grep{ #approach #1
                            my @texes = m/tex/g;
                            @texes <=1
                          }
                      grep{/tex.*?exception(s)?/} @data;

        print join ("\n",@matches),"\n\n";

        @matches = grep{ !/exception(s)?.*?tex/} #approach #2
                   grep{/tex.*?exception(s)?/} @data;

        print join ("\n",@matches),"\n";

        __END__
        Prints:
        tex:exception:mex
        tex:exception
        tex : exceptions:mex

        tex:exception:mex
        tex:exception
        tex : exceptions:mex
Re: strange behavior of grep with global match
by Anonymous Monk on Aug 07, 2009 at 09:20 UTC
    use re 'debug';

      Beautiful!! I'm hoping enlightenment will follow once my mind gets over being boggled. Can you point me to some documentation that explains what the output means?

Re: strange behavior of grep with global match
by jethro (Monsignor) on Aug 07, 2009 at 09:43 UTC

    Quote perlop:

    In scalar context, each execution of "m//g" finds the next match, returning true if it matches, and false if there is no further match. The position after the last match can be read or set using the pos() function; see "pos" in perlfunc. A failed match normally resets the search position to the beginning of the string, but you can avoid that by adding the "/c" modifier (e.g. "m//gc"). Modifying the target string also resets the search position.

    So your regex was doing a search in each string and always remembering where it left off. With the first call it found 'exception:' in the string; with the second it reached the end of the string, because there is no second "exception:" in any of the strings. Try this:

    my @strings = qw(exception:texexception: exception:mex ...

    *** produces:

    $VAR1 = [
              'exception:texexception:',
              'exception:mex',
              'exception:mex',
              'exception:mex',
              'exception:mex'
            ];
    $VAR1 = [
              'exception:texexception:'
            ];
    $VAR1 = [
              'exception:mex',
              'exception:mex',
              'exception:mex',
              'exception:mex'
            ];
    $VAR1 = [
              'exception:texexception:'
            ];

    You see that the second "exception:" is found on every second call (the 2nd and 4th iterations).
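
The perlop passage quoted above can be checked on a single string. A minimal sketch follows; the variable names are mine.

```perl
use strict;
use warnings;

my $s = 'exception:mex';

my $ok1         = $s =~ /exception:/g;     # first scalar-context match succeeds
my $after_match = pos($s);                 # position just past 'exception:'

my $ok2        = $s =~ /exception:/g;      # no further match: fails, pos is reset
my $after_fail = pos($s);                  # undef

my $ok3          = $s =~ /exception:/g;    # starts from the beginning again
my $ok4          = $s =~ /exception:/gc;   # fails too, but /c keeps pos
my $after_fail_c = pos($s);                # unchanged by the failed match

$s = 'exception:tex';                      # assigning to the string resets pos
my $after_modify = pos($s);                # undef

for my $pair ([ 'after match',    $after_match ],
              [ 'after failure',  $after_fail ],
              [ 'after /gc fail', $after_fail_c ],
              [ 'after assign',   $after_modify ]) {
    printf "%-15s pos = %s\n", $pair->[0],
        defined $pair->[1] ? $pair->[1] : 'undef';
}
```

The second line of output is the case that bites in the grep: a failed /g match resets pos, so the next pass starts from the beginning again and succeeds.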

    To inspect behaviour, you could call perl with the parameter -Dr which provides extensive regex debugging

      Thanks for the pointer to -Dr. I guess this is different from use re 'debug' as the latter works but I will have to recompile perl with -DDEBUGGING to try -Dr.

      Pointers to any documentation that will help me understand the (I assume) non-trivial output that will be produced would be greatly appreciated. I know of perlreguts, though I'm not very familiar with it and will have to look again.

      When the only tool you know is gdb everything looks like an object module, and your head hurts

        There seems to be not much difference between the output of the two. The advantage of re debug is that you can turn it on and off (with "no re 'debug';") inside the script to limit the output you are getting.
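
A minimal sketch of limiting the trace to one match with use re 'debug' and no re 'debug'. The string and patterns are placeholders; the trace itself goes to STDERR and is not reproduced here.

```perl
use strict;
use warnings;

my $s = 'exception:mex';

# Compiled before the pragma is switched on: no trace for this match.
my $quiet = $s =~ /exception:/;

use re 'debug';                  # trace compilation and execution from here on
my $traced = $s =~ /:mex/;       # this one is traced on STDERR
no re 'debug';                   # and switched off again

# Back to normal: no trace for this match either.
my $quiet_again = $s =~ /mex\z/;

print "matches: $quiet $traced $quiet_again\n";
```

Running it as perl script.pl 2>trace.txt keeps the trace out of the way of the normal output.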

        Here is a diff between the output of the two debug methods on your program (without warnings, Data::Dumper and strict, as these produce output too):

        1a2
        > rarest char : at 9
        6a8
        > rarest char x at 2
        11a14,17
        > Omitting $` $& $' support.
        >
        > EXECUTING...
        >

        As you can see, only 4 non-empty lines differ between the two.