toadi has asked for the wisdom of the Perl Monks concerning the following question:

hello,

I stumbled on a weird regex thingie I can't explain.

use strict;
use File::Find;

my $path = "/somwhere/to";

find (sub {
    return unless -f;   # only files
    &followup($File::Find::name);
}, $path);

sub followup {
    my $contents = slurp($_[0]);
    if ( $contents =~ /line3/g ) {
        if ( $contents =~ /(line2-)(\w*)/g
            print $2;
    }
}

sub slurp {
    local $/ = undef;
    local *X;
    open X, $_[0] or die "Can't open $_[0]: $!";
    my $slurp = <X>;
    close X or die "Can't close $_[0]: $!";
    $slurp;
}

What the file looks like:

line1-11
line2-12
line3-13

And do you know what? The $2 doesn't get printed,
because line2 isn't found by my regex. Can someone explain this?



--
My opinions may have changed,
but not the fact that I am right

Re: weird regex problem
by davorg (Chancellor) on Jun 14, 2001 at 13:32 UTC

    Well there are quite a few typos in your code which mean that it doesn't compile, but I assume they are just transcription errors.

    Fixing them and running your code, I see that nothing is output as you describe. The problem is in the line

    if ( $contents =~ /(line2-)(\w*)/g ) {

    The /g is unnecessary here and is causing the expression to evaluate as false. Removing the /g makes the code work as expected.

    The /g option matches the regex as often as it can against your string. The final time it tries to match, the match fails and the operator returns false. Without the /g the match only takes place the one time it needs to succeed and the operator returns true.

    Update: Yeah. As others in this thread point out, I had the right fix, but the wrong explanation. Should drink more coffee before posting :)

    --
    <http://www.dave.org.uk>

    Perl Training in the UK <http://www.iterative-software.com>

      Right fix davorg, but wrong explanation :). Here is the relevant code section again:

      if ( $contents =~ /line3/g ) {
          if ( $contents =~ /(line2-)(\w*)/g ) {
              print $2;
          }
      }

      In $contents is the slurped file:
      line1-11
      line2-12
      line3-13
      The first if matches 'line3' and returns true. The second match picks up at the position where the first match left off (because it also has the /g modifier) and fails (!) because line2 comes before line3. Removing the /g modifier from the second 'if' solves the problem, as the match then starts from the beginning of $contents. As a matter of fact, the /g modifier can be left out of both matches.
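
The position carry-over described above can be reproduced without any files. This is a minimal sketch, assuming the three-line sample from the question is held in a string. The names $pos_after_first, $with_g and $without_g are illustrative only and not part of the original code:

```perl
use strict;
use warnings;

# Stand-in for the slurped file from the question.
my $contents = "line1-11\nline2-12\nline3-13\n";

# Scalar-context /g: a successful match records where it ended in pos().
my ($with_g, $pos_after_first);
if ( $contents =~ /line3/g ) {
    $pos_after_first = pos($contents);   # just past "line3"
    # This second /g match starts searching at pos(), i.e. after "line3",
    # so "line2-" (which sits earlier in the string) is never seen.
    $with_g = ( $contents =~ /(line2-)(\w*)/g ) ? $2 : undef;
}

# Start from a clean slate, then repeat without /g on either match.
pos($contents) = undef;
my $without_g;
if ( $contents =~ /line3/ ) {
    $without_g = ( $contents =~ /(line2-)(\w*)/ ) ? $2 : undef;
}

print "pos after first /g match: $pos_after_first\n";
print "with /g:    ", defined $with_g    ? $with_g    : "(no match)", "\n";
print "without /g: ", defined $without_g ? $without_g : "(no match)", "\n";
```

With /g on both matches the second one reports no match; without /g it captures the 12.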

      Some further optimisations I would suggest for this regex:

      • Leave out the capturing brackets for (line2-)
      • Use multiline matching (/m)
      • Compile the patterns only once (/o)
      This then leads to the following code:

      if ( $contents =~ /^line3/mo ) {
          if ( $contents =~ /^line2-(\w*)/mo ) {
              print $1;   # has to be changed as well
          }
      }

      To clarify the /g modifier a little bit further let's take a look at this code (of course see also perlre and perlop):

      my $string = "abcde abcde adcde";
      while ($string =~ /cd/og) {
          print "pos = ", pos $string, "\n";
          if ($string =~ /a(.)/ocg) {
              print "a$1 matched at ", pos $string, "\n";
          }
      }

      Here the first match happens in a while loop but, importantly, still in scalar context. The inner match starts at the position where the outer one left off, as it also has the /g modifier. Then the outer match takes its turn again, starting where the inner one left off. The position in the string is reset only when a match fails, and even that does not happen when the /c modifier is given. The /c is necessary for the second match in this case; otherwise the inner failure would reset the position and the loop would never end.
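
To watch the reset-on-failure behaviour in isolation, pos() can be inspected after each match. This is a small sketch using the same string; the pattern /z/ is just an assumed example of something that cannot match, and the variable names are illustrative:

```perl
use strict;
use warnings;

my $string = "abcde abcde adcde";
my $ok;

# A successful scalar-context /g match leaves pos() just past the match.
$ok = $string =~ /cd/g;
my $after_success = pos($string);

# A failing /g match (there is no "z" in the string) resets pos() to undef ...
$ok = $string =~ /z/g;
my $after_plain_failure = pos($string);

# ... whereas with /c the position survives the failure.
$ok = $string =~ /cd/g;      # matches the first "cd" again
$ok = $string =~ /z/gc;      # fails, but pos() is kept
my $after_c_failure = pos($string);

printf "after success:    %s\n", $after_success;
printf "after failed /g:  %s\n",
    defined $after_plain_failure ? $after_plain_failure : 'undef';
printf "after failed /gc: %s\n", $after_c_failure;
```

The first and last values are both 4 (just past the first "cd"), while the plain failed /g leaves pos() undefined.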

      Taking this code and playing a bit with the modifiers and the string helps a lot in understanding these (not so easy) things. And I haven't even started talking about m//g in list context yet ...

      -- Hofmator

Re: weird regex problem
by japhy (Canon) on Jun 14, 2001 at 16:02 UTC
    japhy at YAPC is still vigilant enough to find and answer your regex questions ;)

    The problem is, as davorg pointed out, the /g modifier -- but I don't think he pointed it out the right way. If you match on a string in scalar context with m//g, then the next m//g match on that string (assuming you've not futzed with pos() or modified the string or such) will start looking where the last one left off. That means that this code:
    $_ = "c a"; if (/a/g and /c/g) { ... }
    won't match, since after the "a" is matched the next regex starts looking after it, which is already past the "c". Removing the /g will make things work as expected.
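
The "c a" case can be tried three ways: with /g on both matches, with no /g, and with /g plus a manual rewind of pos(). This is a sketch only; the result variables $both_g, $no_g and $rewound are made-up names for illustration:

```perl
use strict;
use warnings;

$_ = "c a";

# Both matches use /g: /a/g succeeds and leaves pos() at 3, so /c/g
# only searches the (empty) rest of the string and fails.
my $both_g = ( /a/g and /c/g ) ? 1 : 0;

# The failed /c/g has already reset pos(); assigning undef makes that explicit.
pos($_) = undef;

# Dropping /g: each match starts at the beginning of the string.
my $no_g = ( /a/ and /c/ ) ? 1 : 0;

# Keeping /g but rewinding by hand between the two matches also works.
pos($_) = undef;
my $rewound = 0;
if ( /a/g ) {
    pos($_) = 0;                      # rewind to the start of the string
    $rewound = ( /c/g ) ? 1 : 0;
}

print "both /g: $both_g, no /g: $no_g, /g with rewind: $rewound\n";
```

Only the first variant fails; the rewind variant is shown merely to illustrate what "futzing with pos()" does, and dropping /g remains the simpler fix.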

    japhy -- Perl and Regex Hacker
      The right solution has been offered... but I'm a little puzzled:
      #THIS LINE BELOW!
      if ( $contents =~ /^line2-(\w*)/mo ) {
          print $1;   # has to be changed as well
      }
      }
      Why do you use \w*? First, if all you have is 2-11, 2-12, etc., then \d is by far the better choice. Second, the * allows this line to match: line2-
      I assume you do not want that to happen, so use \d+ (unless you really need the \w).
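
The difference between \w* and \d+ shows up on a line that has nothing after the dash. This is a minimal sketch; the two sample lines and the hash names %star and %plus are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample lines: one well-formed, one with nothing after the dash.
my @lines = ( "line2-12", "line2-" );

my (%star, %plus);
for my $line (@lines) {
    # \w* may match zero characters, so "line2-" alone still matches
    # and the capture is the empty string.
    $star{$line} = $line =~ /^line2-(\w*)/ ? "matched, captured '$1'" : "no match";

    # \d+ insists on at least one digit after the dash.
    $plus{$line} = $line =~ /^line2-(\d+)/ ? "matched, captured '$1'" : "no match";
}

for my $line (@lines) {
    printf "%-10s  \\w*: %-24s  \\d+: %s\n", "'$line'", $star{$line}, $plus{$line};
}
```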

      Also, as to Hofmator's optimization, I disagree. The /o will have NO effect, since no variable is interpolated into the regex; the variable is only bound to it. If it were $contents =~ /$line/ then it would be an optimization, but here it has no effect and only muddies the waters, so to speak. Also, I would speculate that the /m is useless in this situation. The format is:
      line1: /stuff here/
      line2-2: /more stuff/
      .
      .
      .

      Therefore I assume, from what he gave us, that the information appears only at the beginning of the lines and would not sensibly span multiple lines; taking away the /m would therefore be better (not to mention that /m doesn't optimize anything, but slightly detracts from it).

      UPDATE: By the way, to optimize it further, you should add the caret to the regex: /^line1/, if I am correct that line1 is at the beginning of the line.
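
Whether /m matters depends on whether the pattern is applied to a single line or to a whole slurped file. This is a quick sketch for checking it, assuming the sample from the question is held in one string as slurp() would return it (the variable names are illustrative):

```perl
use strict;
use warnings;

# Assumed stand-in for the slurped file: several lines in one string.
my $contents = "line1-11\nline2-12\nline3-13\n";

# Without /m, ^ matches only at the very start of the string.
my $plain = $contents =~ /^line2-(\d+)/  ? $1 : undef;

# With /m, ^ also matches just after every embedded newline.
my $multi = $contents =~ /^line2-(\d+)/m ? $1 : undef;

print "without /m: ", defined $plain ? $plain : "(no match)", "\n";
print "with /m:    ", defined $multi ? $multi : "(no match)", "\n";
```

On a string like this, the anchored pattern finds line2 only when /m is present, so the caret and /m go together when the file is read in one piece.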