in reply to Regexes: finding ALL matches (including overlap)

Note: I would want "abcdef" =~ m/..*..*./g to return 20 = 6 choose 3 matches.
You can add a simple counter to your regexes with (?{code}):
local $_ = "abcdef";
my $count;
/..*..*.(?{$count++})(?!)/;
print "$count matches\n";   ## "20 matches"
How does that fancy regex work? Every time it passes the "normal" part of the regex, it increments the counter, but the final (?!) part makes the overall expression fail and backtrack (back past the (?{code})) to try again. This process only stops when it has exhausted every possible way to match the "normal" part of the regex.
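As a minimal sketch of the same trick (not part of the snippet above), swapping the counter for a push collects the matched text itself. The capture group, the @all array and the use of $^N are choices made here for illustration, and a package variable is assumed so the code block always sees the same variable.

```perl
use strict;
use warnings;

# Package variable (an assumption of this sketch), visible inside the code block.
our @all;

my $string = "abcdef";

# (..*..*.) captures one way of matching; $^N holds the text of the group
# that closed most recently. (?!) then fails and forces backtracking,
# so the code block runs once per way of matching.
$string =~ /(..*..*.)(?{ push @all, $^N })(?!)/;

# The same substring can be reached by several paths, so deduplicate too.
my %seen;
my @distinct = grep { !$seen{$_}++ } @all;

printf "%d matches, %d distinct substrings\n", scalar @all, scalar @distinct;
print "first: $all[0], last: $all[-1]\n";
# 20 matches, 10 distinct substrings
# first: abcdef, last: def
```

The 20 matches cover only 10 distinct substrings, because the count is over the ways of placing the three dots, not over the text matched.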

There are some issues, though. It's a little messy to reuse this, because to do it programmatically requires use re 'eval', and lexicals that are closed over inside regexes don't always behave the way you think they should. You may have to resort to a symbol-table variable for the counter.

blokhead

Re^2: Regexes: finding ALL matches (including overlap)
by nobull (Friar) on Jun 04, 2005 at 09:11 UTC
    It's a little messy to reuse this, because to do it programmatically requires use re 'eval'
    No, you can (and should) use qr// to avoid this.
    local our $count;
    my $inc_count = qr/(?{$count++})/;
    /..*..*.$inc_count(?!)/;

    Update: local our not my.

      You're right, use re 'eval' is not absolutely required, and I shouldn't have said it like that. But beware! Your example code works fine on an instance-by-instance basis. If you want to do this programmatically and extensibly, though, my warning about closing over lexicals applies. It's tricky to write a generic-use sub that does this kind of matching.

      You may be tempted to do the following, but it won't work:

      sub match_all_ways {
          my ($string, $regex) = @_;
          my $count;
          my $incr = qr/(?{$count++})/;
          $string =~ /(?:$regex)$incr(?!)/;
          return $count;
      }

      print match_all_ways("abcdef", qr/..*..*./);   # 20
      print match_all_ways("abcdef", qr/..*..*./);   # undef
      It's because the qr// object is compiled just once and always refers to the first instance of $count. Every call after the first increments that original variable instead of the fresh $count, so the sub returns undef.

      You have to do something ugly like this to get around it:

      sub match_all_ways {
          use vars '$count';
          my ($string, $regex) = @_;
          local $count = 0;
          my $incr = qr/(?{$count++})/;
          $string =~ /(?:$regex)$incr(?!)/;
          return $count;
      }
      or this
      {
          my $count;
          my $incr = qr/(?{$count++})/;
          sub match_all_ways {
              my ($string, $regex) = @_;
              $count = 0;
              $string =~ /(?:$regex)$incr(?!)/;
              return $count;
          }
      }
      So yes, it can be done programmatically without use re 'eval', but it's non-trivial and a little messy ;)

      blokhead

        sub match_all_ways {
            my ($string, $regex) = @_;
            my $count;
            my $incr = qr/(?{$count++})/;
            $string =~ /(?:$regex)$incr(?!)/;
            return $count;
        }

        print match_all_ways("abcdef", qr/..*..*./);   # 20
        print match_all_ways("abcdef", qr/..*..*./);   # undef

        It's because the qr// object is compiled just once and always refers to the first instance of $count. If you call this sub more than once, you will always get undef.

        I see what you mean by lexicals closed over in regexes not behaving as one would expect. I would have expected the second print to produce 40 instead of undef (i.e. I would have expected $count to behave like a C static variable, as is the case for "regular" closures). Is there any way to rationalize the actual behavior without diving too deeply into the Perl internals? (I ask because, without some rationalization for such odd behavior, there is little chance I will remember it.)

        the lowliest monk

Re^2: Regexes: finding ALL matches (including overlap)
by kaif (Friar) on Jun 04, 2005 at 04:23 UTC

    Great! This is exactly the code idea I wanted. Are there any other ways without using such a construct (just for the sake of TIMTOWTDI)?

    I was always unsure of the level of support of enclosing code within regexen. Do you know what kinds of things can go wrong?

      Do you know what kinds of things can go wrong?

      Backtracking can screw things up:

      my $count;
      'ac' =~ /
         a (?{ $count++ }) b
      |
         a (?{ $count++ }) c
      /x;

      # 1. Matches 'a' in first branch.
      # 2. Increments $count to 1.
      # 3. Fails to match 'b'.
      # 4. Matches 'a' in second branch.
      # 5. Increments $count to 2.
      # 6. Matches 'c'.

      print("$count\n");   # 2

      The fix is to use local. When the regexp backtracks through a local, the old value is restored. The old value is also restored when the regexp successfully matches, so you need to save the result.

      my $count;
      our $c = 0;
      'ac' =~ /
         (?:
            a (?{ local $c = $c + 1 }) b
         |
            a (?{ local $c = $c + 1 }) c
         )
         (?{ $count = $c })   # Save result.
      /x;

      # 1. Matches 'a' in first branch.
      # 2. Increments $c to 1.
      # 3. Fails to match 'b'.
      # 4. Undoes increment ($c = 0).
      # 5. Matches 'a' in second branch.
      # 6. Increments $c to 1.
      # 7. Matches 'c'.
      # 8. $count = $c.

      print("$count\n");   # 1
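The same contrast can be seen on a slightly larger input. This is a sketch adapted from the counting example in perlre rather than from the post above; the variable names ($plain, $cnt, $res_plain, $res_local) are illustrative assumptions.

```perl
use strict;
use warnings;

# Package variables (assumed names); local only works on these, not on my variables.
our ($plain, $cnt) = (0, 0);
my ($res_plain, $res_local);

my $string = 'a' x 8;

# Plain increment: the greedy loop runs 8 times, and the increments are
# not undone when the regex gives characters back to let 'aaaa' match.
$string =~ /(?: a (?{ $plain++ }) )* aaaa (?{ $res_plain = $plain })/x;

# Localized increment: each iteration that is backtracked over restores
# the previous value, so only the 4 iterations that survive are counted.
$string =~ /(?: a (?{ local $cnt = $cnt + 1 }) )* aaaa (?{ $res_local = $cnt })/x;

print "plain: $res_plain, local: $res_local\n";
# plain: 8, local: 4
```

Both results are copied out in a final code block because, as noted above, the localized value is restored once the match finishes.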
Re^2: Regexes: finding ALL matches (including overlap)
by kaif (Friar) on Jun 04, 2005 at 18:00 UTC

    So I just read through perlre and I couldn't find something: how does one include a (code-based) conditional expression in a regex, analogous to actions in P::RD? Is it even possible? If so, then one could not only find the last match (which may differ slightly from reversing the result of a reversed regex):

    "abcdef" =~ /(..*..*.)(?{$last = $^N})(?!)/;
    print "[$last]\n";   ## "[def]"
    but also the (say) tenth match.

    Another solution to my problem would be possible if P::RD had non-greedy matches. Is it likely that this will be implemented soon? I guess I could try hacking on it myself.

    P.S.: Has anyone ever used customre? Super Search gave back only one result ...