Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How can I reliably determine which Perl regex group fails (and which, if any, still succeed) when a regex fails? For example:
'okoK'=~ /(o|k)(k|i)(O|K)(K)/

Replies are listed 'Best First'.
Re: Tell or determine whichever Perl regex group fails
by kcott (Archbishop) on Jan 04, 2022 at 07:47 UTC

    Take a look at Regexp::Debugger. I've been using this module for about a decade and can highly recommend it.

    "How can I reliably determine which Perl regex group fails ..."

    It doesn't really work like that. Assuming you've installed Regexp::Debugger, you can run

    $ rxrx -e 'q{okoK} =~ /(o|k)(k|i)(O|K)(K)/'

    and watch:

    1. 1st group (o|k) match success on 1st alternative ($1 = "o")
    2. 2nd group (k|i) match success on 1st alternative ($2 = "k")
    3. 3rd group (O|K) match fail
    4. BACKTRACK
    5. 2nd group (k|i) match fail on 2nd alternative ($2 removed)
    6. BACKTRACK
    7. 1st group (o|k) match fail on 2nd alternative ($1 removed)
    8. ... Regex failed to match after 66 steps ...

    So, as you can see, capture groups can potentially both succeed and fail at different points in the matching process.

    — Ken

Re: Tell or determine whichever Perl regex group fails
by Corion (Patriarch) on Jan 04, 2022 at 09:22 UTC

    I have wished for that for a long time, but while a regular expression can definitively match, I haven't found a way to programmatically determine where the match fails. This is because the engine tries all alternatives, and only after all of them have failed has the entire regular expression failed.

    I think you could be clever and for any regular expression with a fixed length of atoms anchored to the start, you can tell where it fails, but any unanchored regex will simply fail "at the end" as that is the last position where it checks.
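
    The anchored case can be sketched by matching the pieces one at a time with \G and /gc. This is only a rough illustration: the helper name first_failing_piece is made up, and it assumes no piece ever needs to backtrack into an earlier one.

```perl
use strict;
use warnings;

# Hypothetical helper: try each piece in order, starting at the beginning of
# the string, and return the 1-based index of the first piece that does not
# match where the previous one stopped (0 if every piece matched).
sub first_failing_piece {
    my ($str, @pieces) = @_;
    pos($str) = 0;
    for my $i (0 .. $#pieces) {
        my $re = $pieces[$i];
        # \G anchors at the current pos(); /c keeps pos() on failure.
        return $i + 1 unless $str =~ /\G$re/gc;
    }
    return 0;
}

print first_failing_piece('okoK', qr/o|k/, qr/k|i/, qr/O|K/, qr/K/), "\n";
```

    For 'okoK' this reports piece 3, the (O|K) group. It says nothing useful about patterns where an earlier piece could have matched differently.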

    I also use Regexp::Debugger to find where a regexp fails to match. It is unwieldy if you construct your target regexp from variables, but I usually extract the offending regular expression (as a string) and the target string into a separate file to check where things go sideways.

Re: Tell or determine whichever Perl regex group fails
by LanX (Saint) on Jan 04, 2022 at 01:30 UTC
    What for? Debugging?

    You can embed code after each group; it is executed only if the preceding group matched.

    DB<7> p 'okoK'=~ /(o|k)(?{print "#1 matched"})(k|i)(O|K)(K)/
    #1 matched

    This can become extremely ugly with more complex regexes.

    And it is no sure bet: Perl can skip the match attempt entirely if its heuristics determine that the match can never succeed.

    DB<12> p 'okoK'=~ /(o|k)(?{print "#1 matched"})(k|i)X(O|K)(K)/   # no "X", why bother?

    DB<13>

    So you have to play around with use re "debug" and friends to find the real reason - see re - or use an interactive regex debugger.

    Both require understanding how regexes work internally.
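
    The embedded-code idea can be pushed a bit further by recording the highest group that matched at any point. This is a sketch only: the variable $reached is made up for the example, the ^ anchor is added so that only one start position is tried, and the caveat above still applies (if the optimizer rejects the string up front, no code block runs and $reached stays 0).

```perl
use strict;
use warnings;

# Highest group number that matched at any point during the attempt.
our $reached = 0;

'okoK' =~ /
    ^
    (o|k) (?{ $reached = 1 if $reached < 1 })
    (k|i) (?{ $reached = 2 if $reached < 2 })
    (O|K) (?{ $reached = 3 if $reached < 3 })
    (K)   (?{ $reached = 4 if $reached < 4 })
/x;

print "furthest group matched: $reached\n";
```

    For 'okoK' this should report 2, i.e. the third group is the first one that never matched.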

    Cheers Rolf
    (addicted to the Perl Programming Language :)

Re: Tell or determine whichever Perl regex group fails
by bliako (Abbot) on Jan 04, 2022 at 13:27 UTC

    There exists GraphViz::Regex which visualises a regex as a graph.

    My idea was to visualise what happens to that graph when an input string is run against the regex. And then find unvisited or failed nodes. Sorry, that's a rough sketch. Whatever the way, I don't know how to do that. BUT! there is re_graph.pl which claims to not only visualise a regex but also visualise it when run against some input, see the example parsing perl comments. The author is Steve Oualline. Unfortunately I did not manage to get that example to work.

    Update: along the lines of the above rough sketch I found (again I think?) this online regex visualiser https://blog.robertelder.org/regular-expression-visualizer/ which shows that it is visible. Anonymous Monk has a way to conveniently extract all the info from running the regex against some input: Re: Tell or determine whichever Perl regex group fails

    bw, bliako

Re: Tell or determine whichever Perl regex group fails
by ikegami (Patriarch) on Jan 04, 2022 at 15:00 UTC

    They all fail. Eventually, the regex engine will attempt to match at position 4, and not even the first capture will succeed.

      Anchoring the match would change that.

      'okoK'=~ /^(o|k)(k|i)(O|K)(K)/

      You can alter this pattern to give the desired information.

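      'okoK' =~ /
         ^
         (?: (o|k)
            (?: (k|i)
               (?: (O|K)
                  (?: (K) )?
               )?
            )?
         )?
      /x;

      # Number of characters matched. Each group matches exactly one
      # character, so this is also the number of groups that matched.
      my $n = length($&);

      # -or-

      use List::Util qw( first );

      # Number of the first group that did not match (undef if all four matched).
      my $n = first { !defined($-[$_]) } 1..4;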
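
      A variation on the same pattern, as a sketch: perlvar documents $#- as the number of the last group that matched, so it can stand in for the explicit search. The helper name last_group_matched is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: number of the last group that matched (0 if none).
sub last_group_matched {
    my ($str) = @_;
    # Every group is optional, so the match itself is not expected to fail.
    $str =~ / ^ (?: (o|k) (?: (k|i) (?: (O|K) (?: (K) )? )? )? )? /x
        or return 0;
    return $#-;    # index of the last group that took part in the match
}

print "$_: ", last_group_matched($_), "\n" for qw( okoK okKK xyz );
```

      With these inputs the helper should report 2, 4 and 0 respectively; the first failing group is then one more than the returned number.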

Re: Tell or determine whichever Perl regex group fails
by Anonymous Monk on Jan 04, 2022 at 16:07 UTC

    Because this is Perl, There Is More Than One Way To Do It. The re module is in core, and will produce a trace of both the compilation and execution of a regular expression. In this case:

    $  perl -Mre=debug -e '"okoK" =~ /(o|k)(k|i)(O|K)(K)/'
    Compiling REx "(o|k)(k|i)(O|K)(K)"
    Final program:
       1: OPEN1 (3)
       3:   TRIE-EXACTko (9)
            <o> 
            <k> 
       9: CLOSE1 (11)
      11: OPEN2 (13)
      13:   TRIE-EXACTik (19)
            <k> 
            <i> 
      19: CLOSE2 (21)
      21: OPEN3 (23)
      23:   TRIE-EXACTKO (29)
            <O> 
            <K> 
      29: CLOSE3 (31)
      31: OPEN4 (33)
      33:   EXACT <K> (35)
      35: CLOSE4 (37)
      37: END (0)
    anchored "K" at 3..3 (checking anchored) stclass AHOCORASICK-EXACTko minlen 4 
    Matching REx "(o|k)(k|i)(O|K)(K)" against "okoK"
    Intuit: trying to determine minimum start position...
      doing 'check' fbm scan, 3..4 gave 3
      Found anchored substr "K" at offset 3 (rx_origin now 0)...
      (multiline anchor test skipped)
    Intuit: Successfully guessed: match at offset 0
       0 <> <okoK>               |   0| 1:OPEN1(3)
       0 <> <okoK>               |   0| 3:TRIE-EXACTko(9)
       0 <> <okoK>               |   0| TRIE: State:    1 Accepted: N TRIE: Charid:  1 CP:  6f After State:    2
       1 <o> <koK>               |   0| TRIE: State:    2 Accepted: Y TRIE: Charid:  0 CP:   0 After State:    0
                                 |   0| TRIE: got 1 possible matches
                                 |   0| TRIE matched word #1, continuing
                                 |   0| TRIE: only one match left, short-circuiting: #1 <o>
       1 <o> <koK>               |   0| 9:CLOSE1(11)
       1 <o> <koK>               |   0| 11:OPEN2(13)
       1 <o> <koK>               |   0| 13:TRIE-EXACTik(19)
       1 <o> <koK>               |   0| TRIE: State:    1 Accepted: N TRIE: Charid:  1 CP:  6b After State:    2
       2 <ok> <oK>               |   0| TRIE: State:    2 Accepted: Y TRIE: Charid:  0 CP:   0 After State:    0
                                 |   0| TRIE: got 1 possible matches
                                 |   0| TRIE matched word #1, continuing
                                 |   0| TRIE: only one match left, short-circuiting: #1 <k>
       2 <ok> <oK>               |   0| 19:CLOSE2(21)
       2 <ok> <oK>               |   0| 21:OPEN3(23)
       2 <ok> <oK>               |   0| 23:TRIE-EXACTKO(29)
                                 |   0| TRIE: failed to match trie start class...
    Match failed
    Freeing REx: "(o|k)(k|i)(O|K)(K)"
    

    If you are using Windows, you will need to use " where my example has ', and vice versa.
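
    If the full compile-and-run dump is more than you need, the re pragma can also be limited to a block and to run-time output only. A minimal sketch, assuming the Debug form of the pragma with the EXECUTE option as described in the re documentation; the variable $matched is only there for illustration. The trace goes to STDERR.

```perl
use strict;
use warnings;

my $matched;
{
    # Lexically scoped: only regexes compiled inside this block are traced,
    # and only their execution (not their compilation) is reported.
    use re qw(Debug EXECUTE);
    $matched = 'okoK' =~ /(o|k)(k|i)(O|K)(K)/ ? 1 : 0;
}

print STDERR "matched: $matched\n";
```

    The last node listed before "Match failed" is where the engine gave up on its final attempt, which is as close to "the group that failed" as the trace gets.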