Useless use of `atomic' regex extended pattern?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

A recent thread (need regex help to strip things like embedded C comments) discussed the use of regexes to extract nested ``bracketed'' patterns such as nested C block comments (if such a thing existed in C today; some pre-ANSI-standard implementations supported this feature).

The discussion of the (??{ code }) extended pattern in perlre gives an example of such a regex for extracting nested parenthetic pairs:

  $re = qr{
             \(
             (?:
                (?> [^()]+ )    # Non-parens without backtracking
              |
                (??{ $re })     # Group with matching parens
             )*
             \)
          }x;

This example can be extended to handle arbitrary multi-character starting and ending sequences like /* and */.

The perlre example uses the non-backtracking, ``atomic'' extended pattern (?>pattern), but the example seems to work just as well without it for both single- and multi-character starting and ending sequences, as in the following code...

use warnings;
use strict;


my $open_cmt  = qr{\Q/*}xms;  # NO SPACES: \Q escapes spaces
my $close_cmt = qr{\Q*/}xms;

use re 'eval';

our $paired_parens =  # CAUTION: MUST be package variable!!!!
    qr{ # adapted from example (??{ code }) regex from perlre
        \(
        (?:
           # (?> [^()]+ )    # Non-parens without backtracking - works
           [^()]+            # Non-parens with backtracking - works
         |
           (??{ $paired_parens })  # Group with matching parens
        )*
        # \)             # ignore un-paired paren
        (?: \) | \z )  # grab un-paired paren to end of string
      }xms;

our $c_comment =  # CAUTION: MUST be package variable!!!!
    qr{
      $open_cmt
      (?:
        # (?> (?: (?! $open_cmt) (?! $close_cmt) . )+ )  # works
        # (?> (?: (?! $open_cmt | $close_cmt )   . )+ )  # works
        (?: (?! $open_cmt | $close_cmt ) . )+            # works
        |
        (??{ $c_comment })   # nested comment
      )*
      $close_cmt             # ignore improperly closed comment
      # (?: $close_cmt | \z )  # grab un-closed comment to string end
      }xms;


my $result;

my $parens =
"degenerate examples
()
(((())))
((((()))))
(simple)
parens (nested(with)other) stuff
multi-line ( nested
    (parens
    () (non-empty) (sequential)
        (  (multi-line)
           (sequential)
           ( foo
              ((()))
           ) bar
        ) )
    )
improperly ( paired ( parens )";

($result = $parens) =~
    s{ ($paired_parens) }
     { # print "captured: <$1> \n";  # FOR DEBUG
       "PAIR:$1:RIAP";
     }egxms;

print "$result \n";


my $comments =
"/* simple comment on its own line */
various degenerate comments
/**/
/*/*/*/*/*/**/*/*/*/*/*/
simplest multi-level comment /*/*/*/*/*/**/*/*/*/*/*/ with other stuff
simplest seven-deep /*/*/*/*/*/*/**/*/*/*/*/*/*/ comment
two /* sequential */ comments /* on a line */ with other stuff
two-deep /* nested /* comments */ on a single */ line
three-deep /* nested /* comments /* (level 3) */ */ on single */ line
five-deep
    /* multi-line comment
        /* with *********
            /* sequential ***********
                /***************
                    /* comments */ /* near */ /* lowest */ /* level */
                    /* on */
                    /* multiple */
                    /* lines /* (and a fifth level) */ */
                */ finish four-deep ******
            */ finish three-deep *******
        */ finish two-deep
    */
end complex nested multi-line comment
improperly /* nested /* comment */";

($result = $comments) =~
    s{ ($c_comment) }
     { # print "captured: <$1> \n";  # FOR DEBUG
       "PAIR:$1:RIAP";
     }egxms;

print "$result \n";

My question: What is the reason, if any, for using the atomic sub-expression in the original perlre example?

Re: Useless use of `atomic' regex extended pattern? by moritz (Cardinal) on Jul 26, 2007 at 09:06 UTC
Unless I'm very much mistaken, the point is simply efficiency. To stay with the parens example: if a `[^()]+` subregex matches and the subregex after it fails, the regex engine will backtrack into it. But you know that it doesn't have to, so for a non-matching string the explicitly non-backtracking pattern will be faster. I don't have a perl here, but if you want to verify (or falsify) my statement, try to create a large string that doesn't match the pattern, and benchmark both the backtracking and the non-backtracking example.
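The benchmark suggested above could be sketched roughly as follows, using the core Benchmark module. The variable names and the test string are hypothetical, not taken from the thread. One assumption worth stating: a string with no `)` at all may be rejected by Perl's optimizer before the matching engine ever runs, so the sample string here has an unclosed `(` followed later by a balanced pair. The attempt starting at the first `(` must then fail the hard way before the engine finds the match at the end. No timings are shown because they depend on the Perl version and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Package variables, declared first, so the (??{ ... }) blocks can refer to them.
our ($backtracking, $atomic);

# The perlre example with the atomic group removed.
$backtracking = qr{ \( (?: [^()]+ | (??{ $backtracking }) )* \) }x;

# The perlre example as given, with the atomic group.
$atomic = qr{ \( (?: (?> [^()]+ ) | (??{ $atomic }) )* \) }x;

# Hypothetical input: an unclosed "(", a run of non-parens, then "()".
# The match attempt at offset 0 fails; the only match is the final "()".
my $text = '(' . ('a' x 12) . '()';

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    backtracking => sub { my $hit = $text =~ $backtracking },
    atomic       => sub { my $hit = $text =~ $atomic },
});
```

Making the run of `a` characters longer should widen any gap between the two variants, but it is best increased gradually, since the backtracking variant can slow down sharply.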
Re: Useless use of `atomic' regex extended pattern? by ikegami (Patriarch) on Jul 26, 2007 at 15:17 UTC
Efficiency (in the case where there's no match). Imagine trying to match against the string `"(abc"`.

  Without the (?>...)
  -------------------
  (abc<fail>
  (ab<fail>
  (a<fail>
  (<fail>
  <fail>

  With the (?>...)
  ----------------
  (abc<fail>
  (<fail>
  <fail>

It's safe to do so here since `[^()]` and `(??{ $re })` are mutually exclusive.
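One way to observe the difference described in this reply is to count how often the engine reaches the closing paren. The sketch below is an illustration only: the `(?{ $tries++ })` counter, the `count_tries` helper and the sample string are all hypothetical additions, not part of the perlre example. The counts it prints will not equal the simplified traces above and may vary between Perl versions. The expectation is only that the version without `(?>...)` reaches the closing paren more often.

```perl
use strict;
use warnings;

# Package variables, declared first, so the embedded code blocks can see them.
our ($plain, $atomic, $tries);

# Without the atomic group: [^()]+ may give characters back on failure.
$plain = qr{
    \(
    (?: [^()]+ | (??{ $plain }) )*
    (?{ $tries++ })     # hypothetical counter: one tick per try at the closing paren
    \)
}x;

# With the atomic group: a matched run of non-parens is never shortened.
$atomic = qr{
    \(
    (?: (?> [^()]+ ) | (??{ $atomic }) )*
    (?{ $tries++ })     # same hypothetical counter
    \)
}x;

# Match $text against $re and report (matched?, number of closing-paren tries).
sub count_tries {
    my ($re, $text) = @_;
    $tries = 0;
    my $hit = ($text =~ $re) ? 1 : 0;
    return ($hit, $tries);
}

# Hypothetical input: an unclosed "(" followed by a balanced "()", so the
# attempt at offset 0 has to fail before the match at the end is found.
my $text = '(' . ('a' x 12) . '()';

my ($plain_hit,  $plain_tries)  = count_tries($plain,  $text);
my ($atomic_hit, $atomic_tries) = count_tries($atomic, $text);

print "without atomic group: matched=$plain_hit, tries=$plain_tries\n";
print "with atomic group:    matched=$atomic_hit, tries=$atomic_tries\n";
```

The counter is a package variable rather than a lexical so that the code blocks in both patterns, including the recursive invocations, update the same value.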