Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

A recent thread (need regex help to strip things like embedded C comments) discussed the use of regexes to extract nested ``bracketed'' patterns such as nested C block comments (if such a thing existed in C today; some pre-ANSI-standard implementations supported this feature).

The discussion of the (??{ code }) extended pattern in perlre gives an example of such a regex for extracting nested parenthetic pairs:

    $re = qr{
              \(
              (?:
                 (?> [^()]+ )    # Non-parens without backtracking
               |
                 (??{ $re })     # Group with matching parens
              )*
              \)
            }x;
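For readers who want to try the perlre pattern directly, here is a minimal usage sketch. The sample string and the variable names `$text` and `@groups` are made up for illustration; `$re` is declared as a package variable ahead of the assignment so that the embedded code block can refer to it.

```perl
use strict;
use warnings;

our $re;    # declared first so the pattern can refer to itself
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # non-parens without backtracking
      |
        (??{ $re })     # group with matching parens
    )*
    \)
}x;

# Hypothetical sample input: two balanced groups and one dangling "(".
my $text   = 'f(a(b)c) g(d) h(';
my @groups = $text =~ /($re)/g;

print "$_\n" for @groups;
```

Only the balanced groups should be captured; the trailing unclosed paren has nothing to pair with and is skipped.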

This example can be extended to handle arbitrary multi-character starting and ending sequences like /* and */.

The perlre example uses the non-backtracking, ``atomic'' extended pattern (?>pattern), but the example seems to work just as well without it for both single- and multi-character starting and ending sequences, as in the following code...

    use warnings;
    use strict;

    my $open_cmt  = qr{\Q/*}xms;  # NO SPACES: \Q escapes spaces
    my $close_cmt = qr{\Q*/}xms;

    use re 'eval';

    our $paired_parens =  # CAUTION: MUST be package variable!!!!
        qr{  # adapted from example (??{ code }) regex from perlre
            \(
            (?:
                # (?> [^()]+ )          # Non-parens without backtracking - works
                [^()]+                  # Non-parens with backtracking - works
                |
                (??{ $paired_parens })  # Group with matching parens
            )*
            # \)                        # ignore un-paired paren
            (?: \) | \z )               # grab un-paired paren to end of string
        }xms;

    our $c_comment =  # CAUTION: MUST be package variable!!!!
        qr{
            $open_cmt
            (?:
                # (?> (?: (?! $open_cmt) (?! $close_cmt) . )+ )  # works
                # (?> (?: (?! $open_cmt | $close_cmt ) . )+ )    # works
                (?: (?! $open_cmt | $close_cmt ) . )+            # works
                |
                (??{ $c_comment })  # nested comment
            )*
            $close_cmt                # ignore improperly closed comment
            # (?: $close_cmt | \z )   # grab un-closed comment to string end
        }xms;

    my $result;

    my $parens = "degenerate examples () (((()))) ((((()))))
    (simple) parens (nested(with)other) stuff
    multi-line ( nested (parens () (non-empty) (sequential)
    ( (multi-line) (sequential) ( foo ((())) ) bar ) ) )
    improperly ( paired ( parens )";

    ($result = $parens) =~
        s{ ($paired_parens) }
         {
         # print "captured: <$1> \n";  # FOR DEBUG
         "PAIR:$1:RIAP";
         }egxms;

    print "$result \n";

    my $comments = "/* simple comment on its own line */
    various degenerate comments /**/ /*/*/*/*/*/**/*/*/*/*/*/
    simplest multi-level comment /*/*/*/*/*/**/*/*/*/*/*/ with other stuff
    simplest seven-deep /*/*/*/*/*/*/**/*/*/*/*/*/*/ comment
    two /* sequential */ comments /* on a line */ with other stuff
    two-deep /* nested /* comments */ on a single */ line
    three-deep /* nested /* comments /* (level 3) */ */ on single */ line
    five-deep /* multi-line comment
    /* with *********
    /* sequential ***********
    /***************
    /* comments */ /* near */ /* lowest */ /* level */
    /* on */ /* multiple */ /* lines /* (and a fifth level) */ */
    */ finish four-deep ******
    */ finish three-deep *******
    */ finish two-deep
    */ end complex
    nested multi-line comment
    improperly /* nested /* comment */";

    ($result = $comments) =~
        s{ ($c_comment) }
         {
         # print "captured: <$1> \n";  # FOR DEBUG
         "PAIR:$1:RIAP";
         }egxms;

    print "$result \n";

My question: What is the reason, if any, for using the atomic sub-expression in the original perlre example?

Replies are listed 'Best First'.
Re: Useless use of `atomic' regex extended pattern?
by moritz (Cardinal) on Jul 26, 2007 at 09:06 UTC
    Unless I'm very much mistaken, the point is simply efficiency.

    To stay in the parens example: if a [^()]+ subregex matches, and the subregex after that fails, the regex engine will backtrack.

    But you know that it doesn't have to, so in case of a non-matching string the explicitly non-backtracking pattern will be faster.

    I don't have a perl here, but if you want to verify (or falsify) my statement, try to create a large string that doesn't match the pattern, and benchmark both the backtracking and non-backtracking example.
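The benchmark suggested above might be sketched as follows, using the core Benchmark module. The names `$re_atomic`, `$re_plain`, the anchored wrappers, and the test string are all assumptions made for this sketch, not anything from the original post. The input is deliberately short, and it contains a ")" so that the engine cannot reject it merely for lacking one. Timings will vary with the perl version and its optimizer, so no particular result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Package variables, declared before the patterns that refer to them.
our ($re_atomic, $re_plain);

$re_atomic = qr{
    \(
    (?:
        (?> [^()]+ )          # non-parens, atomic
      |
        (??{ $re_atomic })    # nested group
    )*
    \)
}x;

$re_plain = qr{
    \(
    (?:
        [^()]+                # non-parens, free to backtrack
      |
        (??{ $re_plain })     # nested group
    )*
    \)
}x;

# Anchored wrappers, so that a failure is a failure of the whole string.
my $anchored_atomic = qr{ \A $re_atomic \z }x;
my $anchored_plain  = qr{ \A $re_plain  \z }x;

# Hypothetical input: the outer paren is never closed.
my $no_match = '(' . ('a' x 12) . '()';

cmpthese(-1, {
    atomic => sub { $no_match =~ $anchored_atomic },
    plain  => sub { $no_match =~ $anchored_plain  },
});
```

Lengthening the run of "a" characters should widen any gap between the two, but do so cautiously: the backtracking version can slow down sharply as the input grows.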

Re: Useless use of `atomic' regex extended pattern?
by ikegami (Patriarch) on Jul 26, 2007 at 15:17 UTC

    Efficiency (in the case where there's no match).

    Imagine trying to match against the string "(abc".

    Without the (?>...)
    -------------------
    (abc<fail>
    (ab<fail>
    (a<fail>
    (<fail>
    <fail>

    With the (?>...)
    ----------------
    (abc<fail>
    (<fail>
    <fail>

    It's safe to do so here since [^()] and (??{ $re }) are mutually exclusive.
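One way to observe the difference in the traces above is to count how often the engine gets past the non-paren run, using an embedded (?{ ... }) code block. This is only a sketch under stated assumptions: the patterns are simplified, non-recursive, anchored versions of the perlre example, and `$tries`, `count_tries`, and the sample text are names invented here. The sample text includes a later ")" so that the engine actually attempts the match instead of rejecting the string up front.

```perl
use strict;
use warnings;

our $tries;    # package variable so the embedded code block can see it

# Run one match and report (matched?, number of times the code block ran).
sub count_tries {
    my ($re, $text) = @_;
    $tries = 0;
    my $matched = $text =~ $re ? 1 : 0;
    return ($matched, $tries);
}

# The code block runs each time the engine gets past the non-paren run.
my $plain  = qr{ \A \(     [^()]+     (?{ $tries++ }) \) }x;
my $atomic = qr{ \A \( (?> [^()]+ )   (?{ $tries++ }) \) }x;

my $text = '(abc(x)';    # hypothetical input: "(abc" is never closed

my ($plain_matched,  $plain_tries)  = count_tries($plain,  $text);
my ($atomic_matched, $atomic_tries) = count_tries($atomic, $text);

print "plain:  matched=$plain_matched tries=$plain_tries\n";
print "atomic: matched=$atomic_matched tries=$atomic_tries\n";
```

Neither pattern matches, but the plain version should report more tries, since it retries with "ab" and "a" after "abc" fails, while the atomic version gives up after its first attempt.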