A recent thread ("need regex help to strip things like embedded C comments") discussed using regexes to extract nested "bracketed" patterns, such as nested C block comments. (Standard C does not allow block comments to nest, although some pre-ANSI-standard implementations supported this feature.)

The discussion of the (??{ code }) extended pattern in perlre gives an example of such a regex for extracting nested parenthetic pairs:

    $re = qr{
              \(
              (?:
                 (?> [^()]+ )    # Non-parens without backtracking
               |
                 (??{ $re })     # Group with matching parens
              )*
              \)
            }x;
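For anyone who wants to try the perlre pattern on its own, here is a minimal sketch of it in use. The sample string and the names $text and @groups are my own, not from perlre, and the `use re 'eval'` line is only there defensively because the qr object is interpolated into another pattern.

```perl
use strict;
use warnings;
use re 'eval';    # defensive; the qr object is interpolated into /($re)/ below

# Declared in its own statement so the name is visible inside (??{ ... }).
our $re;
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # Non-parens without backtracking
    |
        (??{ $re })     # Group with matching parens
    )*
    \)
}x;

# Hypothetical sample input: one balanced group, then an unclosed one.
my $text   = 'f(a, g(b, c), (d)) and h(e';
my @groups = $text =~ /($re)/g;

print "<$_>\n" for @groups;
```

On this input only the balanced group is captured, `<(a, g(b, c), (d))>`; the trailing unclosed `h(e` is skipped because the pattern insists on a closing paren.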

This example can be extended to handle arbitrary multi-character starting and ending sequences like /* and */.

The perlre example uses the non-backtracking, "atomic" extended pattern (?>pattern), but the example seems to work just as well without it, for both single- and multi-character starting and ending sequences, as in the following code:

    use warnings;
    use strict;

    my $open_cmt  = qr{\Q/*}xms;    # NO SPACES: \Q escapes spaces
    my $close_cmt = qr{\Q*/}xms;

    use re 'eval';

    our $paired_parens =    # CAUTION: MUST be package variable!!!!
        qr{    # adapted from example (??{ code }) regex from perlre
            \(
            (?:
                # (?> [^()]+ )          # Non-parens without backtracking - works
                [^()]+                  # Non-parens with backtracking - works
            |
                (??{ $paired_parens })  # Group with matching parens
            )*
            # \)                        # ignore un-paired paren
            (?: \) | \z )               # grab un-paired paren to end of string
        }xms;

    our $c_comment =    # CAUTION: MUST be package variable!!!!
        qr{
            $open_cmt
            (?:
                # (?> (?: (?! $open_cmt) (?! $close_cmt) . )+ )  # works
                # (?> (?: (?! $open_cmt | $close_cmt ) . )+ )    # works
                (?: (?! $open_cmt | $close_cmt ) . )+            # works
            |
                (??{ $c_comment })      # nested comment
            )*
            $close_cmt                  # ignore improperly closed comment
            # (?: $close_cmt | \z )     # grab un-closed comment to string end
        }xms;

    my $result;

    my $parens = "degenerate examples () (((()))) ((((()))))
    (simple) parens (nested(with)other) stuff
    multi-line ( nested (parens () (non-empty) (sequential)
        ( (multi-line) (sequential)
            ( foo ((())) ) bar ) ) )
    improperly ( paired ( parens )";

    ($result = $parens) =~
        s{ ($paired_parens) }
         {
           # print "captured: <$1> \n";  # FOR DEBUG
           "PAIR:$1:RIAP";
         }egxms;
    print "$result \n";

    my $comments = "/* simple comment on its own line */
    various degenerate comments /**/ /*/*/*/*/*/**/*/*/*/*/*/
    simplest multi-level comment /*/*/*/*/*/**/*/*/*/*/*/ with other stuff
    simplest seven-deep /*/*/*/*/*/*/**/*/*/*/*/*/*/ comment
    two /* sequential */ comments /* on a line */ with other stuff
    two-deep /* nested /* comments */ on a single */ line
    three-deep /* nested /* comments /* (level 3) */ */ on single */ line
    five-deep /* multi-line comment
        /* with *********
            /* sequential ***********
                /***************
                    /* comments */ /* near */ /* lowest */
                    /* level */ /* on */ /* multiple */
                    /* lines /* (and a fifth level) */ */
                */ finish four-deep ******
            */ finish three-deep *******
        */ finish two-deep
    */ end complex
    nested multi-line comment
    improperly /* nested /* comment */";

    ($result = $comments) =~
        s{ ($c_comment) }
         {
           # print "captured: <$1> \n";  # FOR DEBUG
           "PAIR:$1:RIAP";
         }egxms;
    print "$result \n";

My question: What is the reason, if any, for using the atomic sub-expression in the original perlre example?
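One way to probe this, offered only as a rough sketch: as I read the (?>pattern) section of perlre, that construct is motivated there by performance on input that cannot match, not by any difference in what gets matched. The script below checks that an atomic and a plain variant agree on a few strings, then times both on a string with one opening paren too many. The names $atomic, $plain, $bad and $n, the sample strings, and the \A ... \z anchoring are my own assumptions, and the timings may well show little or no difference depending on the perl version and its regex optimizations.

```perl
use strict;
use warnings;
use re 'eval';                 # defensive: the qr objects are interpolated below
use Time::HiRes qw(time);

# Hypothetical names; package variables, following the code above.
our ($atomic, $plain);
$atomic = qr{ \( (?: (?> [^()]+ ) | (??{ $atomic }) )* \) }x;
$plain  = qr{ \( (?:     [^()]+   | (??{ $plain  }) )* \) }x;

# 1. The two variants should accept and reject the same strings.
for my $s ('()', '(a(b)c)', '((x)(y))', '(a(b)c', 'no parens') {
    my $r1 = $s =~ /\A$atomic\z/ ? 'match' : 'fail';
    my $r2 = $s =~ /\A$plain\z/  ? 'match' : 'fail';
    printf "%-10s atomic=%-5s plain=%-5s\n", $s, $r1, $r2;
}

# 2. Time a match that must fail: one opening paren too many.
my $n   = 16;                              # raise cautiously
my $bad = '((' . ('a' x $n) . ')';
for my $case ([atomic => $atomic], [plain => $plain]) {
    my ($name, $re) = @$case;
    my $t0      = time;
    my $matched = $bad =~ /\A$re\z/ ? 1 : 0;
    printf "%-6s matched=%d elapsed=%.4fs\n", $name, $matched, time - $t0;
}
```

Both variants should report the same match/fail results, since giving characters back from [^()]+ can never help when the next thing required is a paren. If the plain variant's elapsed time grows much faster than the atomic one's as $n increases, that would suggest the atomic group is there to limit backtracking on failure.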


In reply to Useless use of `atomic' regex extended pattern? by Anonymous Monk
