A recent thread (need regex help to strip things like embedded C comments) discussed using regexes to extract nested "bracketed" patterns such as nested C block comments. (Standard C does not allow block comments to nest, although some pre-ANSI implementations supported this.)

The discussion of the (??{ code }) extended pattern in perlre gives an example of such a regex for extracting nested parenthetic pairs:

  $re = qr{
             \(
             (?:
                (?> [^()]+ )    # Non-parens without backtracking
              |
                (??{ $re })     # Group with matching parens
             )*
             \)
          }x;
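For reference, here is a minimal sketch of how the perlre pattern might be used to pull balanced groups out of a string. The sample text and the names $text and @groups are made up for illustration; $re is declared with our so that the (??{ ... }) block can see it when the match runs.

```perl
use strict;
use warnings;

# Package variable, so the deferred (??{ $re }) block can refer to it.
our $re;
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # non-parens without backtracking
      |
        (??{ $re })     # nested group with matching parens
    )*
    \)
}x;

# Hypothetical sample input: two balanced groups and one unclosed paren.
my $text   = 'f(a, g(b, c)) + h() + (unclosed';
my @groups = $text =~ m/($re)/g;   # every outermost balanced group

print "<$_>\n" for @groups;
# prints:
# <(a, g(b, c))>
# <()>
```

The unclosed paren at the end is skipped because the pattern insists on a closing \).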

This example can be extended to handle arbitrary multi-character starting and ending sequences like /* and */.

The perlre example uses the non-backtracking "atomic" extended pattern (?>pattern), but it seems to work just as well without it, for both single- and multi-character starting and ending sequences, as the following code shows:

use warnings;
use strict;


my $open_cmt  = qr{\Q/*}xms;  # no whitespace after \Q: it would be quoted and match literally, even under /x
my $close_cmt = qr{\Q*/}xms;

use re 'eval';

our $paired_parens =  # CAUTION: MUST be package variable!!!!
    qr{ # adapted from example (??{ code }) regex from perlre
        \(
        (?:
           # (?> [^()]+ )    # Non-parens without backtracking - works
           [^()]+            # Non-parens with backtracking - works
         |
           (??{ $paired_parens })  # Group with matching parens
        )*
        # \)             # ignore un-paired paren
        (?: \) | \z )  # grab un-paired paren to end of string
      }xms;

our $c_comment =  # CAUTION: MUST be package variable!!!!
    qr{
      $open_cmt
      (?:
        # (?> (?: (?! $open_cmt) (?! $close_cmt) . )+ )  # works
        # (?> (?: (?! $open_cmt | $close_cmt )   . )+ )  # works
        (?: (?! $open_cmt | $close_cmt ) . )+            # works
        |
        (??{ $c_comment })   # nested comment
      )*
      $close_cmt             # ignore improperly closed comment
      # (?: $close_cmt | \z )  # grab un-closed comment to string end
      }xms;


my $result;

my $parens =
"degenerate examples
()
(((())))
((((()))))
(simple)
parens (nested(with)other) stuff
multi-line ( nested
    (parens
    () (non-empty) (sequential)
        (  (multi-line)
           (sequential)
           ( foo
              ((()))
           ) bar
        ) )
    )
improperly ( paired ( parens )";

($result = $parens) =~
    s{ ($paired_parens) }
     { # print "captured: <$1> \n";  # FOR DEBUG
       "PAIR:$1:RIAP";
     }egxms;

print "$result \n";


my $comments =
"/* simple comment on its own line */
various degenerate comments
/**/
/*/*/*/*/*/**/*/*/*/*/*/
simplest multi-level comment /*/*/*/*/*/**/*/*/*/*/*/ with other stuff
simplest seven-deep /*/*/*/*/*/*/**/*/*/*/*/*/*/ comment
two /* sequential */ comments /* on a line */ with other stuff
two-deep /* nested /* comments */ on a single */ line
three-deep /* nested /* comments /* (level 3) */ */ on single */ line
five-deep
    /* multi-line comment
        /* with *********
            /* sequential ***********
                /***************
                    /* comments */ /* near */ /* lowest */ /* level */
                    /* on */
                    /* multiple */
                    /* lines /* (and a fifth level) */ */
                */ finish four-deep ******
            */ finish three-deep *******
        */ finish two-deep
    */
end complex nested multi-line comment
improperly /* nested /* comment */";

($result = $comments) =~
    s{ ($c_comment) }
     { # print "captured: <$1> \n";  # FOR DEBUG
       "PAIR:$1:RIAP";
     }egxms;

print "$result \n";

My question: What is the reason, if any, for using the atomic sub-expression in the original perlre example?
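One way to probe this is to compare the two variants directly. The sketch below is only an experiment, not an answer. It checks that the atomic and non-atomic patterns capture the same text on a few made-up inputs, then times both on a string with unclosed parens followed by a run of non-paren characters. That is the kind of input perlre's section on (?>pattern) uses when it discusses the cost of backtracking. The names $atomic, $plain and first_match are hypothetical, and whether a measurable timing difference shows up will depend on the Perl version and its regex optimizations.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Package variables, following the caution in the code above.
our ($atomic, $plain);

$atomic = qr{ \( (?: (?> [^()]+ ) | (??{ $atomic }) )* \) }x;
$plain  = qr{ \( (?:     [^()]+   | (??{ $plain  }) )* \) }x;

# Return the first substring of $str matched by $re (undef if none).
sub first_match {
    my ($re, $str) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

# 1. Check that both variants capture the same text on each input.
my $agree = 1;
for my $s ('()', '(a(b)c)', 'x (y (z) w) v', '((()))', 'no parens') {
    my ($m1, $m2) = (first_match($atomic, $s), first_match($plain, $s));
    $agree = 0 if ($m1 // 'undef') ne ($m2 // 'undef');
}
print $agree ? "same matches\n" : "matches differ\n";

# 2. Time both variants on input with unclosed parens before a growing
#    run of non-paren characters (sizes kept small on purpose).
for my $n (12, 14, 16) {
    my $s = '((()' . ('a' x $n);
    for my $pair ([atomic => $atomic], [plain => $plain]) {
        my ($name, $re) = @$pair;
        my $t0 = time;
        my $m  = first_match($re, $s);
        printf "n=%2d %-6s matched %s in %.4fs\n",
            $n, $name, $m // 'nothing', time - $t0;
    }
}
```

If the non-atomic timings grow much faster with n than the atomic ones, that would suggest the atomic group is there for speed on hard-to-match input rather than for correctness.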

In reply to Useless use of `atomic' regex extended pattern? by Anonymous Monk
