I really don't mind using more than one regex for this one. You're dealing with more than one rule, so there's a nice symmetry; each rule has corresponding code. If you are concerned with it being verbose where you want it to be terse, move the work out to a subroutine.

Anyway, with those ideas, here's my version:

use Test::More;

my @test = (
  [
    '1 This (is a test) with good parens' => 'is a test',
    'Match in parens'
  ],
  [
    '2 This is a (test with broken a paren' => 'test with broken a paren',
    'Match after left paren'
  ],
  [
    '3 And this would be one) the other way' => '3 And this would be one',
    'Match before right paren'
  ],
  [
    '4 Lastly, no parens' => '',
    'No match'
  ],
);

foreach my $test (@test) {
  my $got = match( $test->[0] );
  is( $got, $test->[1], "$test->[2]: <<$got>>" );
}

done_testing();

sub match {
  for (shift) {
    m/  \(([^)]*?)\)  /x && return $1;    # Both parens.
    m/  \((.*)$       /x && return $1;    # Left paren.
    m/  ^(.*)\)       /x && return $1;    # Right paren.
    m/  ^[^()]*()$    /x && return $1;    # No parens (no capture).
    return;                               # Unreachable.
  }
}

Update: As often happens, I just have to go to bed to have an idea disturb me. Here's an improvement (I think) on sub match:

sub match {
  local $_ = shift;
      m/  \(([^)]*?)\)  /x    # Both parens.
  ||  m/  \((.*)$       /x    # Left paren.
  ||  m/  ^(.*)\)       /x    # Right paren.
  ||  m/  ^[^()]*()$    /x;   # No parens (no capture).
  return $1 // ();
}

Here's another version that combines the logic above into a single regex using alternation. I don't necessarily think this is better; I prefer the simplicity of breaking things into smaller regexes.

sub match {
  shift =~ m/
       (?:    [^(]*\((?<C>[^)]*?)\)    )    # Both parens.
    |  (?:    \((?<C>.*)$              )    # Left paren.
    |  (?:    ^(?<C>.*)\)              )    # Right paren.
    |  (?:    ^[^()]*(?<C>)$           )    # No parens (empty capture).
  /x;
  return $+{C} // ();
}

By using named captures we avoid the problem where other single-regex solutions end up populating either $1, or $2, or $3. That's too much to keep track of, and could be error-prone. Instead, we give every capture the same name and read $+{C}. (Warning: I initially had the vague impression that this relies on undefined behavior, but perlre does document the case: when several capture groups share a name, $+{NAME} refers to the leftmost defined group in the match.)
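To see what reusing one name across branches does in practice, here is a minimal sketch. The pattern and the helper names ($re, capture_c, slots_c) are made up for illustration and are not part of the solution above; %+ and %- are Perl's standard named-capture hashes.

```perl
use strict;
use warnings;

# Hypothetical pattern: two branches, both naming their group C.
my $re = qr/ x(?<C>\d+) | y(?<C>\d+) /x;

# Returns $+{C}: the leftmost group named C that is defined.
sub capture_c {
    my ($text) = @_;
    return $text =~ $re ? $+{C} : undef;
}

# Returns one entry per group named C, in pattern order
# ('undef' for a group that did not take part in the match).
sub slots_c {
    my ($text) = @_;
    return unless $text =~ $re;
    return map { defined $_ ? $_ : 'undef' } @{ $-{C} };
}

for my $text ( 'x12', 'y34' ) {
    printf "%s: \$+{C}=%s  \$-{C}=(%s)\n",
        $text, capture_c($text), join( ', ', slots_c($text) );
}
```

For 'y34' the first group named C is undefined, so $+{C} falls through to the second one; $-{C} shows both slots.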

Update: Having a little fun with this. Here are two more options with subtle changes from the previous.

The next example eliminates named captures. This presents a problem: depending on which branch matched, the capture could land in $1, $2, $3, or $4. choroba avoids this issue by concatenating all possible numeric match variables, but that means possibly interpolating undef, and feels a little dirty (but it is clever). We can avoid that by using $^N, which holds the text matched by the most recently closed capture group.

sub match {
  shift =~ m/
       (?:    [^(]*\(([^)]*?)\)    )    # Both parens.
    |  (?:    \((.*)$              )    # Left paren.
    |  (?:    ^(.*)\)              )    # Right paren.
    |  (?:    ^[^()]*()$           )    # No parens (empty capture).
  /x;
  return $^N // ();
}
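As a rough illustration of why the numbered variables are awkward here, the sketch below reports which group actually captured for each kind of input, alongside $^N. The helper name which_group and the sample strings are assumptions for the demo; the pattern is the same four branches as above.

```perl
use strict;
use warnings;

# Same four branches as above, without named captures or branch reset.
my $re = qr/
       (?:    [^(]*\(([^)]*?)\)    )    # Both parens:  group 1
    |  (?:    \((.*)$              )    # Left paren:   group 2
    |  (?:    ^(.*)\)              )    # Right paren:  group 3
    |  (?:    ^[^()]*()$           )    # No parens:    group 4
/x;

# Returns (number of the group that captured, value of $^N),
# or an empty list if the pattern did not match.
sub which_group {
    my ($text) = @_;
    return unless $text =~ $re;
    my $last = $^N;    # copy it before any other regex runs
    # $-[$n] (a start offset) is defined only if group $n took part.
    my ($n) = grep { defined $-[$_] } 1 .. $#+;
    return ( $n, $last );
}

for my $text ( '(both)', '(left only', 'right only)', 'none' ) {
    my ( $n, $last ) = which_group($text);
    print "'$text': captured in \$$n, \$^N='$last'\n";
}
```

Each sample should report a different group number (1 through 4), while $^N gives the captured text regardless of which branch matched.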

This next one wraps all the alternation branches in the (?|...) branch-reset construct. Capture numbering restarts in each alternative, so every branch fills the same $1. That is the closest I can come, within a single regex, to the multiple-regex solutions I originally presented.

sub match {
  shift =~ m/
    (?|
         (?:    [^(]*\(([^)]*?)\)    )    # Both parens.
      |  (?:    \((.*)$              )    # Left paren.
      |  (?:    ^(.*)\)              )    # Right paren.
      |  (?:    ^[^()]*()$           )    # No parens (empty capture).
    )
  /x;
  return $1 // ();
}

And finally we can remove the grouping (?:...) parens, because alternation already has very low precedence:

sub match {
  shift =~ m/
    (?|
           [^(]*\(([^)]*?)\)    # Both parens.
      |    \((.*)$              # Left paren.
      |    ^(.*)\)              # Right paren.
      |    ^[^()]*()$           # No parens (empty capture).
    )
  /x;
  return $1 // ();
}
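For anyone who wants to check the single-regex form against the original test cases, here is a small harness, offered as a sketch. The sub names match_single and match_multi are mine, and both subs test for a successful match explicitly instead of relying on $1 // (). The last two checks use an extra input ('a) (b') that is not among the original cases: tracing it by hand, the two approaches appear to disagree, because a single regex tries every branch at position 0 before moving on, while the separate regexes apply each rule to the whole string in turn.

```perl
use strict;
use warnings;
use Test::More;

# Final single-regex version from above.
sub match_single {
    my ($text) = @_;
    return $text =~ m/
        (?|
               [^(]*\(([^)]*?)\)    # Both parens.
          |    \((.*)$              # Left paren.
          |    ^(.*)\)              # Right paren.
          |    ^[^()]*()$           # No parens (empty capture).
        )
    /x ? $1 : ();
}

# Multiple-regex version, one rule per statement.
sub match_multi {
    my ($text) = @_;
    for ($text) {
        return $1 if m/ \(([^)]*?)\) /x;    # Both parens.
        return $1 if m/ \((.*)$      /x;    # Left paren.
        return $1 if m/ ^(.*)\)      /x;    # Right paren.
        return $1 if m/ ^[^()]*()$   /x;    # No parens.
    }
    return;
}

my @cases = (
    [ '1 This (is a test) with good parens'    => 'is a test' ],
    [ '2 This is a (test with broken a paren'  => 'test with broken a paren' ],
    [ '3 And this would be one) the other way' => '3 And this would be one' ],
    [ '4 Lastly, no parens'                    => '' ],
);

for my $case (@cases) {
    my ( $input, $want ) = @$case;
    my $single = match_single($input);
    my $multi  = match_multi($input);
    is( $single, $want, "single regex: <<$input>>" );
    is( $multi,  $want, "multi regex:  <<$input>>" );
}

# Extra input (not from the original post) where branch order matters.
my $single = match_single('a) (b');
my $multi  = match_multi('a) (b');
is( $single, 'a', 'single regex takes the right-paren branch first' );
is( $multi,  'b', 'multi regex applies the left-paren rule first' );

done_testing();
```

Whether that difference matters depends on how inputs with a stray right paren followed by a stray left paren should be treated, which the original question does not specify.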

I think that this, being Perl, grants us license to explore in the spirit of There is more than one way to do it. :)

Dave

In reply to Re: Regex to match text in broken parens by davido
in thread Regex to match text in broken parens by Rodster001
