Re: Regex to match text in broken parens

I really don't mind using more than one regex for this one. You're dealing with more than one rule, so there's a nice symmetry; each rule has corresponding code. If you are concerned with it being verbose where you want it to be terse, move the work out to a subroutine.

Anyway, with those ideas, here's my version:

use Test::More;

my @test = (
  [
    '1 This (is a test) with good parens' => 'is a test',
    'Match in parens'
  ],
  [
    '2 This is a (test with broken a paren' => 'test with broken a paren',
    'Match after left paren'
  ],
  [
    '3 And this would be one) the other way' => '3 And this would be one',
    'Match before right paren'
  ],
  [
    '4 Lastly, no parens' => '',
    'No match'
  ],
);

foreach my $test (@test) {
  my $got = match( $test->[0] );
  is( $got, $test->[1], "$test->[2]: <<$got>>" );
}

done_testing();

sub match {
  for (shift) {
    m/  \(([^)]*?)\)  /x && return $1;    # Both parens.
    m/  \((.*)$       /x && return $1;    # Left paren.
    m/  ^(.*)\)       /x && return $1;    # Right paren.
    m/  ^[^()]*()$    /x && return $1;    # No parens (empty capture).
    return;                               # Only reached if no rule matches (e.g. embedded newline).
  }
}

Update: As often happens, I just have to go to bed to have an idea disturb me. Here's an improvement (I think) on sub match:

sub match {
  local $_ = shift;
      m/  \(([^)]*?)\)  /x    # Both parens.
  ||  m/  \((.*)$       /x    # Left paren.
  ||  m/  ^(.*)\)       /x    # Right paren.
  ||  m/  ^[^()]*()$    /x;   # No parens (empty capture).
  return $1 // ();
}

Here's another version that combines the logic above into a single regex using alternation. I don't necessarily think this is better; I prefer the simplicity of breaking things into smaller regexes.

sub match {
  shift =~ m/
       (?:    [^(]*\((?<C>[^)]*?)\)    )    # Both parens.
    |  (?:    \((?<C>.*)$              )    # Left paren.
    |  (?:    ^(?<C>.*)\)              )    # Right paren.
    |  (?:    ^[^()]*(?<C>)$           )    # No parens (empty capture).
  /x;
  return $+{C} // ();
}

By using named captures we avoid the problem where other single-regex solutions result in either $1, or $2, or $3 being populated. That's too much to keep track of, and could be error-prone. Instead, we name every capture the same: $+{C}. (Warning: After checking perlre, I'm of the vague and uncertain impression that this could rely on undefined behavior.)
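
For what it's worth, my reading of perlre is that reusing a name is documented: when several groups share a name, $+{NAME} refers to the leftmost one that is defined in the match. Here is a small sketch that looks at this through %-, which holds one slot per group of a given name. The helper name inspect and the sample strings are made up for illustration; the pattern is the one above.

```perl
use strict;
use warnings;
use 5.010;    # named captures, %+ and %- need Perl 5.10 or later

# The single-regex pattern from above, with every capture named C.
my $re = qr/
       (?:    [^(]*\((?<C>[^)]*?)\)    )    # Both parens.
    |  (?:    \((?<C>.*)$              )    # Left paren.
    |  (?:    ^(?<C>.*)\)              )    # Right paren.
    |  (?:    ^[^()]*(?<C>)$           )    # No parens (empty capture).
/x;

# Hypothetical helper: returns $+{C}, then one entry per group named C
# (left to right; groups that took no part in the match are undef).
sub inspect {
    my ($text) = @_;
    return unless $text =~ $re;
    my $value = $+{C};
    my @slots = @{ $-{C} };
    return ( $value, @slots );
}

for my $text ( 'a (b) c', 'a (b c', 'a b) c', 'a b c' ) {
    my ( $value, @slots ) = inspect($text);
    my $shown = join ', ', map { defined $_ ? "'$_'" : 'undef' } @slots;
    print "'$text': C='$value'; groups named C: $shown\n";
}
```

Only one branch can take part in any given match, so at most one slot should be defined, and that is the one $+{C} hands back.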

Update: Having a little fun with this. Here are two more options with subtle changes from the previous.

The next example eliminates named captures. That presents a problem: the numbered match variable that receives the capture could be $1, $2, $3, or $4, depending on which branch matched. choroba avoids this issue by concatenating all possible numbered match variables, but that means possibly interpolating undef, and feels a little dirty (but it is clever). We can avoid that by using $^N, which contains the text of the most recently closed capture group.

sub match {
  shift =~ m/
       (?:    [^(]*\(([^)]*?)\)    )    # Both parens.
    |  (?:    \((.*)$              )    # Left paren.
    |  (?:    ^(.*)\)              )    # Right paren.
    |  (?:    ^[^()]*()$           )    # No parens (empty capture).
  /x;
  return $^N // ();
}
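
To make the numbering problem concrete, here is a rough sketch that reports which numbered group did the capturing for each kind of input, alongside $^N. The helper name which_group and the sample strings are my own inventions; the pattern is the same as in sub match.

```perl
use strict;
use warnings;

my $re = qr/
       (?:    [^(]*\(([^)]*?)\)    )    # Both parens:  group 1.
    |  (?:    \((.*)$              )    # Left paren:   group 2.
    |  (?:    ^(.*)\)              )    # Right paren:  group 3.
    |  (?:    ^[^()]*()$           )    # No parens:    group 4 (empty).
/x;

# Hypothetical helper: returns the number of the group that captured,
# plus the value of $^N right after the match.
sub which_group {
    my ($text) = @_;

    # In list context a match returns ($1, $2, $3, $4), with undef for
    # groups that did not take part.
    my @caps = $text =~ $re or return;
    my $last_closed = $^N;
    my ($n) = grep { defined $caps[ $_ - 1 ] } 1 .. @caps;
    return ( $n, $last_closed );
}

for my $text ( 'a (b) c', 'a (b c', 'a b) c', 'a b c' ) {
    my ( $n, $last_closed ) = which_group($text);
    print "'$text': group $n captured, \$^N is '$last_closed'\n";
}
```

The group number changes with the branch, but $^N follows whichever group closed last, so the caller never has to know which branch won.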

This next one wraps all the alternation branches in the (?|...) branch reset construct. That means each alternative uses the same $1, which is the closest I can come, within a single regex, to the multiple-regex solutions I originally presented.

sub match {
  shift =~ m/
    (?|
         (?:    [^(]*\(([^)]*?)\)    )    # Both parens.
      |  (?:    \((.*)$              )    # Left paren.
      |  (?:    ^(.*)\)              )    # Right paren.
      |  (?:    ^[^()]*()$           )    # No parens (empty capture).
    )
  /x;
  return $1 // ();
}

And finally we can remove the grouping (?:...) parens, because alternation already has very low precedence:

sub match {
  shift =~ m/
    (?|
           [^(]*\(([^)]*?)\)    # Both parens.
      |    \((.*)$              # Left paren.
      |    ^(.*)\)              # Right paren.
      |    ^[^()]*()$           # No parens (empty capture).
    )
  /x;
  return $1 // ();
}
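
Since the single regex is meant to mirror the multiple-regex version, a quick side-by-side check seems worthwhile. This is only a sketch: the names match_multi and match_reset are mine (the bodies are the two subs from above), and the fifth input is a made-up edge case that is not in the original test table.

```perl
use strict;
use warnings;
use 5.010;    # branch reset and the // operator need Perl 5.10 or later

# The multiple-regex version from the first update.
sub match_multi {
    local $_ = shift;
        m/  \(([^)]*?)\)  /x    # Both parens.
    ||  m/  \((.*)$       /x    # Left paren.
    ||  m/  ^(.*)\)       /x    # Right paren.
    ||  m/  ^[^()]*()$    /x;   # No parens (empty capture).
    return $1 // ();
}

# The final branch-reset version.
sub match_reset {
    shift =~ m/
        (?|
               [^(]*\(([^)]*?)\)    # Both parens.
          |    \((.*)$              # Left paren.
          |    ^(.*)\)              # Right paren.
          |    ^[^()]*()$           # No parens (empty capture).
        )
    /x;
    return $1 // ();
}

my @inputs = (
    '1 This (is a test) with good parens',
    '2 This is a (test with broken a paren',
    '3 And this would be one) the other way',
    '4 Lastly, no parens',
    'a) (b',    # Hypothetical extra case: stray ")" before a stray "(".
);

for my $text (@inputs) {
    my $multi = match_multi($text);
    my $reset = match_reset($text);
    my $flag  = $multi eq $reset ? 'same' : 'DIFFER';
    printf "%-6s '%s': multi <<%s>>, reset <<%s>>\n", $flag, $text, $multi, $reset;
}
```

If I've traced it correctly, the two agree on the four original test strings but differ on the last input. The separate regexes try the left-paren rule over the whole string before ever trying the right-paren rule, so they return 'b'. The single regex tries every branch at position 0 first, where only the right-paren branch can match, so it returns 'a'. Whether that matters depends on what you want for such input.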

I think that this, being Perl, grants us license to explore in the spirit of There is more than one way to do it. :)

Dave
