in reply to Regex to match text in broken parens
I really don't mind using more than one regex for this one. You're dealing with more than one rule, so there's a nice symmetry; each rule has corresponding code. If you are concerned with it being verbose where you want it to be terse, move the work out to a subroutine.
Anyway, with those ideas, here's my version:
    use Test::More;

    my @test = (
        [ '1 This (is a test) with good parens'    => 'is a test',                'Match in parens' ],
        [ '2 This is a (test with broken a paren'  => 'test with broken a paren', 'Match after left paren' ],
        [ '3 And this would be one) the other way' => '3 And this would be one',  'Match before right paren' ],
        [ '4 Lastly, no parens'                    => '',                         'No match' ],
    );

    foreach my $test (@test) {
        my $got = match( $test->[0] );
        is( $got, $test->[1], "$test->[2]: <<$got>>" );
    }

    done_testing();

    sub match {
        for (shift) {
            m/ \(([^)]*?)\) /x && return $1;    # Both parens.
            m/ \((.*)$ /x      && return $1;    # Left paren.
            m/ ^(.*)\) /x      && return $1;    # Right paren.
            m/ ^[^()]*()$ /x   && return $1;    # No parens (no capture).
            return;                             # Unreachable.
        }
    }
Update: As often happens, I just have to go to bed to have an idea disturb me. Here's an improvement (I think) on sub match:
    sub match {
        local $_ = shift;
           m/ \(([^)]*?)\) /x    # Both parens.
        || m/ \((.*)$ /x         # Left paren.
        || m/ ^(.*)\) /x         # Right paren.
        || m/ ^[^()]*()$ /x;     # No parens (no capture).
        return $1 // ();
    }
Here's another version that combines the logic above into a single regex using alternation. I don't necessarily think this is better; I prefer the simplicity of breaking things into smaller regexes.
    sub match {
        shift =~ m/
              (?: [^(]*\((?<C>[^)]*?)\) )    # Both parens.
            | (?: \((?<C>.*)$ )              # Left paren.
            | (?: ^(?<C>.*)\) )              # Right paren.
            | (?: ^[^()]*(?<C>)$ )           # No parens (empty capture).
        /x;
        return $+{C} // ();
    }
By using named captures we avoid the problem where other single-regex solutions result in either $1, or $2, or $3 being populated. That's too much to keep track of, and could be error prone. Instead, we name every capture the same: $+{C}. (Warning: After checking perlre, I'm of the vague and uncertain impression that this could rely on undefined behavior.)
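One way to probe that concern is to look at what %+ and %- actually hold after a match with duplicate group names. My reading of perlre and perlvar is that $+{NAME} refers to the leftmost defined group with that name, and that $-{NAME} holds one slot per group in pattern order, but treat that as something to verify on your own perl rather than a guarantee. The helper name capture_slots is made up for this sketch, and the pattern is a cut-down two-branch version of the one above:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): returns the value of
# $+{C} plus a copy of the per-group slots that %- records for the name C.
sub capture_slots {
    my ($text) = @_;
    return unless $text =~ m/
          \( (?<C> [^)]* ) \)    # Both parens.
        | \( (?<C> .*    ) $     # Left paren only.
    /x;
    return ( $+{C}, [ @{ $-{C} } ] );
}

my ( $value, $slots ) = capture_slots('a (broken paren');
printf "value='%s', groups named C: %d\n", $value, scalar @$slots;
```

If the slots come back as one undefined entry followed by the captured text, then $+{C} is doing what we want here: it skips the group that did not participate in the match.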
Update: Having a little fun with this. Here are two more options with subtle changes from the previous.
The next example eliminates named captures. That presents a problem: the numeric match variable that receives the capture could be $1, $2, $3, or $4, depending on which branch matched. choroba avoids this issue by concatenating all possible numeric match variables, but that means possibly interpolating undef, and feels a little dirty (but it is clever). We can avoid that by using $^N, which contains the text of the most recently closed capture group.
    sub match {
        shift =~ m/
              (?: [^(]*\(([^)]*?)\) )    # Both parens.
            | (?: \((.*)$ )              # Left paren.
            | (?: ^(.*)\) )              # Right paren.
            | (?: ^[^()]*()$ )           # No parens (empty capture).
        /x;
        return $^N // ();
    }
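To make the numbering problem concrete, here is a small sketch that reports which numbered group ended up defined for each kind of input, alongside what $^N holds. The helper name which_group is invented for illustration, and the pattern is the first three branches of the regex above with extra whitespace:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the numbers of the capture groups that are
# defined after the match, plus the value of $^N.
sub which_group {
    my ($text) = @_;
    return unless $text =~ m/
          [^(]* \( ( [^)]*? ) \)    # Group 1: both parens.
        | \( ( .* ) $               # Group 2: left paren only.
        | ^  ( .* ) \)              # Group 3: right paren only.
    /x;
    my $last = $^N;                 # Save before anything else can match.
    my @caps = ( $1, $2, $3 );
    my @set  = grep { defined $caps[ $_ - 1 ] } 1 .. 3;
    return ( \@set, $last );
}

for my $text ( 'one (two) three', 'one (two', 'one) two' ) {
    my ( $set, $last ) = which_group($text);
    print "$text: group(s) @$set defined, \$^N = '$last'\n";
}
```

A different group number should show up for each input while $^N always carries the captured text, which is the whole reason for reaching for it.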
This next one wraps all the alternation branches in the (?|...) branch reset construct. That means each alternative uses the same $1, which is the closest I can come, within a single regex, to the multiple-regex solutions I originally presented.
    sub match {
        shift =~ m/
            (?|
                  (?: [^(]*\(([^)]*?)\) )    # Both parens.
                | (?: \((.*)$ )              # Left paren.
                | (?: ^(.*)\) )              # Right paren.
                | (?: ^[^()]*()$ )           # No parens (empty capture).
            )
        /x;
        return $1 // ();
    }
And finally we can remove the (?:...) grouping parens, because alternation already has very low precedence:
    sub match {
        shift =~ m/
            (?|
                  [^(]*\(([^)]*?)\)    # Both parens.
                | \((.*)$              # Left paren.
                | ^(.*)\)              # Right paren.
                | ^[^()]*()$           # No parens (empty capture).
            )
        /x;
        return $1 // ();
    }
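With this many variants floating around, a side-by-side check is cheap insurance. The sketch below runs the multi-regex version and the branch-reset version over the same four inputs used in the tests at the top. The names match_multi and match_reset are made up so both can live in one file, and match_reset takes its argument through a lexical rather than a bare shift, purely as a matter of taste:

```perl
use strict;
use warnings;

# Multi-regex version from the first update, renamed for the comparison.
sub match_multi {
    local $_ = shift;
       m/ \(([^)]*?)\) /x    # Both parens.
    || m/ \((.*)$ /x         # Left paren.
    || m/ ^(.*)\) /x         # Right paren.
    || m/ ^[^()]*()$ /x;     # No parens (no capture).
    return $1 // ();
}

# Branch-reset version, renamed for the comparison.
sub match_reset {
    my ($text) = @_;
    $text =~ m/
        (?|
              [^(]*\(([^)]*?)\)    # Both parens.
            | \((.*)$              # Left paren.
            | ^(.*)\)              # Right paren.
            | ^[^()]*()$           # No parens (empty capture).
        )
    /x;
    return $1 // ();
}

my @inputs = (
    '1 This (is a test) with good parens',
    '2 This is a (test with broken a paren',
    '3 And this would be one) the other way',
    '4 Lastly, no parens',
);

for my $input (@inputs) {
    my $multi = match_multi($input);
    my $reset = match_reset($input);
    printf "%-42s %s\n", $input, $multi eq $reset ? 'agree' : 'DIFFER';
}
```

This only covers single-line inputs like the ones in the original question; strings with embedded newlines are a separate matter that none of the versions here try to handle.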
I think that this, being Perl, grants us license to explore in the spirit of There is more than one way to do it. :)
Dave