regular expression help

Tobin Cataldo has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I am in need of guidance. I am attempting to capture elements that may or may not be present in a string.

Two possible (overly simplistic) strings are below, and I need to capture a, b and c (if they exist).

my $input1 = 'a=1 gibberish b=2 c=3';
my $input2 = 'a=1 gibberish c=3';

I threw this together with a conditional (?) on the b element. I don't get the results I expected, which probably means my expectations are wrong. Any way to get what I want with just a single pass?

my $input1 = 'a=1 gibberish b=2 c=3';
my $input2 = 'a=1 gibberish c=3';

# normal capture
my $match1 =  '^(?<a>a=(\d)).*?(?<b>b=(\d)).*?(?<c>c=(\d))$';
# conditional capture
my $match2 =  '^(?<a>a=(\d)).*?(?<b>b=(\d))?.*?(?<c>c=(\d))$';

# using $match1

print "\n\nUsing match1 ---> $match1\n\n";
if ($input1 =~ m/$match1/){ print "Input1: a is $+{a} and b is $+{b} and c is $+{c}\n"; }
else { print "Input1: didn't match\n"; }

if ($input2 =~ m/$match1/){ print "Input2: a is $+{a} and b is $+{b} and c is $+{c}\n"; }
else { print "Input2: didn't match\n"; }

# using $match2 
# conditional capture on the b element
print "\n\nUsing match2 ---> $match2\n\n";

if ($input1 =~ m/$match2/){ print "Input1: a is $+{a} and b is $+{b} and c is $+{c}\n"; }
else { print "Input1: didn't match\n"; }

if ($input2 =~ m/$match2/){ print "Input2: a is $+{a} and b is $+{b} and c is $+{c}\n"; }
else { print "Input2: didn't match\n"; }

Output :

Using match1 ---> ^(?<a>a=(\d)).*?(?<b>b=(\d)).*?(?<c>c=(\d))$

Input1: a is a=1 and b is b=2 and c is c=3
Input2: didn't match


Using match2 ---> ^(?<a>a=(\d)).*?(?<b>b=(\d))?.*?(?<c>c=(\d))$

Input1: a is a=1 and b is  and c is c=3
Input2: a is a=1 and b is  and c is c=3

Thanks,
Tobin


Re: regular expression help by Fletch (Bishop) on Dec 16, 2009 at 18:01 UTC
Unless there's a really compelling reason to do this in a single regex, I wouldn't.

my %results;
for my $tok ( split /\s+/, $input ) {
    $results{ $1 } = $2 if $tok =~ /^ (.) = (\d+) $/x;
}
for my $k ( qw( a b c ) ) {
    print "$k is ", exists $results{ $k } ? $results{ $k } : 'missing', "\n";
}

Update: Just to clarify, eschewing explicit iteration to get away with just a single regex doesn't really gain you anything and (arguably) makes your code harder to read/maintain/understand ("OK, it's a for loop" vs "ZOMGWTFBBQ is this `?<a` . . . grah, let me open perlre and try and remember").
Re: regular expression help by ikegami (Patriarch) on Dec 16, 2009 at 17:59 UTC
my %data = map split(/=/, $_, 2), grep /^[abc]=/, split ' ';
print "$_ is $data{$_}\n" for keys %data;

Or the more flexible:

my %data = /([^=\s]+)=(\S*)/g;
print "$_ is $data{$_}\n" for grep /^[abc]\z/, keys %data;

Update: Added alternative.
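Both snippets above read from `$_`. As a minimal self-contained sketch of the global-match idea, here it is wrapped in a hypothetical helper `parse_pairs` (the name and the wrapper are assumptions, not part of the reply) and run against the two sample strings from the question. The value part uses `\S*` so that multi-character values are captured whole.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every key=value pair in one global match.
# In list context, /.../g returns all captures as a flat (key, value, ...) list,
# which can be assigned straight to a hash.
sub parse_pairs {
    my ($str) = @_;
    my %data = $str =~ /([^=\s]+)=(\S*)/g;
    return \%data;
}

for my $input ('a=1 gibberish b=2 c=3', 'a=1 gibberish c=3') {
    my $data = parse_pairs($input);
    for my $key (qw(a b c)) {
        my $val = exists $data->{$key} ? $data->{$key} : 'missing';
        print "$key is $val\n";
    }
}
```

Returning a hash reference keeps the "is `b` there at all?" check down to a plain `exists` test, which is the same check the loop-based reply uses.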
Re: regular expression help by jwkrahn (Abbot) on Dec 16, 2009 at 18:09 UTC
$ perl -le'
my @strings = ( "a=1 gibberish b=2 c=3", "a=1 gibberish c=3" );
for ( @strings ) {
    print;
    my ( $a, $b, $c ) = / (?= .* \ba = (\S+) )? (?= .* \bb = (\S+) )? (?= .* \bc = (\S+) )? /x;
    print "\$a = $a \$b = $b \$c = $c";
    }
'
a=1 gibberish b=2 c=3
$a = 1 $b = 2 $c = 3
a=1 gibberish c=3
$a = 1 $b = $c = 3
Re^2: regular expression help by Tobin Cataldo (Monk) on Dec 16, 2009 at 21:54 UTC
Excellent!

?= Zero-width positive lookahead assertion.
?! Zero-width negative lookahead assertion.

These constructs really add some needed depth to pattern matching. Thanks for the crash course.
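As a small illustration of the two assertions quoted above, the sketch below tests the question's sample strings for the presence or absence of a `b=` element. The helper names `has_b` and `lacks_b` are made up for this example. A lookahead checks what follows the current position without consuming any of it.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# Positive lookahead: true if "b=<digit>" occurs somewhere in the string.
# Anchored at ^, the assertion scans ahead but consumes nothing.
sub has_b   { return $_[0] =~ /^(?=.*\bb=\d)/ ? 1 : 0 }

# Negative lookahead: true if "b=<digit>" occurs nowhere in the string.
sub lacks_b { return $_[0] =~ /^(?!.*\bb=\d)/ ? 1 : 0 }

for my $input ('a=1 gibberish b=2 c=3', 'a=1 gibberish c=3') {
    printf "%-24s has_b=%d lacks_b=%d\n", $input, has_b($input), lacks_b($input);
}
```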
Re: regular expression help by AnomalousMonk (Archbishop) on Dec 16, 2009 at 20:10 UTC
I agree with previous replies that there is no point to using a 'one big regex' solution when the job can be done more clearly, even if with more statements.

That said, I could not understand why the `$match2` regex would not extract the `b` named capture from the `$input1` string: there is clearly a `'b=2'` substring there to be extracted. A little instrumentation made the situation clear.

The first `/.*?/` lazy expression is immediately satisfied by matching nothing, and the optional `/(?<b>b=(\d))?/` capture, tried at that same position, finds no `b=` there and so is satisfied by matching zero times. That leaves the second `/.*?/` lazy expression and the final required expression at the end of the line to match. The second `/.*?/` tries to match with nothing, but then must backtrack (forward-track?) and consume all text (including the `'b=2'` substring) until the final required expression matches in order to achieve an overall match. This is made clear in the following (note the added named capture around the second `/.*?/`):

>perl -wMstrict -le
"my $input1 = 'a=1 gibberish b=2 c=3';
 my $match2 = '^(?<a>a=(\d)).*?(?<b>b=(\d))?(?<gib>.*?)(?<c>c=(\d))$';
 print qq{input1 '$input1'};
 print qq{Using regex $match2};
 if ($input1 =~ m/$match2/){
   print qq{a '$+{a}' b '$+{b}' c '$+{c}' gib '$+{gib}'};
   }
 else {
   print qq{Input1: didn't match};
   }
 "
input1 'a=1 gibberish b=2 c=3'
Using regex ^(?<a>a=(\d)).*?(?<b>b=(\d))?(?<gib>.*?)(?<c>c=(\d))$
Use of uninitialized value in concatenation (.) or string at ...
a 'a=1' b '' c 'c=3' gib ' gibberish b=2 '
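Following that explanation, one possible single-pass fix (a sketch, not something proposed in the thread) is to move the first lazy skip inside the optional group, so the group as a whole gets a chance to scan forward to `b=` before the rest of the pattern is tried. The names `$match3` and `extract_abc` are hypothetical.

```perl
use strict;
use warnings;
use 5.010;    # named captures and the // operator

# Assumed fix: the lazy skip and the b capture are made optional together.
# The (?: ... )? group is greedy, so it is attempted first and is only
# skipped when the match cannot be completed with it.
my $match3 = qr/^(?<a>a=(\d))(?:.*?(?<b>b=(\d)))?.*?(?<c>c=(\d))$/;

# Hypothetical helper: returns a hash ref of the named captures,
# or undef when the string does not match at all.
sub extract_abc {
    my ($str) = @_;
    return unless $str =~ $match3;
    return { a => $+{a}, b => $+{b}, c => $+{c} };
}

for my $input ('a=1 gibberish b=2 c=3', 'a=1 gibberish c=3') {
    my $r = extract_abc($input) or next;
    printf "a is %s and b is %s and c is %s\n",
        $r->{a}, $r->{b} // 'missing', $r->{c};
}
```

With the two sample strings this reports `b=2` for the first and `missing` for the second. The earlier point still stands, though: the loop-based approaches are easier to read and extend.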