Tobin Cataldo has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I am in need of guidance. I am attempting to capture elements that may or may not be present in a string.

Two possible (overly simplistic) strings are below, and I need to capture a, b and c (if they exist).

    my $input1 = 'a=1 gibberish b=2 c=3';
    my $input2 = 'a=1 gibberish c=3';

I threw this together with a conditional (?) on the b element. I don't get the results I expected, which probably means my expectations are wrong. Any way to get what I want with just a single pass?

    my $input1 = 'a=1 gibberish b=2 c=3';
    my $input2 = 'a=1 gibberish c=3';

    # normal capture
    my $match1 = '^(?<a>a=(\d)).*?(?<b>b=(\d)).*?(?<c>c=(\d))$';

    # conditional capture
    my $match2 = '^(?<a>a=(\d)).*?(?<b>b=(\d))?.*?(?<c>c=(\d))$';

    # using $match1
    print "\n\nUsing match1 ---> $match1\n\n";
    if ($input1 =~ m/$match1/){
        print "Input1: a is $+{a} and b is $+{b} and c is $+{c}\n";
    } else {
        print "Input1: didn't match\n";
    }
    if ($input2 =~ m/$match1/){
        print "Input2: a is $+{a} and b is $+{b} and c is $+{c}\n";
    } else {
        print "Input2: didn't match\n";
    }

    # using $match2
    # conditional capture on the b element
    print "\n\nUsing match2 ---> $match2\n\n";
    if ($input1 =~ m/$match2/){
        print "Input1: a is $+{a} and b is $+{b} and c is $+{c}\n";
    } else {
        print "Input1: didn't match\n";
    }
    if ($input2 =~ m/$match2/){
        print "Input2: a is $+{a} and b is $+{b} and c is $+{c}\n";
    } else {
        print "Input2: didn't match\n";
    }

Output :

    Using match1 ---> ^(?<a>a=(\d)).*?(?<b>b=(\d)).*?(?<c>c=(\d))$

    Input1: a is a=1 and b is b=2 and c is c=3
    Input2: didn't match


    Using match2 ---> ^(?<a>a=(\d)).*?(?<b>b=(\d))?.*?(?<c>c=(\d))$

    Input1: a is a=1 and b is and c is c=3
    Input2: a is a=1 and b is and c is c=3

Thanks,
Tobin

Replies are listed 'Best First'.
Re: regular expression help
by Fletch (Bishop) on Dec 16, 2009 at 18:01 UTC

    Unless there's a really compelling reason to do this in a single regex I wouldn't.

        my %results;
        for my $tok ( split /\s+/, $input ) {
            $results{ $1 } = $2 if $tok =~ /^ (.) = (\d+) $/x;
        }

        for my $k ( qw( a b c ) ) {
            print "$k is ", exists $results{ $k } ? $results{ $k } : 'missing', "\n";
        }

    Update: Just to clarify, eschewing explicit iteration to get away with just a single regex doesn't really gain you anything and (arguably) makes your code harder to read/maintain/understand ("OK, it's a for loop" vs "ZOMGWTFBBQ is this ?<a . . . grah, let me open perlre and try and remember")

    The cake is a lie.

Re: regular expression help
by ikegami (Patriarch) on Dec 16, 2009 at 17:59 UTC
        my %data = map split(/=/, $_, 2), grep /^[abc]=/, split ' ';
        print "$_ is $data{$_}\n" for keys %data;

    Or the more flexible:

        my %data = /([^=\s]+)=(\S*)/g;
        print "$_ is $data{$_}\n" for grep /^[abc]\z/, keys %data;

    Update: Added alternative.

Re: regular expression help
by jwkrahn (Abbot) on Dec 16, 2009 at 18:09 UTC
        $ perl -le'
        my @strings = ( "a=1 gibberish b=2 c=3", "a=1 gibberish c=3" );
        for ( @strings ) {
            print;
            my ( $a, $b, $c ) = /
                (?= .* \ba = (\S+) )?
                (?= .* \bb = (\S+) )?
                (?= .* \bc = (\S+) )?
                /x;
            print "\$a = $a \$b = $b \$c = $c";
            }
        '
        a=1 gibberish b=2 c=3
        $a = 1 $b = 2 $c = 3
        a=1 gibberish c=3
        $a = 1 $b = $c = 3

      Excellent!

        ?=  Zero-width positive lookahead assertion.
        ?!  Zero-width negative lookahead assertion.

      These constructs really add some needed depth to pattern matching.

      Thanks for the crash course.
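For anyone else meeting these two assertions for the first time, here is a minimal sketch that uses both at once. The helper name has_c_without_b is made up for illustration and is not from any of the replies above. It returns true only when a string has a c= field and no b= field. Because neither assertion consumes any text, both are tested from the start of the string.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 when the string
# contains a "c=" field but no "b=" field, and 0 otherwise.
sub has_c_without_b {
    my ($s) = @_;
    # (?! ... ) fails the match if "b=" occurs anywhere ahead;
    # (?= ... ) requires "c=" to occur somewhere ahead.
    return $s =~ /^(?!.*\bb=)(?=.*\bc=)/ ? 1 : 0;
}

for my $s ('a=1 gibberish b=2 c=3', 'a=1 gibberish c=3') {
    printf "%-25s => %d\n", $s, has_c_without_b($s);
}
```

The \b word boundary keeps the second b in 'gibberish' from being considered as the start of a b= field.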

Re: regular expression help
by AnomalousMonk (Archbishop) on Dec 16, 2009 at 20:10 UTC

    I agree with previous replies that there is no point to using a 'one big regex' solution when the job can be done more clearly, even if with more statements.

    That said, I could not understand why the $match2 regex would not extract the b named capture from the $input1 string: there is clearly a 'b=2' substring there to be extracted.

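    A little instrumentation made the situation clear. The first /.*?/ lazy expression is immediately satisfied by matching nothing, and the optional /(?<b>b=(\d))?/ capture is then satisfied by matching zero times, because the text at that position is ' gibberish', not 'b='. That leaves the second /.*?/ lazy expression and the final required expression at the end of the line to match. The second /.*?/ first tries to match nothing, but then must expand one character at a time and consume all the intervening text (including the 'b=2' substring) until the final required expression matches. Because that already yields an overall match, the regex engine never goes back to lengthen the first /.*?/ so that the b capture could take part.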

    This is made clear in the following (note the added named capture around the second /.*?/):

        >perl -wMstrict -le
        "my $input1 = 'a=1 gibberish b=2 c=3';
         my $match2 = '^(?<a>a=(\d)).*?(?<b>b=(\d))?(?<gib>.*?)(?<c>c=(\d))$';
         print qq{input1 '$input1'};
         print qq{Using regex $match2};
         if ($input1 =~ m/$match2/){
           print qq{a '$+{a}' b '$+{b}' c '$+{c}' gib '$+{gib}'};
         } else {
           print qq{Input1: didn't match};
         }
        "
        input1 'a=1 gibberish b=2 c=3'
        Using regex ^(?<a>a=(\d)).*?(?<b>b=(\d))?(?<gib>.*?)(?<c>c=(\d))$
        Use of uninitialized value in concatenation (.) or string at ...
        a 'a=1' b '' c 'c=3' gib ' gibberish b=2 '
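If a single pass really is wanted, one possible fix (a sketch, not something posted in the thread) is to move the first lazy gap inside the optional group, so that the group has to search forward for 'b=' before it is allowed to give up. The helper name parse_abc is hypothetical, and the sketch assumes Perl 5.10 or later for named captures and the // operator.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a hash ref holding the
# named captures, or undef when the string does not match at all.
sub parse_abc {
    my ($input) = @_;
    # The lazy gap before b now sits inside the optional group, so the
    # group scans ahead for "b=" and matches zero times only if none exists.
    return undef
        unless $input =~ m/^(?<a>a=(\d))(?:.*?(?<b>b=(\d)))?.*?(?<c>c=(\d))$/;
    return { a => $+{a}, b => $+{b}, c => $+{c} };
}

for my $input ('a=1 gibberish b=2 c=3', 'a=1 gibberish c=3') {
    my $r = parse_abc($input);
    # $r->{b} is undef when the optional group did not take part in the match.
    printf "a is %s, b is %s, c is %s\n",
        $r->{a}, $r->{b} // 'missing', $r->{c};
}
# prints:
# a is a=1, b is b=2, c is c=3
# a is a=1, b is missing, c is c=3
```

This works because the ? quantifier on the group is greedy: the engine tries the group first and skips it only after the lazy gap has run to the end of the string without finding 'b=' followed by a digit.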