in reply to Need a Regular Expression that tests for words in different order and captures the values found.

Use a zero-width positive lookahead assertion.

    $string = "This is barney rubble and his friends joe rockhead and fred flintstone";
    $string =~ /(?=.*fred (\w+))?(?=.*barney (\w+))?(?=.*joe (\w+))?/;
    $company = $1 . '_' . $2 . '_' . $3 . '_' . 'inc';
    print "$company\n"    # "flintstone_rubble_rockhead_inc"

This prints "flintstone_rubble_rockhead_inc". It doesn't fail if one or more names are missing, and keeps the order of your captures -- that is, the word following barney is always $2 (if barney's there), even if fred is missing.

    $string = "This is bLarney rubble and his friends joe rockhead and fred flintstone";
    $string =~ /(?=.*fred (\w+))?(?=.*barney (\w+))?(?=.*joe (\w+))?/;
    $company = $1 . '_' . $2 . '_' . $3 . '_' . 'inc';
    print "$company\n"    # "flintstone__rockhead_inc"

Re^2: Need a Regular Expression that tests for words in different order and captures the values found.
by AnomalousMonk (Archbishop) on Jan 15, 2010 at 17:36 UTC
    It doesn't fail if one or more names are missing...

    But isn't this a bug rather than a feature? Is there any point to matching on a string that contains none of the target substrings (and then interpolates a bunch of undefined values)?

    Also, as pointed out by other respondents, (?=.*fred (\w+)) will match 'alfred the great' (capturing 'the') in addition to 'fred flintstone'.
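
    A minimal sketch of that false positive, for anyone who wants to see it run. The helper word_after_fred and the variable names are made up for illustration and are not from the thread; the second pattern simply adds \b before the name, which is one possible fix among several.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not from the thread): returns the word captured
    # after 'fred' by the given pattern, or undef if the pattern fails.
    sub word_after_fred {
        my ($pattern, $string) = @_;
        return $string =~ $pattern ? $1 : undef;
    }

    my $loose    = qr/(?=.*fred (\w+))/;      # lookahead as posted
    my $anchored = qr/(?=.*\bfred (\w+))/;    # same, with \b before the name

    for my $s ('alfred the great', 'fred flintstone') {
        my $plain = word_after_fred($loose,    $s);
        my $fixed = word_after_fred($anchored, $s);
        printf "%-18s loose: %-12s anchored: %s\n",
            $s,
            defined $plain ? $plain : '(no match)',
            defined $fixed ? $fixed : '(no match)';
    }
    ```

    The loose pattern captures 'the' from 'alfred the great'; the anchored one does not match that string at all, while both still capture 'flintstone' from 'fred flintstone'.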

      Hmmmm... good point. Well, you can fix part of it with (?=.*\bfred\b (\w+)), but avoiding a match when none of the targets is present is still a problem. The pattern I suggested always succeeds, because each ? says "find zero or one of this", so a missing name just leaves its capture undefined. That means either you can't tell from the match alone how many names were actually found, or you drop the ? and can only match when all the targets are present. Separate matches in a loop, as suggested by several others, are the way to go. Here's my take, redux:

          $string = "This is bLarney rubble and his friends joe rockhead and fred flintstone";
          $count = 0;
          for $target (qw(fred barney joe)) {
              if ( $string =~ /(?=.*\b$target (\w+))/i ) {
                  push @elements, $1;
                  $count++;
              }
              else {
                  push @elements, '';    # as a placeholder
              }
          }
          if ($count >= 2) {
              print join('_', @elements), "_inc\n"
          }
          else {
              print "Didn't find at least 2 elements in the string\n"
          }
          # prints flintstone__rockhead_inc
          # change 'joe' to 'moe' and you get
          # > Didn't find at least 2 elements in the string

      There ought to be something useful in there. :-)

      --marmot

        I missed the '2 or 3' requirement on first reading of the OP. For the sake of maintainability if nothing else, a looping (loopy?) approach may, as you say, be the way to go.

        However, there is a simple way to deal with the undefined values produced by zero-quantified captures:

            >perl -wMstrict -le
            "my $bound = qr{ (?<! [\w-]) }xms;
             my $A = qr{ (?= .* $bound A \s+ (\w+)) }xms;
             my $B = qr{ (?= .* $bound B \s+ (\w+)) }xms;
             my $C = qr{ (?= .* $bound C \s+ (\w+)) }xms;
             my $extract = qr{ \A $A? $B? $C? }xms;
             print '-------------------';
             for my $line (@ARGV) {
               my $s = join '_', my @got = grep defined, $line =~ $extract;
               $s = 'no match' if @got < 2 or @got > 3;
               print qq{'$line': '$s'};
               }
            "
            "B Bee C Cee A Aye"  "foo C Cee bar A Aye baz B Bee zzz"
            "A Aye B Bee"  "C Cee foo B Bee"  "xxx C Chuck yyyy A Able zzz"
            "A Aye A Aye B Bee"  foo  "A Aye"
            "A Aye pseudo-B Bee"  "A Aye XYZB Bee"  "A Aye A Aye A Aye"
            -------------------
            'B Bee C Cee A Aye': 'Aye_Bee_Cee'
            'foo C Cee bar A Aye baz B Bee zzz': 'Aye_Bee_Cee'
            'A Aye B Bee': 'Aye_Bee'
            'C Cee foo B Bee': 'Bee_Cee'
            'xxx C Chuck yyyy A Able zzz': 'Able_Chuck'
            'A Aye A Aye B Bee': 'Aye_Bee'
            'foo': 'no match'
            'A Aye': 'no match'
            'A Aye pseudo-B Bee': 'no match'
            'A Aye XYZB Bee': 'no match'
            'A Aye A Aye A Aye': 'no match'
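
        The $bound lookbehind is stricter than a plain \b, which is why 'pseudo-B' is rejected above: a hyphen is not a word character, so \b would see a boundary right before the B. A rough side-by-side sketch of the difference, where the helper word_after_B and the name $word_bound are invented for illustration:

        ```perl
        use strict;
        use warnings;

        my $word_bound = qr{ \b }xms;             # plain word boundary
        my $bound      = qr{ (?<! [\w-]) }xms;    # boundary used in the one-liner

        # Hypothetical helper: the word following a standalone 'B', or undef
        # if no 'B' in the line satisfies the given boundary.
        sub word_after_B {
            my ($boundary, $line) = @_;
            return $line =~ m{ $boundary B \s+ (\w+) }xms ? $1 : undef;
        }

        for my $line ('A Aye pseudo-B Bee', 'A Aye XYZB Bee', 'A Aye B Bee') {
            for my $pair ([ '\b' => $word_bound ], [ '(?<![\w-])' => $bound ]) {
                my ($label, $re) = @$pair;
                my $got = word_after_B($re, $line);
                print "$label on '$line': ", defined $got ? $got : 'no match', "\n";
            }
        }
        ```

        With \b the hyphenated 'pseudo-B' still yields 'Bee'; with the lookbehind it does not match. Both reject 'XYZB' and both accept a standalone 'B'.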