greatwazzoo has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Perl 5.8 and I need to find a regular expression that finds 2 or 3 words on the same
line (no matter what the order) and captures the values beside the matching words.

For example, my file has several lines, and I'm trying to parse it for lines
containing "Fred Flintstone", "Barney Rubble" and "Joe Rockhead":

    open $file_fh, "$file" or die "unable to open $file for read : $!";
    @lines = <$file_fh>;
    foreach $line (@lines) {
        if ( $line =~ /fred\s+(\w+)\s+barney\s+(\w+)\s+joe\s+(\w+)/i ) {
            $company = $1 . "_" . $2 . "_" . $3 . "_" . "inc";
            # this would assign "flintstone_rubble_rockhead_inc" to $company
            # but how could I come up with the same string, if the 3 names were in
            # the six possible sequences?
            # 123, 132, 213, 231, 312, 321 etc,..
            # so that it would still assign the word following "fred" as the first
            # capture, the word following "barney" as the 2nd capture, and the word
            # following "joe" as the 3rd capture, no matter what order the names
            # are on the line.
        }
    }

Thanks

Replies are listed 'Best First'.
Re: Need a Regular Expression that tests for words in different order and captures the values found.
by ikegami (Patriarch) on Jan 15, 2010 at 06:51 UTC
    my %words;
    ++$words{$_} for /\w+/g;

    for (
        [qw( Fred Flintstone )],
        [qw( Barney Rubble )],
        [qw( Joe Rockhead )],
    ) {
        next if @$_ != grep $words{$_}, @$_;
        print("matched @$_\n");
    }

    Update: Oops, missed "captures the values beside the matching words". Fixed:

    my %next_words;
    # The second word is captured inside the lookahead so that pairs can overlap.
    $next_words{$1} = $2 while /(\w+)\s+(?=(\w+))/g;

    for (
        [qw( Fred Flintstone )],
        [qw( Barney Rubble )],
        [qw( Joe Rockhead )],
    ) {
        next if @$_ != grep defined($next_words{$_}), @$_;
        print(join(' ', map "$_ $next_words{$_}", @$_), "\n");
    }
Re: Need a Regular Expression that tests for words in different order and captures the values found.
by jwkrahn (Abbot) on Jan 15, 2010 at 07:11 UTC
    open my $file_fh, '<', $file or die "unable to open $file for read : $!";

    my $pattern = qr/
        (?= .* fred   \s+ (\w+) )
        (?= .* barney \s+ (\w+) )
        (?= .* joe    \s+ (\w+) )
    /ix;

    while ( my $line = <$file_fh> ) {
        if ( $line =~ /$pattern/ ) {
            $company = join '_', $1, $2, $3, 'inc';
        }
    }

      You need to add \b or something in front of fred, barney and joe. You're not supposed to be matching alfred.
        Thanks,
        What would I do in the case of hyphens? Say I had a line that contained
        "pseudo-fred flintstone" and I wanted to skip it because this wasn't the real fred that I was keying on.

        $line =~ /(?=.*\bfred\s+(\w+))/;
        # would get "fred" and anything "-fred"
        # how would I avoid that?
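One possible answer to the hyphen question is a negative lookbehind, since \b treats the position after "-" as a word boundary. The sketch below assumes that a name should count only when it is not preceded by a word character or a hyphen; word_after_fred is a hypothetical helper name used purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the word following a standalone "fred",
# or undef when "fred" is only the tail of a longer or hyphenated word.
sub word_after_fred {
    my ($line) = @_;
    # (?<![\w-]) fails when the preceding character is a word character or "-".
    return $line =~ /(?<![\w-])fred\s+(\w+)/i ? $1 : undef;
}

for my $line ('pseudo-fred flintstone', 'alfred flintstone', 'fred flintstone') {
    my $word = word_after_fred($line);
    printf "%-24s => %s\n", $line, defined $word ? $word : '(skipped)';
}
```

The same lookbehind should drop into the lookahead form as well, for example (?=.*(?<![\w-])fred\s+(\w+)). Fixed-width lookbehind is available in Perl 5.8.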
      Great thanks, I was able to get it to work(sortof) with this line:
      $line =~ /^(?=.*fred\s+(\w+))(?=.*barney\s+(\w+))(?=.*joe\s+(\w+))/;
      $company = join '_', $1, $2, $3, 'inc';
        ... I was able to get it to work(sortof) ...

        This implies the solution does not entirely fulfill your needs. In what way does it fall short?

        Heh, I just replied with this very solution (see below). Put a question mark after each closing parenthesis and it should work.

        $line =~ /^(?=.*fred\s+(\w+))?(?=.*barney\s+(\w+))?(?=.*joe\s+(\w+))?/;
        $company = join '_', $1, $2, $3, 'inc';
        --marmot
Re: Need a Regular Expression that tests for words in different order and captures the values found.
by Lain78 (Initiate) on Jan 15, 2010 at 12:31 UTC

    Scratching the idea I've come up with this snippet. Hope it's useful!

    @Lines = (
        "fred flinstones barney rubble joe rockhead",
        "fred flinstones joe rockhead barney rubble",
        "barney rubble fred flinstones joe rockead",
        "barney rubble joe rockhead fred flinstones",
        "joe rockhead fred flinstones barney rubble",
        "joe rockead barney rubble fred flinstones",
    );

    $LineNum = 0;
    foreach $Line (@Lines) {
        ++$LineNum;
        @Match = ();
        undef ($Company);
        @Match = $Line =~ /(?:\b(?:fred|barney|joe)\s+(\w+))+/g;
        $Company = join ("_", @Match) . '_inc' if (@Match);
        ($Company) ? (print "[$LineNum]: Company = $Company\n") : (next);
    }
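Note that the alternation above returns the captured words in the order they appear on the line, not in fred/barney/joe order. If the fixed order from the original question matters, one option is to key the captures by name first. This is only a sketch: company_for is a hypothetical helper, and requiring all three names is an assumption (the question also mentions "2 or 3").

```perl
use strict;
use warnings;

# Hypothetical helper: builds the company name in fred/barney/joe order,
# or returns undef unless all three names are each followed by a word.
sub company_for {
    my ($line) = @_;
    my %next;
    # Each /g iteration finds one name and the word after it, wherever it sits.
    while ( $line =~ /\b(fred|barney|joe)\s+(\w+)/gi ) {
        $next{ lc $1 } = $2;    # a later duplicate overwrites an earlier one
    }
    my @parts = grep { defined } @next{qw(fred barney joe)};
    return @parts == 3 ? join( '_', @parts, 'inc' ) : undef;
}

print company_for('joe rockhead fred flintstone barney rubble'), "\n";
# prints flintstone_rubble_rockhead_inc
```

Unlike the lookahead solutions, this match consumes the word after each name, so a line such as "joe fred flintstone" would record only joe.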
Re: Need a Regular Expression that ... (Semi-OT)
by ww (Archbishop) on Jan 15, 2010 at 15:10 UTC

    As a latecomer to this thread, ++ to replies above as to syntax, applied to the logic of (some of?) OP's spec.

    My problem is with the spec, "find a regular expression that finds 2 or 3 words on the same line (no matter what the order) and captures the values beside the matching words." which also includes this requirement: "to parse the file for lines containing "Fred Flintstone" "Barney Rubble" "Joe Rockhead"...."

    #!/usr/bin/perl
    use strict;
    use warnings;
    # 817539

    my $pattern = qr/
        ^(\d)
        (?= .* \bfred   \s+ (Flintstone) )
        (?= .* \bbarney \s+ (Rubble) )
        (?= .* \bjoe    \s+ (Rockhead) )
    /ix;

    while ( my $line = <DATA> ) {
        chomp $line;
        if ( $line =~ /$pattern/ ) {
            my $lineno  = $1;
            my $company = join '_', $2, $3, $4, 'inc';
            print "$lineno $company \n";
        }
        elsif ( $line =~ /(\d).*/ ) {
            my $lineno = $1;
            print "$lineno, |$line|, does not match\n";
        }
    }
    __DATA__
    1 bar Fred Flintstone Barney Rubble Joe Rockhead Alfred E Neuman
    2 Joe Rockhead AE Neuman baz Fred Flintstone (does not contain name2)
    3 Barney Rubble bat Fred Flintstone Joe Rockhead AE Neuman
    4 Barney Rubble bat Joe Rockhead AE Neuman Fred Flintstone
    5 Joe Jones bar Fred Flintstone Barney Rubble Alfred E Neuman(does not contain Name3)
    6 Barney Jones Barney Rubble Joe Rockhead Fred Smith (does not contain Name1 OR Name2)
    7 Barney Rubble Fred Flintstone Joe Rockhead
    8 Joe Rockhead Fred Flintstone Barney Rubble
    9 Joe Rockhead Alfred Flintstone Barney Rubble (has Alfred sted Fred)
    0 Joe Rockhead Fred Smith Barney Rubble (has Smith sted Flintstone)

    Output:

    1 Flintstone_Rubble_Rockhead_inc
    2, |2 Joe Rockhead AE Neuman baz Fred Flintstone (does not contain name2)|, does not match
    3 Flintstone_Rubble_Rockhead_inc
    4 Flintstone_Rubble_Rockhead_inc
    5, |5 Joe Jones bar Fred Flintstone Barney Rubble Alfred E Neuman(does not contain Name3)|, does not match
    6, |6 Barney Jones Barney Rubble Joe Rockhead Fred Smith (does not contain Name1 OR Name2)|, does not match
    7 Flintstone_Rubble_Rockhead_inc
    8 Flintstone_Rubble_Rockhead_inc
    9, |9 Joe Rockhead Alfred Flintstone Barney Rubble (has Alfred sted Fred)|, does not match
    0, |0 Joe Rockhead Fred Smith Barney Rubble (has Smith sted Flintstone)|, does not match

    In other words, is OP's spec, "2 or 3 words on the same line (no matter what the order)", sound and reasonable (esp. in light of the restriction to "Rubble,..." in another part of the question)?

    I raise the question because it appears that OP is trying to construct company names (say, for example, "Flintstone_Rubble_Rockhead_Inc") without regard to the fact that "Flintstone_Rubble_Rockhead_Inc" is not the same company as "Rubble_Flintstone_Rockhead_Inc".

    BTW, Lain78's use of alternation addresses my question about the significance of the order of names but veers off the One_True_Way (True_One_Way ??) in failing to use strict; use warnings;.

    Update: Fixed missing i tag in 2nd quote; added "requirement" for clarity -- both in 2nd para.

    Update 2: OP's update in Re 4, "parsing settings from a config file" casts the problem very differently (assuming the inconsistent ordering in the config is inconsequential) but highlights a thought for any who consider posing a question with the real need obfuscated in a scenario unrelated to your real purpose: Don't; you may be wasting your time and that of the Monks who try to help.

Re: Need a Regular Expression that tests for words in different order and captures the values found.
by furry_marmot (Pilgrim) on Jan 15, 2010 at 15:35 UTC

    Use a zero-width positive lookahead assertion.

    $string = "This is barney rubble and his friends joe rockhead and fred flintstone";
    $string =~ /(?=.*fred (\w+))?(?=.*barney (\w+))?(?=.*joe (\w+))?/;
    $company = $1 . '_' . $2 . '_' . $3 . '_' . 'inc';
    print "$company\n";    # "flintstone_rubble_rockhead_inc"

    This prints "flintstone_rubble_rockhead_inc". It doesn't fail if one or more names are missing, and keeps the order of your captures -- that is, the word following barney is always $2 (if barney's there), even if fred is missing.

    $string = "This is bLarney rubble and his friends joe rockhead and fred flintstone";
    $string =~ /(?=.*fred (\w+))?(?=.*barney (\w+))?(?=.*joe (\w+))?/;
    $company = $1 . '_' . $2 . '_' . $3 . '_' . 'inc';
    print "$company\n";    # "flintstone__rockhead_inc"
      It doesn't fail if one or more names are missing...

      But isn't this a bug rather than a feature? Is there any point to matching on a string that contains none of the target substrings (and then interpolates a bunch of undefined values)?

      Also, as pointed out by other respondents,  (?=.*fred (\w+)) will match  'alfred the great' (capturing  'the') in addition to  'fred flintstone'.
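The false positive is easy to reproduce. The following is a minimal sketch (the sample string and the variable names $loose and $bounded are only illustrative) comparing the lookahead with and without a leading \b:

```perl
use strict;
use warnings;

my $text = 'alfred the great';

# Without a word boundary, the "fred" inside "alfred" satisfies the lookahead.
my ($loose) = $text =~ /(?=.*fred (\w+))/;

# With \b, "fred" must begin at a word boundary, so "alfred" no longer counts.
my ($bounded) = $text =~ /(?=.*\bfred (\w+))/;

print "loose:   ", defined $loose   ? $loose   : '(no match)', "\n";
print "bounded: ", defined $bounded ? $bounded : '(no match)', "\n";
```

The first line should report "the" and the second "(no match)".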

        Hmmmm....good point. Well, you can fix part of it with (?=.*\bfred\b (\w+)), but avoiding matching none of the targets is still a problem. The way I suggested will always succeed because each "?" says find "zero or one of this". That means either you can't easily tell how many non-empty matches you got, or (without the question marks) you can only match when all the targets are present. Separate matches in a loop, as suggested by several others, are the way to go. Here's my take, redux:

        $string = "This is bLarney rubble and his friends joe rockhead and fred flintstone";
        $count = 0;
        for $target (qw(fred barney joe)) {
            if ( $string =~ /(?=.*\b$target (\w+))/i ) {
                push @elements, $1;
                $count++;
            }
            else {
                push @elements, '';    # as a placeholder
            }
        }
        if ($count >= 2) { print join('_', @elements), "_inc\n" }
        else             { print "Didn't find at least 2 elements in the string\n" }

        # prints flintstone__rockhead_inc
        # change 'joe' to 'moe' and you get > Didn't find at least 2 elements in the string

        There ought to be something useful in there. :-)

        --marmot