Regular Expression To Extract Multiple Matches Pattern

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I was working on a regular expression today to extract phrases like "not-too-shabby" and "never-before-seen", that is, three words separated by hyphens.

The regex isn't really the question here; it's "how best to extract multiple instances of this pattern from one line?"

I'm using something like this:

$teststring = 'blah not-so-good blah not-too-shabby ';
while($teststring =~ m/^.* ([a-z]+-[a-z]+-[a-z]+) .*$/i){
  ($x = $teststring) =~ s/(^.* )([a-z]+-[a-z]+-[a-z]+)( .*$)/\2/gi;
  $teststring =~ s/(^.* )([a-z]+-[a-z]+-[a-z]+)( .*$)/\1 phrase \3/gi;
  print "$x\n";
}

because I don't know if there's a smarter way.

Obviously that print line would be used to push them into an array in a real life setting.

I have to keep going with a WHILE, because there may be more than one instance in a line, but surely there's a smarter way to do it than check there's a match, then extract it with one regex, then remove it from the test string with another?


Replies are listed 'Best First'.
Re: Regular Expression To Extract Multiple Matches Pattern by busunsl (Vicar) on Jan 07, 2002 at 16:05 UTC
Use the g modifier in the while loop to iterate over the string: `$teststring = 'blah not-so-good blah not-too-shabby '; while ($teststring =~ /([a-z]+-[a-z]+-[a-z]+)/gi) { print "$1\n"; }`
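Since the original question mentions pushing the matches into an array, here is a minimal sketch of a related idiom: in list context, a match with the g modifier returns every capture at once, so no loop is needed. The sub name `extract_phrases` is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every
# word-word-word phrase found in a string.
sub extract_phrases {
    my ($text) = @_;
    # In list context, /g returns all values captured by the one group.
    my @phrases = $text =~ /([a-z]+-[a-z]+-[a-z]+)/gi;
    return @phrases;
}

my @found = extract_phrases('blah not-so-good blah not-too-shabby ');
print "$_\n" for @found;
```

The while loop is still the better fit when each match needs some work done on it as it is found; the list form is just shorter when all you want is the array.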
Re: Re: Regular Expression To Extract Multiple Matches Pattern by rob_au (Abbot) on Jan 07, 2002 at 16:18 UTC
I must be missing something. Why are we using the character class `[a-z]` in place of `\w`? Using `\w` would make the resulting code a lot more readable, e.g. `while ($teststring =~ /\b(\w+-\w+-\w+)\b/gi) { print "$1\n"; }` Also, the boundary markers `\b`, as suggested in the reply by Kanji, have merit and I think warrant inclusion. Update: As busunsl rightly points out, `\w` also matches the underscore character, which has not been specified for inclusion ... `[\w[^_]]` anyone? :-) `perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'`
Re: Re: Re: Regular Expression To Extract Multiple Matches Pattern by busunsl (Vicar) on Jan 07, 2002 at 16:20 UTC
Perhaps because \w includes the underscore and that was not asked for.
Re: Re: Re: Regular Expression To Extract Multiple Matches Pattern by blakem (Monsignor) on Jan 08, 2002 at 00:23 UTC
`[\w[^_]]`
Nested character classes aren't implemented yet... That will parse something like this:

    [     # start char class
      \w  # any word char
      [   # or a literal '['
      ^   # or a literal '^'
      _   # or an underscore (redundant...)
    ]     # end char class
    ]     # followed by a literal ']'

If you want a character class consisting of all the word chars except underscore, you need to use the double negative (and somewhat non-intuitive) `[^\W_]`, which matches a character that is not a non-word char (i.e. a word char) and not an underscore.

    % perl -le '/[^\W_]/ && print for qw(a b _ c d)'
    a
    b
    c
    d

-Blake
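To tie this back to the original three-word pattern, here is a rough sketch comparing `\w` with `[^\W_]`. The sub names and the sample string are invented for illustration. Note that `[^\W_]` still accepts digits, which the original `[a-z]` with the i modifier does not, so it is not an exact replacement. The last few lines check the parse described above, assuming a Perl where nested classes inside a plain bracketed class are still not supported.

```perl
use strict;
use warnings;

# Hypothetical helper: three hyphen-separated "words", where a word
# is one or more word characters excluding the underscore.
sub extract_strict {
    my ($text) = @_;
    my @phrases = $text =~ /\b([^\W_]+-[^\W_]+-[^\W_]+)\b/g;
    return @phrases;
}

# Same shape with plain \w, which also accepts underscores.
sub extract_loose {
    my ($text) = @_;
    my @phrases = $text =~ /\b(\w+-\w+-\w+)\b/g;
    return @phrases;
}

my $text = 'snake_case-is-here but not-too-shabby is fine';
print "strict: $_\n" for extract_strict($text);
print "loose:  $_\n" for extract_loose($text);

# [\w[^_]] read as: a class of \w, '[', '^' and '_', then a literal ']'.
my $odd = qr/^[\w[^_]]$/;
print "'_]' matches\n"       if     '_]' =~ $odd;
print "'a' does not match\n" unless 'a'  =~ $odd;
```

The strict version skips `snake_case-is-here` entirely because `\b` does not see a boundary between a letter and an underscore, so the match cannot start in the middle of `snake_case`.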
Re: Regular Expression To Extract Multiple Matches Pattern by Kanji (Parson) on Jan 07, 2002 at 16:06 UTC
If you're replacing the string you're searching for, there's no need to use `m///` at all ... `while ( $teststring =~ s/\b([a-z]+-[a-z]+-[a-z]+)\b/phrase/i ) { print $1; }` --k.
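If both the extraction and the replacement are wanted, one possible variation on this is a single pass with the g and e modifiers, collecting each match as it is replaced. This is only a sketch: the sub name `extract_and_replace` and its placeholder argument are assumptions for illustration, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each phrase with a placeholder and
# return the rewritten text plus a reference to the removed phrases.
sub extract_and_replace {
    my ($text, $placeholder) = @_;
    my @found;
    # /e runs the replacement as code; its last value is substituted.
    # /g visits each match once and never rescans the replacement.
    $text =~ s/\b([a-z]+-[a-z]+-[a-z]+)\b/push @found, $1; $placeholder/gie;
    return ($text, \@found);
}

my ($rewritten, $found) =
    extract_and_replace('blah not-so-good blah not-too-shabby ', 'phrase');
print "$_\n" for @$found;
print "$rewritten\n";
```

One difference from the while loop around `s///`: the loop restarts the search from the beginning of the string each time, so it would never terminate if the replacement text itself matched the pattern. The single pass with g does not have that problem.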
Re: Re: Regular Expression To Extract Multiple Matches Pattern by talexb (Chancellor) on Jan 07, 2002 at 19:37 UTC
Original Post: I was working on a regular expression today to extract phrases like "not-too-shabby" and "never-before-seen", that is, three words separated by hyphens. The original post does talk about 'extracting', not 'replacing'. And something like `print $1;` always raises a red flag for me: you usually want `print "$1\n";` instead, otherwise the doubly hyphenated words that the script has found will run into each other, which is probably not what you want. --t. alex "Excellent. Release the hounds." -- Monty Burns.
Re: Re: Re: Regular Expression To Extract Multiple Matches Pattern by Kanji (Parson) on Jan 07, 2002 at 20:15 UTC
Further on down in the post, Cody says ... "...but surely there's a smarter way to do it than check there's a match, then extract it with one regex, then remove it from the test string with another?" Perhaps I'm reading the bolded part of that wrong, but it sounds like replacing to me; a notion reinforced by the code snippet given. The $1 was an oversight, though: an artifact of using `-l` in my shebang. `;)` --k.
Re: Re: Re: Re: Regular Expression To Extract Multiple Matches Pattern by Cody Pendant (Prior) on Jan 08, 2002 at 01:02 UTC
Re^2: Regular Expression To Extract Multiple Matches Pattern by Aristotle (Chancellor) on Jan 07, 2002 at 17:49 UTC
And certainly those parts you left out from `s/(^.* )([a-z]+-[a-z]+-[a-z]+)( .*$)/\1 phrase \3/gi;` look hideously ugly.