Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I was working on a regular expression today to extract phrases like "not-too-shabby" and "never-before-seen", that is, three words separated by hyphens.

The regex isn't really the question here; it's "how best to extract multiple instances of this pattern from one line?"

I'm using something like this:

$teststring = 'blah not-so-good blah not-too-shabby ';
while($teststring =~ m/^.* ([a-z]+-[a-z]+-[a-z]+) .*$/i){
    ($x = $teststring) =~ s/(^.* )([a-z]+-[a-z]+-[a-z]+)( .*$)/\2/gi;
    $teststring =~ s/(^.* )([a-z]+-[a-z]+-[a-z]+)( .*$)/\1 phrase \3/gi;
    print "$x\n";
}

because I don't know if there's a smarter way.

Obviously that print line would be used to push them into an array in a real life setting.

I have to keep going with a WHILE, because there may be more than one instance in a line, but surely there's a smarter way to do it than check there's a match, then extract it with one regex, then remove it from the test string with another?

Replies are listed 'Best First'.
Re: Regular Expression To Extract Multiple Matches Pattern
by busunsl (Vicar) on Jan 07, 2002 at 16:05 UTC
    Use the g modifier in the while loop to iterate over the string:
    $teststring = 'blah not-so-good blah not-too-shabby ';
    while ($teststring =~ /([a-z]+-[a-z]+-[a-z]+)/gi) {
        print "$1\n";
    }
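Since the question mentions pushing the matches into an array, here is a minimal sketch of the same /g idea used in list context, where the match returns all captures at once. The helper name extract_phrases is hypothetical (it is not from the thread), and the \b anchors are borrowed from the other replies.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every three-word
# hyphenated phrase found in a string.
sub extract_phrases {
    my ($text) = @_;
    # In list context, m//g returns the contents of capture group 1
    # for every match, so no explicit loop or push is needed.
    my @phrases = $text =~ /\b([a-z]+-[a-z]+-[a-z]+)\b/gi;
    return @phrases;
}

my $teststring = 'blah not-so-good blah not-too-shabby ';
print "$_\n" for extract_phrases($teststring);
```

With the test string from the question this should print not-so-good and not-too-shabby, one per line. The while loop is still the better fit when each match needs its own processing as it is found.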
      I must be missing something: why are we using the character class [a-z] in place of \w? Using \w would make the resulting code a lot more readable, e.g.:

      while ($teststring =~ /\b(\w+-\w+-\w+)\b/gi) { print "$1\n"; }

      Also, the boundary markers \b suggested in Kanji's reply have merit and, I think, warrant inclusion.

      Update

      As busunsl rightly points out, \w also matches the underscore character, which was not asked for ... [\w[^_]] anyone? :-)


      perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'

        Perhaps because \w includes the underscore and that was not asked for.
        [\w[^_]]
        Nested character classes aren't implemented yet... That will parse something like this:
        [      # start char class
          \w   # any word char
          [    # or a literal '['
          ^    # or a literal '^'
          _    # or an underscore (redundant...)
        ]      # end char class
        ]      # followed by a literal ']'
        If you want a character class consisting of all the word chars except underscore, you need to use the double negative (and somewhat non-intuitive):
        [^\W_]
        This matches a character that is not a non-word char (i.e. a word char) and not an underscore.
        % perl -le '/[^\W_]/ && print for qw(a b _ c d)'
        a
        b
        c
        d

        -Blake
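To see the difference between the two classes discussed above, here is a small sketch that tries each one against a few single tokens. The helper names and the sample strings are assumptions made for illustration, not part of the thread.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 if the whole string matches.

# [\w[^_]] is parsed as the class [\w[^_] followed by a literal ']',
# so it needs two characters: one from the class, then ']'.
sub matches_nested  { return $_[0] =~ /\A[\w[^_]]\z/ ? 1 : 0 }

# [^\W_] is one character that is a word char and not an underscore.
sub matches_negated { return $_[0] =~ /\A[^\W_]\z/ ? 1 : 0 }

for my $s ('a', '_', '5', 'a]', '_]') {
    printf "%-5s nested=%d negated=%d\n",
        "'$s'", matches_nested($s), matches_negated($s);
}
```

The "nested" form only succeeds on strings such as 'a]' or '_]', which is clearly not the intent, while the double negative accepts 'a' and '5' and rejects '_'.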

Re: Regular Expression To Extract Multiple Matches Pattern
by Kanji (Parson) on Jan 07, 2002 at 16:06 UTC

    If you're replacing the string you're searching for, there's no need to use m/// at all ...

    while ( $teststring =~ s/\b([a-z]+-[a-z]+-[a-z]+)\b/phrase/i ) { print $1; }

        --k.
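If both the extracted phrases and the rewritten string are wanted, one possible variation (a swapped-in technique, not what Kanji posted) is a single s///ge pass whose replacement code records each match before returning the placeholder. The helper name extract_and_replace and its placeholder parameter are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each three-word hyphenated phrase with a
# placeholder and return the rewritten string followed by the phrases.
sub extract_and_replace {
    my ($text, $placeholder) = @_;
    my @found;
    # /e evaluates the replacement as Perl code: remember the match,
    # then yield the placeholder as the replacement text.
    # /g handles every occurrence in one pass over the string.
    $text =~ s{\b([a-z]+-[a-z]+-[a-z]+)\b}{ push @found, $1; $placeholder }gie;
    return ($text, @found);
}

my ($rewritten, @phrases) =
    extract_and_replace('blah not-so-good blah not-too-shabby ', 'phrase');
print "$_\n" for @phrases;
print "$rewritten\n";
```

Because /g does not rescan text it has already replaced, this form does not depend on the placeholder failing to match the pattern, which the while loop with a non-global s/// does.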


      Original Post:

      I was working on a regular expression today to extract phrases like "not-too-shabby" and "never-before-seen", that is, three words separated by hyphens

      The original post does talk about 'extracting' not 'replacing'. And something like print $1; always raises a red flag for me -- you usually want print "$1\n"; instead, otherwise the doubly hyphenated words that the script has found will run into each other -- probably not what you want.

      --t. alex

      "Excellent. Release the hounds." -- Monty Burns.

        Further on down in the post, Cody says ...

        ...but surely there's a smarter way to do it than check there's a match, then extract it with one regex, then remove it from the test string with another?

        Perhaps I'm reading the bolded part of that wrong, but it sounds like replacing to me; a notion reinforced by the code snippet given.

        The $1 was an oversight, though: an artifact of using -l in my shebang. ;)

            --k.


      And certainly those parts you left out from s/(^.* )([a-z]+-[a-z]+-[a-z]+)( .*$)/\1 phrase \3/gi; look hideously ugly.