in reply to Word Exclusion Regex (was Re: regex problem)
in thread regex problem

As impressive as this is (and I haven't got it entirely figured out yet), there are a couple of bugs. $first contains extra spaces when the group includes words that start with different letters... localize $" or just do a boring join to fix that. Also, words with multiple occurrences of the first letter ('aabc' instead of 'abc') get excluded even when they shouldn't.
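
A minimal sketch of the first bug and both suggested fixes. The array name @letters is a hypothetical stand-in, since the original function is not reproduced in this post; the point is only that interpolating an array into a double-quoted string joins its elements with $", which defaults to a single space.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the list of first letters (not the original code).
my @letters = qw(p c d);

# Interpolating an array in a double-quoted string joins it with $"
# (a single space by default), which is where the stray spaces come from.
my $spaced = "[^@letters]";

# Fix 1: a plain join with an empty separator.
my $joined = '[^' . join('', @letters) . ']';

# Fix 2: localize $" to the empty string just for the interpolation.
my $localized = do { local $" = ''; "[^@letters]" };

print "$spaced\n$joined\n$localized\n";
```

The join version is probably the safer of the two, since it does not depend on the state of a global variable.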

The following output shows several incorrect cases using an exclude list of qw(dog cat pig):

(?-xism:^[^p c d]*(?:(?:p(?!ig)|c(?!at)|d(?!og))[^p c d]*)*$)
dog =>
cat =>
pig =>
owl => 1
ddog =>
ccat =>
ppig =>
pdog =>
pcat =>
elephant =>
ppppcatgggg =>

-Blake

Re: Re: Word Exclusion Regex (was Re: regex problem)
by japhy (Canon) on Feb 10, 2002 at 16:19 UTC
    Oops, the original version used join() when creating $first. I don't know why I changed it. As for the other complaint, the regex is designed to ensure the words don't appear at all. If you only wanted a regex that rejects a string that is exactly one of a set of words, it would look much simpler: /^(?!(?:cat|dog|pig)$)/. That's not what I was going for.
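
    To illustrate the difference between the two goals, here is a small sketch that runs the simpler pattern from this reply next to the unrolled pattern discussed in this thread. The variable names are made up for the example.

```perl
use strict;
use warnings;

# The simpler pattern: fail only when the whole string is a listed word.
my $whole_word = qr/^(?!(?:cat|dog|pig)$)/;

# The unrolled pattern from this thread: fail when a listed word
# appears anywhere in the string.
my $anywhere = qr/^[^pcd]*(?:(?:p(?!ig)|c(?!at)|d(?!og))[^pcd]*)*$/;

for my $word (qw(cat ccat owl)) {
    printf "%-5s whole_word=%d anywhere=%d\n",
        $word,
        ($word =~ $whole_word ? 1 : 0),
        ($word =~ $anywhere   ? 1 : 0);
}
```

    'ccat' is the interesting case: it is not one of the listed words, so the simpler pattern accepts it, but it contains 'cat', so the unrolled pattern rejects it.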

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      whoops... my bad. must have munged the regex myself somehow...

      Pardon my conceit, as I don't mean to contradict this captivating regex, but I purport that it's still not correct.... all of which happen to get incorrectly excluded for exclude('dog','cat','pig'): ;-P

      (?-xism:^[^pcd]*(?:(?:p(?!ig)|c(?!at)|d(?!og)))*[^pcd]*$)
      dog =>
      cat =>
      pig =>
      owl => 1
      conceit =>
      contradict =>
      captivating =>
      purport =>
      correct =>

      -Blake
      p.s. List obtained using:

      $ perl -lne 'print if /^[dpc].*[dpc]/ && !/dog|cat|pig/' /usr/dict/words
        Oh, and since you mentioned some intrigue as to the function of the regex, here's what it does:
        1. It matches as many letters as it can that don't start one of the forbidden words.
        2. Then it matches one of those letters, so long as it isn't followed by the rest of the word.
        3. Then it matches as many non-bad letters as it can.
        4. Go to step 2 if you can.
        Friedl would call this "unrolling the loop".
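
        The steps above can be sketched as a small builder. This is not the original function from the thread: the name exclude_regex and its internals are assumptions made for illustration, and it expects at least one non-empty word.

```perl
use strict;
use warnings;

# Hypothetical helper (not the original code from this thread): build an
# "unrolled loop" regex that matches only strings containing none of the words.
sub exclude_regex {
    my @words = @_;

    # Group the remainder of each word under its first character.
    my %rest_by_first;
    for my $word (@words) {
        my ($first, $rest) = $word =~ /\A(.)(.*)\z/s or next;  # skip empty strings
        push @{ $rest_by_first{$first} }, quotemeta $rest;
    }
    die "need at least one non-empty word\n" unless %rest_by_first;

    my @firsts = sort keys %rest_by_first;

    # Join with an empty separator so no stray spaces end up in the class.
    my $class = join '', map { quotemeta($_) } @firsts;

    # Step 2: a first character is allowed only when it is not followed by
    # the rest of any forbidden word that starts with it.
    my $special = join '|', map {
        quotemeta($_) . '(?!' . join('|', @{ $rest_by_first{$_} }) . ')'
    } @firsts;

    # Steps 1, 3 and 4: normal* (?: special normal* )*
    return qr/^[^$class]*(?:(?:$special)[^$class]*)*$/;
}

my $re = exclude_regex(qw(dog cat pig));
for my $word (qw(dog owl ddog conceit ppppcatgggg)) {
    print "$word => ", ($word =~ $re ? 1 : ''), "\n";
}
```

        Words sharing a first letter end up in a single lookahead, for example c(?!at|ow) for 'cat' and 'cow', so each special letter still gets exactly one alternative.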

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Um, you're using the wrong regex. My function returns

        ^[^pcd]*(?:(?:p(?!ig)|c(?!at)|d(?!og))[^pcd]*)*$

        whereas you are using

        ^[^pcd]*(?:(?:p(?!ig)|c(?!at)|d(?!og)))*[^pcd]*$

        The [^pcd]* got moved outside the (?:...) somehow. When I use the right regex, I get the right results.
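
        A quick side-by-side check of the two patterns quoted in this reply, as a sketch; the variable names $right and $wrong are just labels for the example.

```perl
use strict;
use warnings;

# Both patterns are copied from the reply above.
my $right = qr/^[^pcd]*(?:(?:p(?!ig)|c(?!at)|d(?!og))[^pcd]*)*$/;
my $wrong = qr/^[^pcd]*(?:(?:p(?!ig)|c(?!at)|d(?!og)))*[^pcd]*$/;

for my $word (qw(dog owl conceit purport)) {
    printf "%-8s right=%d wrong=%d\n", $word,
        ($word =~ $right ? 1 : 0),
        ($word =~ $wrong ? 1 : 0);
}
```

        With the misplaced [^pcd]*, ordinary letters can only appear in one run before and one run after the special letters, so a word like 'conceit' fails as soon as a second c turns up after 'on'.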

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        I don't know where you got that regex from, but it's certainly not from japhy's code. The regex his code produces is: (?-xism:^[^pcd]*(?:(?:p(?!ig)|c(?!at)|d(?!og))[^p c d]*)*$) (That's also the regex you used in your earlier response.) The regex you used in your latest response has a parenthesis in the wrong place and is missing a quantifier; of course it doesn't work!