nafion112 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All!

I'm trying to use a regex to test if a string matches one of a given set of other strings. I'm having trouble finding a regex method that doesn't have some holes in it. For example:

I want to know if $var contains either abc, def, or ghi. Nothing but those 3 strings should match. I've tried $var =~ /abc|def|ghi/ but a string such as abdefhi is a false positive. I've also tried /(abc|def|ghi)/ and /(abc)|(def)|(ghi)/ but the aforementioned abdefhi matches all of those.

Does anyone have a solution to this? Should I even be using a regex to test something like this in the first place?

Thanks everyone!

Re: regex matching specific strings.
by kennethk (Abbot) on Jul 22, 2009 at 22:47 UTC
    You could use the anchors ^ and $ (Regular Expressions) to force the pattern to match at the beginning and the end of the string, respectively. So your code could look something like:

    $var =~ /^(?:(?:abc)|(?:def)|(?:ghi))$/;

    Regarding skipping regular expressions entirely, since you are testing literal equality, it may make more sense from a maintenance perspective to just test equality with a set of or clauses (perlop):

    if ($var eq 'abc' or $var eq 'def' or $var eq 'ghi') { ..code..}
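    If the list of accepted strings grows, a hash lookup is one alternative to the chain of eq tests; it has the same exact-match semantics. This is only a sketch: the names %allowed and is_allowed are made up for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# Hypothetical lookup table holding the only strings that should be accepted.
my %allowed = map { $_ => 1 } qw(abc def ghi);

# Returns 1 only when $var is exactly one of the allowed strings, else 0.
sub is_allowed {
    my ($var) = @_;
    return defined $var && exists $allowed{$var} ? 1 : 0;
}

# "abdefhi" and "abcd" merely contain or extend an allowed string,
# so the exact lookup rejects them.
for my $candidate (qw(abc def ghi abdefhi abcd)) {
    printf "%-8s %s\n", $candidate, is_allowed($candidate) ? 'match' : 'no match';
}
```

    Like eq, the lookup compares whole strings, so a trailing newline in $var would make it fail; chomp the input first if that matters.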

    Update: Fixed code, as per ikegami's post below

      $var =~ /^(?:abc)|(?:def)|(?:ghi)$/; is wrong. It matches strings that

      • start with "abc",
      • contain "def",
      • end with "ghi", or
      • end with "ghi\n"

      You want

      $var =~ /^(?:abc|def|ghi)\z/;
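      To see the difference, the two patterns can be run against a few sample strings. The variable names $loose and $strict and the test strings are illustrative choices, not part of the original post.

```perl
use strict;
use warnings;

# Anchors bind to single alternatives: ^abc, def (anywhere), ghi$.
my $loose  = qr/^(?:abc)|(?:def)|(?:ghi)$/;
# Anchors apply to the whole group of alternatives.
my $strict = qr/^(?:abc|def|ghi)\z/;

for my $s ("abc", "abcxyz", "xdefx", "xghi", "xghi\n", "abdefhi") {
    (my $shown = $s) =~ s/\n/\\n/g;    # make the newline visible when printing
    printf "%-10s loose=%d strict=%d\n",
        $shown, ($s =~ $loose ? 1 : 0), ($s =~ $strict ? 1 : 0);
}
```

      Every sample matches the loose pattern, while only "abc" matches the strict one.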

        My only tiny quibble with that regex, correct as it is, is that using ^ in combination with \z might be confusing to a future reader. I'd suggest:

        $var =~ /\A(?:abc|def|ghi)\z/;

        This is technically exactly the same as ikegami's suggestion (without the /m modifier, ^ and \A both match only at the start of the string), but it points out to the reader "Hey I'm using less-common techniques in this regex" right at the beginning with that \A.
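        The two anchors only diverge under the /m modifier, which none of the regexes in this thread use. A small sketch of that edge case follows; the sample string and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: the wanted string sits on the second line.
my $text = "xyz\nabc";

my $caret_plain = qr/^(?:abc|def|ghi)\z/;     # ^ without /m: start of string only
my $A_plain     = qr/\A(?:abc|def|ghi)\z/;    # \A: start of string only
my $caret_multi = qr/^(?:abc|def|ghi)\z/m;    # ^ with /m: also after a newline
my $A_multi     = qr/\A(?:abc|def|ghi)\z/m;   # \A ignores /m

for my $case (
    [ '^  without /m', $caret_plain ],
    [ '\A without /m', $A_plain     ],
    [ '^  with /m',    $caret_multi ],
    [ '\A with /m',    $A_multi     ],
) {
    my ($label, $regex) = @$case;
    printf "%-14s %s\n", $label, $text =~ $regex ? 'match' : 'no match';
}
```

        Only the ^ pattern with /m matches here, which is why \A is the safer way to say "start of string" if someone later adds /m.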

Re: regex matching specific strings.
by ELISHEVA (Prior) on Jul 23, 2009 at 05:41 UTC
    I've tried $var =~ /abc|def|ghi/ but a string such as abdefhi is a false positive. I've also tried /(abc|def|ghi)/ and /(abc)|(def)|(ghi)/ but the aforementioned abdefhi matches all of those.

    /abc|def|ghi/ and /(abc)|(def)|(ghi)/ match "abdefhi" because they aren't anchored. They can match "def" anywhere in the string, including in the middle. To force Perl to match "abc", "def", or "ghi" against the whole string, one must anchor the regular expression with "^" and "$" (or "\z"). "^" matches at the very beginning of the string. "\z" matches only at the very end of the string. "$" matches at the end of the string or just before a newline that is the last character of the string.
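    The difference between the two end anchors only shows up when the string ends in a newline. A quick sketch, with variable names and sample strings chosen for illustration:

```perl
use strict;
use warnings;

my $with_dollar = qr/^(?:abc|def|ghi)$/;    # tolerates one trailing newline
my $with_z      = qr/^(?:abc|def|ghi)\z/;   # tolerates nothing after the match

for my $s ("abc", "abc\n", "abc\n\n", "abc\ndef") {
    (my $shown = $s) =~ s/\n/\\n/g;         # make newlines visible when printing
    printf "%-10s \$=%d \\z=%d\n",
        $shown, ($s =~ $with_dollar ? 1 : 0), ($s =~ $with_z ? 1 : 0);
}
```

    "abc" matches both patterns, "abc\n" matches only the "$" version, and the last two strings match neither. If $var comes from unchomped input, that choice decides whether the line is accepted.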

    To add "^" and "$" you must surround "abc|def|ghi" with parentheses. Either capturing (...) or non-capturing (?:...) may be used. Otherwise Perl will apply "^" only to the first alternative and the end anchor only to the last. For example, in $var =~ /^abc|def|ghi\z/; Perl looks for one of three alternatives: "abc" at the beginning of the string, "def" anywhere in the string, or "ghi" at the end of the string. By contrast, /^(abc|def|ghi)\z/ and /^(?:abc|def|ghi)\z/ (see post by ikegami) match only when the entire string is exactly "abc", "def", or "ghi".

    In this case non-capturing parentheses are the better choice. Capturing parentheses store whatever they match in a variable ($1 for the first group). But here, if the regex matches at all, it matches the whole string, so you already have it in a variable.
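    A short sketch of what each kind of group leaves behind after a successful match. The helper name captured_by and its placeholder return string are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: reports what ended up in $1 after a successful match.
sub captured_by {
    my ($regex, $string) = @_;
    return 'no match' unless $string =~ $regex;
    return defined $1 ? $1 : '(nothing captured)';
}

my $var = 'def';

# Capturing group: the matched alternative is copied into $1.
print 'capturing:     ', captured_by(qr/^(abc|def|ghi)\z/, $var), "\n";

# Non-capturing group: same match result, but $1 stays undefined.
print 'non-capturing: ', captured_by(qr/^(?:abc|def|ghi)\z/, $var), "\n";
```

    Both patterns accept the same strings; the only difference is whether Perl spends effort filling in $1, which here would just duplicate $var.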

    Hope this explains why the regexes given by kennethk and ikegami do work.

    Best, beth

    Update - 2009-07-27 - struck out the portion below as incorrect or no longer applicable: /abc|def|ghi/ matches "abc" or "def" or "ghi" anywhere in the string.

    "|" only defines alternatives between adjacent regex components, so /abc|def|ghi/ and /(abc|def|ghi)/ both mean match "ab" followed by either c or d followed by "e", followed by either f or g followed by "hi". To get "|" to treat "abc", "def", and "ghi" as alternative whole strings you must surround each string "abc", "def", "ghi" with non-capturing parentheses. Non-capturing parentheses are spelled (?:regex). They tell Perl - treat this sequence of letters as a single regular expression.

    You can also surround "abc", "def", "ghi" with plain parentheses. Plain parentheses also group sequences of letters into a single regular expression, but they also "capture" the match and stuff it into a variable.

    This is wasteful unless you need to stuff the match into a variable. Even if you do need to stuff the match into a variable, it probably won't do what you expect. Perl will treat each match as a separate variable and populate $1 with "abc" if $var contains "abc" and undef if it doesn't. To stuff whichever of the three happens to match into $1, one needs to surround the whole set of alternatives with capturing parentheses, like this: ((?:abc)|(?:def)|(?:ghi)).