Regular expression matching when it shouldn't

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular expression matching when it shouldn't by OeufMayo (Curate) on Aug 15, 2001 at 13:30 UTC
Your regular expression matches because a matching substring really is present inside the string (the bracketed text is what was matched): "`[someRandomName].`" and "`[someRandomName]-`". What you want is to anchor the regular expression. Try adding a '^' at the beginning and a '$' at the end. The regex will then have to match the whole string, not just a part of it: `/^[A-Z0-9]+[A-Z0-9-]+[A-Z0-9]{1}$/i` Hope this helps. See perlre for more information about this.

`#!/usr/bin/perl -w
use strict;
my $domain = $ARGV[0];
print "Got $domain\n";
if ($domain =~ /^[A-Z0-9]+[A-Z0-9-]+[A-Z0-9]$/i) {
    print "$domain matched\n\n";
}
else {
    print "$domain did not match\n\n";
}
__END__
Got someRandomName.
someRandomName. did not match
Got someRandomName-
someRandomName- did not match
Got someRandomName
someRandomName matched
Got someRandomName-X
someRandomName-X matched
Got someRandomName-Foo
someRandomName-Foo matched`

-- my $OeufMayo = new PerlMonger::Paris({http => 'paris.mongueurs.net'});
Re: Regular expression matching when it shouldn't by tachyon (Chancellor) on Aug 15, 2001 at 13:50 UTC
As you have already got the answer, I'll just mention that the {1} does nothing: a character class matches exactly one character by default. If you just want to grab, say, the first bit of the domain foo.com (i.e. the bit before the first '.'), then a simple regex like this will do the trick:

`my $domain = 'foo.com.au';
my ($first_bit) = $domain =~ m/([^.]+)/;
print $first_bit;`

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
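To illustrate both points, here is a minimal sketch. The variable names, the `first_label` helper, and the short test inputs `ab` and `a-b` are made up for illustration; they are not from the original question. It checks that the pattern with `{1}` and the pattern without it agree on every input, and that a match in list context returns the text before the first dot.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical names: the two patterns differ only in the trailing {1}.
my $with_quantifier    = qr/^[A-Z0-9]+[A-Z0-9-]+[A-Z0-9]{1}$/i;
my $without_quantifier = qr/^[A-Z0-9]+[A-Z0-9-]+[A-Z0-9]$/i;

# 'ab' and 'a-b' are illustrative inputs, not from the thread.
for my $input ('someRandomName', 'someRandomName-', 'ab', 'a-b') {
    my $first  = $input =~ $with_quantifier    ? 1 : 0;
    my $second = $input =~ $without_quantifier ? 1 : 0;
    printf "%-16s with {1}: %d  without: %d\n", $input, $first, $second;
}

# Returns the text before the first dot, or undef if nothing matches.
sub first_label {
    my ($domain) = @_;
    my ($first_bit) = $domain =~ m/([^.]+)/;    # list context yields the capture
    return $first_bit;
}

print first_label('foo.com.au'), "\n";
```

The parentheses around `$first_bit` matter: they put the match in list context, so the capture itself is assigned rather than a true/false result.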
Re: Regular expression matching when it shouldn't by Hofmator (Curate) on Aug 15, 2001 at 14:30 UTC
Maybe I'm not getting what you mean, but I think you are not understanding pattern matching correctly. You asked: "Can someone tell me why it is matching the . or the - at the end of the line?" In both cases, the last character is not matched by the regex!

`if ($domain =~ /[A-Z0-9]+[A-Z0-9-]+[A-Z0-9]/i) {
    # do A
}
else {
    # do B
}`

does A or B depending on whether the pattern matched the string $domain or not. It does not change $domain in any way, however. And your pattern does not have to match the whole string:

`# /[A-Z0-9]+[A-Z0-9-]+[A-Z0-9]/i matches
someDomain
someDomain..
..someDomain..
_!@#!@#!@#01a!@#$
# and so on`

If you are interested in what was matched, you have to use capturing parentheses like this:

`if ($domain =~ /([A-Z0-9]+[A-Z0-9-]+[A-Z0-9])/i) {
    # $1 contains what is in the first pair of parentheses
    print "$1 matched\n\n";
}
else {
    print "$domain did not match\n\n";
}`

Play around with this code and different patterns. Then add the assertions ^ and $ to the pattern, as mentioned by OeufMayo, and play some more.

-- Hofmator
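As a starting point for that experimenting, the sketch below prints the substring the unanchored pattern actually captured for each of the strings mentioned in this thread. The `matched_part` helper is a hypothetical name, and `...` is an extra input added to show a failing case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the substring the unanchored pattern matched, or undef if none.
sub matched_part {
    my ($string) = @_;
    return $string =~ /([A-Z0-9]+[A-Z0-9-]+[A-Z0-9])/i ? $1 : undef;
}

my @inputs = (
    'someDomain', 'someDomain..', '..someDomain..',
    '_!@#!@#!@#01a!@#$', 'someRandomName.', 'someRandomName-',
    '...',    # added here as an input that cannot match
);

for my $string (@inputs) {
    my $part = matched_part($string);
    if (defined $part) {
        printf "%-20s matched on %s\n", $string, $part;
    }
    else {
        printf "%-20s did not match\n", $string;
    }
}
```

In every matching case the captured text stops before the trailing `.` or `-`, which is why the test succeeds even though the whole string is not a valid name.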
Re: Regular expression matching when it shouldn't by dga (Hermit) on Aug 15, 2001 at 22:41 UTC
Your match isn't anchored. There are three character classes, so on someRandomName- it could successfully match like so: the first character class matches 's', the second matches 'o', and the third matches 'm', so you have a successful match.

Since REs are greedy, what's really happening is this: the first class matches the whole thing except the '-', the second matches the '-', and the third fails. Then the RE backtracks: the first matches everything but 'e-', the second matches 'e-', and the third fails. Then the first matches everything up to 'me-', the second matches 'me-', and the third fails; the second backtracks until it matches only 'm', then the third matches 'e', and SUCCESS!

If you want to match at the end of the string, you need to add a $ to the end; or, if you want to match words (my take on the question), you need to add \b to both ends, like so:

`if ($domain =~ /[A-Za-z0-9]+[A-Za-z0-9-]+[A-Za-z0-9]$/) # matches at end of text
if ($domain =~ /\b[A-Za-z0-9]+[A-Za-z0-9-]+[A-Za-z0-9]\b/) # matches only words separate from non-words`

Also, I got rid of the /i, which should be faster, but as has been noted in this thread you should use the \w-type escapes where possible instead of the spelled-out classes.

Your current code will also match my-do^%$%^@#@, which you don't seem to intend. Be aware that \b alone does not reject that string: a word boundary follows 'my-do', so the \b version still matches there. Rejecting it takes the ^ and $ anchors on both ends. In fact, your RE will match any run of three legal characters (legal according to your classes) anywhere in any text. Hope this helps.
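A small sketch for checking how each variant behaves on the inputs mentioned in this thread. The `%variant` labels and the `matched_by` helper are hypothetical names chosen for illustration. It prints the matched substring so you can see exactly where each pattern succeeds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern variants discussed in this thread; the keys are made-up labels.
my %variant = (
    unanchored => qr/[A-Za-z0-9]+[A-Za-z0-9-]+[A-Za-z0-9]/,
    end_only   => qr/[A-Za-z0-9]+[A-Za-z0-9-]+[A-Za-z0-9]$/,
    word_bound => qr/\b[A-Za-z0-9]+[A-Za-z0-9-]+[A-Za-z0-9]\b/,
    both_ends  => qr/^[A-Za-z0-9]+[A-Za-z0-9-]+[A-Za-z0-9]$/,
);

# Returns the substring matched by the named variant, or undef on failure.
sub matched_by {
    my ($name, $string) = @_;
    return $string =~ /($variant{$name})/ ? $1 : undef;
}

for my $string ('someRandomName-', 'someRandomName-X', 'my-do^%$%^@#@') {
    for my $name (sort keys %variant) {
        my $part   = matched_by($name, $string);
        my $result = defined $part ? "matched on $part" : 'no match';
        printf "%-10s %-18s %s\n", $name, $string, $result;
    }
}
```

On my reading, the `\b` variant still reports a match on `my-do` for the last input, because a word boundary follows the `o`. Of the four, only the variant anchored with both `^` and `$` rejects `someRandomName-` and `my-do^%$%^@#@` while still accepting `someRandomName-X`.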
Re: Regular expression matching when it shouldn't by FoxtrotUniform (Prior) on Aug 15, 2001 at 21:05 UTC
You have: `/[A-Z0-9]+[A-Z0-9]+[A-Z0-9]{1}/i` as your regex. I'm guessing that you want to match at least three alphanumeric characters. (I'd write this as `/[A-Z0-9]{3,}/i`, or `/\w{3,}/` if you don't mind matching _s as well.) Anyway, you're getting matches that you don't expect because m// matches substrings, not whole lines. So, if you give this regex a string with at least three consecutive alphanumerics, it'll match regardless of what else is in the string. If you want to match only alphanumerics, you want something like this:

`/^              # match the start of the line
 [A-Z0-9]{3,}   # match at least 3 alnums, beginning at the start of the line
 $              # match the end of the line -- we've only matched alnums
/xi;`

or, much more concisely: `/^[A-Z0-9]{3,}$/i;`

If you're going to be writing a lot of Perl, I strongly suggest that you learn regexes well. Mastering Regular Expressions is, IMO, the best book for the job; the Camel book also has some good stuff. You might find Steve Litt's tutorial useful as well.

`-- :wq`
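For comparison, here is a minimal sketch of the two anchored forms side by side. The helper names `is_alnum3` and `is_word3` and the inputs `abc`, `ab`, and `foo_bar` are invented for illustration. It shows where the `\w` version differs from the spelled-out class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if the whole string is three or more ASCII alphanumerics, else 0.
sub is_alnum3 {
    my ($string) = @_;
    return $string =~ /^[A-Z0-9]{3,}$/i ? 1 : 0;
}

# Same idea with \w, which also accepts underscores.
sub is_word3 {
    my ($string) = @_;
    return $string =~ /^\w{3,}$/ ? 1 : 0;
}

for my $string ('abc', 'ab', 'foo_bar', 'someRandomName.', 'someRandomName') {
    printf "%-16s alnum: %d  word: %d\n",
        $string, is_alnum3($string), is_word3($string);
}
```

One detail to keep in mind: `$` also matches just before a trailing newline, so a string read from input without `chomp` can still pass. Use `\z` instead of `$` if a trailing newline must be rejected.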