regex question

Deda has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am investigating multibyte character matching and came across this code:

#!/usr/bin/perl -w
$search = "\x8C\x95";
$text1 = "Text 1 \x90\x56\x8C\x95\x93\xB9";
$text2 = "Text 2 \x94\x92\x8C\x8C\x95\x61";

$encoding = q{ # Shift-JIS encoding
[\x00-\x7F] # ASCII/JIS-Roman
| [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC] # JIS X 0208:1997
| [\xA0-\xDF] # Half-width katakana
};
print "First attempt -- no anchoring\n";
print " Matched Text1\n" if $text1 =~ /$search/o;
print " Matched Text2\n" if $text2 =~ /$search/o;
print "Second attempt -- anchoring\n";
print " Matched Text1\n" if $text1 =~ /^ (?:$encoding)*? $search/osx;
print " Matched Text2\n" if $text2 =~ /^ (?:$encoding)*? $search/osx;

What I don't understand are lines like this one:

print " Matched Text1\n" if $text1 =~ /^ (?:$encoding)*? $search/osx;

Can anyone help me out and explain what they do, especially the (?:$encoding)*? part? 10x, Deda

Re: regex question by flounder99 (Friar) on Oct 01, 2003 at 11:55 UTC
YAPE::Regex::Explain is a handy tool to explain it for you.

use YAPE::Regex::Explain;
$search = "\x8C\x95";
$encoding = q{ # Shift-JIS encoding
[\x00-\x7F] # ASCII/JIS-Roman
| [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC] # JIS X 0208:1997
| [\xA0-\xDF] # Half-width katakana
};
$regex = qr/^ (?:$encoding)*? $search/osx;
print YAPE::Regex::Explain->new($regex)->explain;

Outputs:

The regular expression:

(?sx-im:^ (?: # Shift-JIS encoding
[\x00-\x7F] # ASCII/JIS-Roman
| [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC] # JIS X 0208:1997
| [\xA0-\xDF] # Half-width katakana
)*? \x8C\x95)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?sx-im:                 group, but do not capture (with . matching
                         \n) (disregarding whitespace and comments)
                         (case-sensitive) (with ^ and $ matching
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the least amount possible)):
----------------------------------------------------------------------
    [\x00-\x7F]              any character of: '\x00' to '\x7F'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [\x81-\x9F\xE0-          any character of: '\x81' to '\x9F',
    \xFC]                    '\xE0' to '\xFC'
----------------------------------------------------------------------
    [\x40-\x7E\x80-          any character of: '\x40' to '\x7E',
    \xFC]                    '\x80' to '\xFC'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [\xA0-\xDF]              any character of: '\xA0' to '\xDF'
----------------------------------------------------------------------
  )*?                      end of grouping
----------------------------------------------------------------------
  \x8C\x95                 '\x8C\x95'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

-- flounder
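To see why the anchored pattern treats the two sample strings differently, it can help to split each string into Shift-JIS characters with the same $encoding pattern. The sketch below is only an illustration: it assumes the three-alternative pattern from the original post describes the encoding well enough, and the helper names sjis_chars, matches_unanchored and matches_aligned are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shift-JIS character pattern, copied from the original post.
my $encoding = q{
    [\x00-\x7F]                                # ASCII/JIS-Roman (one byte)
  | [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]   # JIS X 0208:1997 (two bytes)
  | [\xA0-\xDF]                                # Half-width katakana (one byte)
};

my $search = "\x8C\x95";
my $text1  = "Text 1 \x90\x56\x8C\x95\x93\xB9";
my $text2  = "Text 2 \x94\x92\x8C\x8C\x95\x61";

# Hypothetical helper: split a byte string into Shift-JIS characters and
# return each one as a hex string such as "8c95".
sub sjis_chars {
    my ($bytes) = @_;
    my @chars = $bytes =~ /\G(?:$encoding)/gx;
    return map { unpack 'H*', $_ } @chars;
}

# Plain byte search: may match across a character boundary.
sub matches_unanchored {
    my ($bytes) = @_;
    return $bytes =~ /$search/ ? 1 : 0;
}

# Anchored search: consume whole characters from the start of the string,
# as few as needed, so $search can only match where a character begins.
sub matches_aligned {
    my ($bytes) = @_;
    return $bytes =~ /^ (?:$encoding)*? $search/x ? 1 : 0;
}

for my $text ($text1, $text2) {
    printf "%s | unanchored=%d aligned=%d\n",
        join(' ', sjis_chars($text)),
        matches_unanchored($text),
        matches_aligned($text);
}
```

The multibyte part of the second string splits into 9492 8c8c 9561, so the bytes 8C 95 only occur as the tail of one character followed by the head of the next. The unanchored search matches there anyway; the anchored one cannot, because (?:$encoding)*? only ever stops on a character boundary.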
Re: Re: regex question by Deda (Novice) on Oct 01, 2003 at 12:45 UTC
Thanks for pointing it out; I will surely use it on confusing regexes.
Re: regex question by tachyon (Chancellor) on Oct 01, 2003 at 11:32 UTC
Well, it is pretty simple. The /x lets you have whitespace and comments in your RE. (?:blah) means match 'blah' but don't capture into $1. This is useful if you want to group using () but not capture. In this case it is looking for * worth of `([\x00-\x7F]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]|[\xA0-\xDF])`, where the star means 0 or more, grabbing as many as possible. The ? after the * is a modifier that says don't grab as many as possible, grab as few as possible (non-greedy). The | is alternation, i.e. this|or|that|or|other. `[A-Z]` is a character class that will match the letters A-Z in that case.

The /s is totally redundant here: it lets the . regex metachar match anything (normally it matches everything except \n), but as there are no . metachars it does zip.

Finally, the /o is a promise to Perl that $encoding and $search will not change during the entire runtime of the script, so that Perl can interpolate these scalars, compile the RE and then never recompile it. If you change the interpolated scalars after the first time the RE runs, Perl will not notice with /o.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
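The point about /o not noticing later changes can be seen in a small experiment. This is a minimal sketch, not taken from the thread: the subroutine names are made up for illustration, and the qr// variant is included only as one common way to precompile a pattern without relying on /o.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many of @patterns match $target, with /o on the match.
sub count_with_o {
    my ($target, @patterns) = @_;
    my $hits = 0;
    for my $pat (@patterns) {
        # /o: $pat is interpolated and compiled on the first pass only;
        # later values of $pat are silently ignored.
        $hits++ if $target =~ /$pat/o;
    }
    return $hits;
}

# Same loop without /o: the pattern follows $pat on every pass.
sub count_without_o {
    my ($target, @patterns) = @_;
    my $hits = 0;
    for my $pat (@patterns) {
        $hits++ if $target =~ /$pat/;
    }
    return $hits;
}

# Same idea with qr//: each pattern is compiled explicitly, once.
sub count_with_qr {
    my ($target, @patterns) = @_;
    my @compiled = map { qr/$_/ } @patterns;
    return scalar grep { $target =~ $_ } @compiled;
}

my $with_o    = count_with_o('bar', 'foo', 'bar');
my $without_o = count_without_o('bar', 'foo', 'bar');
my $with_qr   = count_with_qr('bar', 'foo', 'bar');

print "with /o: $with_o, without /o: $without_o, with qr//: $with_qr\n";
```

With /o the match stays frozen on the first pattern ('foo'), so 'bar' is never found; the other two variants find it once. In the original script this is harmless because $encoding and $search never change.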
Re: Re: regex question by Deda (Novice) on Oct 01, 2003 at 11:39 UTC
10x m8. If I understand correctly, the ?: stands for the "don't capture" thing? That part was confusing me. You don't have to reply if this is so.
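For the "don't capture" reading of ?:, a side-by-side comparison may make it concrete. The following is a minimal sketch with made-up helper names and an arbitrary date string as sample input. Both patterns group \d+- so that {2} applies to the whole group; only the first one also stores what the group matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Capturing group: (\d+-) is quantified and also fills $1, so the
# trailing number ends up in $2.
sub groups_capturing {
    my ($s) = @_;
    my @groups = $s =~ /^(\d+-){2}(\d+)$/;
    return @groups;
}

# Non-capturing group: (?:\d+-) is quantified the same way but stores
# nothing, so the trailing number is $1.
sub groups_non_capturing {
    my ($s) = @_;
    my @groups = $s =~ /^(?:\d+-){2}(\d+)$/;
    return @groups;
}

my @with    = groups_capturing('2003-10-01');
my @without = groups_non_capturing('2003-10-01');

print "capturing:     @with\n";      # two groups: "10-" and "01"
print "non-capturing: @without\n";   # one group:  "01"
```

In the Shift-JIS pattern the group exists only so that *? can repeat the three alternatives as a unit, which is why (?:...) is the natural choice there.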

