PerlMonks  

regex question

by Deda (Novice)
on Oct 01, 2003 at 11:13 UTC ( [id://295568]=perlquestion )

Deda has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am investigating multibyte char matching, and came across this code:
    #!/usr/bin/perl -w

    $search = "\x8C\x95";
    $text1 = "Text 1 \x90\x56\x8C\x95\x93\xB9";
    $text2 = "Text 2 \x94\x92\x8C\x8C\x95\x61";

    $encoding = q{
                            # Shift-JIS encoding
        [\x00-\x7F]         # ASCII/JIS-Roman
      | [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]  # JIS X 0208:1997
      | [\xA0-\xDF]         # Half-width katakana
    };

    print "First attempt -- no anchoring\n";
    print " Matched Text1\n" if $text1 =~ /$search/o;
    print " Matched Text2\n" if $text2 =~ /$search/o;

    print "Second attempt -- anchoring\n";
    print " Matched Text1\n" if $text1 =~ /^ (?:$encoding)*? $search/osx;
    print " Matched Text2\n" if $text2 =~ /^ (?:$encoding)*? $search/osx;
What I don't understand are lines like this one:

    print " Matched Text1\n" if $text1 =~ /^ (?:$encoding)*? $search/osx;

Can anyone help me out and explain what they do, especially the (?:$encoding)*? part? 10x, Deda

Replies are listed 'Best First'.
Re: regex question
by flounder99 (Friar) on Oct 01, 2003 at 11:55 UTC
    YAPE::Regex::Explain is a handy tool to explain it for you.
        use YAPE::Regex::Explain;

        $search = "\x8C\x95";

        $encoding = q{
                                # Shift-JIS encoding
            [\x00-\x7F]         # ASCII/JIS-Roman
          | [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]  # JIS X 0208:1997
          | [\xA0-\xDF]         # Half-width katakana
        };

        $regex = qr/^ (?:$encoding)*? $search/osx;
        print YAPE::Regex::Explain->new($regex)->explain;
    Outputs:
        The regular expression:

        (?sx-im:^ (?:
                                # Shift-JIS encoding
            [\x00-\x7F]         # ASCII/JIS-Roman
          | [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]  # JIS X 0208:1997
          | [\xA0-\xDF]         # Half-width katakana
        )*? \x8C\x95)

        matches as follows:

        NODE                     EXPLANATION
        ----------------------------------------------------------------------
        (?sx-im:                 group, but do not capture (with . matching
                                 \n) (disregarding whitespace and comments)
                                 (case-sensitive) (with ^ and $ matching
                                 normally):
        ----------------------------------------------------------------------
          ^                        the beginning of the string
        ----------------------------------------------------------------------
          (?:                      group, but do not capture (0 or more
                                   times (matching the least amount
                                   possible)):
        ----------------------------------------------------------------------
            [\x00-\x7F]              any character of: '\x00' to '\x7F'
        ----------------------------------------------------------------------
           |                        OR
        ----------------------------------------------------------------------
            [\x81-\x9F\xE0-          any character of: '\x81' to '\x9F',
            \xFC]                    '\xE0' to '\xFC'
        ----------------------------------------------------------------------
            [\x40-\x7E\x80-          any character of: '\x40' to '\x7E',
            \xFC]                    '\x80' to '\xFC'
        ----------------------------------------------------------------------
           |                        OR
        ----------------------------------------------------------------------
            [\xA0-\xDF]              any character of: '\xA0' to '\xDF'
        ----------------------------------------------------------------------
          )*?                      end of grouping
        ----------------------------------------------------------------------
          \x8C\x95                 '\x8C\x95'
        ----------------------------------------------------------------------
        )                        end of grouping
        ----------------------------------------------------------------------
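To see what the anchored pattern buys you on the two test strings from the question, here is a minimal sketch. It reuses $search, $text1, $text2 and $encoding from the original post; the helpers sjis_chars, unanchored and anchored are hypothetical names added only for illustration, and /o is left out so the subs can be called with different strings.

```perl
use strict;
use warnings;

my $search = "\x8C\x95";
my $text1  = "Text 1 \x90\x56\x8C\x95\x93\xB9";
my $text2  = "Text 2 \x94\x92\x8C\x8C\x95\x61";

my $encoding = q{
    [\x00-\x7F]                                 # ASCII/JIS-Roman
  | [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]    # JIS X 0208:1997
  | [\xA0-\xDF]                                 # Half-width katakana
};

# Hypothetical helper: walk the string one Shift-JIS character at a time
# (using the same alternation) and return each character as a hex string.
sub sjis_chars {
    my ($bytes) = @_;
    return map { uc(unpack("H*", $_)) } $bytes =~ /\G($encoding)/gx;
}

# Plain byte search: may match across a character boundary.
sub unanchored { my ($t) = @_; return $t =~ /$search/ ? 1 : 0 }

# Anchored search: only whole characters may be skipped before $search.
sub anchored { my ($t) = @_; return $t =~ /^ (?:$encoding)*? $search/x ? 1 : 0 }

for my $t ($text1, $text2) {
    my @tail = (sjis_chars($t))[-3 .. -1];
    print "last three chars: @tail\n";
    print "  unanchored: ", unanchored($t), "  anchored: ", anchored($t), "\n";
}
```

In $text2 the bytes \x8C\x95 are the second byte of one character followed by the first byte of the next, so the plain search reports a match that is not a real character. The anchored form can only reach $search by stepping over complete characters from the start of the string, so it rejects $text2 and still accepts $text1.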

    --

    flounder

      Thanx for pointing it out, I will surely use it in cases of confusing regexes.
Re: regex question
by tachyon (Chancellor) on Oct 01, 2003 at 11:32 UTC

    Well it is pretty simple.

    The /x lets you have whitespace and comments in your RE.
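A small sketch of what /x changes, using a made-up year-month pattern that is not part of the original question. The variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# The same pattern written compactly and with /x layout and comments.
my $compact  = qr/^\d{4}-\d{2}$/;
my $readable = qr/
    ^ \d{4}     # four-digit year
    -           # literal hyphen
    \d{2} $     # two-digit month
/x;

# Under /x an unescaped space in the pattern is ignored, so a literal
# space has to be escaped (or put in a character class).
my $space_ignored = "a b" =~ /a b/x   ? 1 : 0;   # pattern is really /ab/
my $space_escaped = "a b" =~ /a\ b/x  ? 1 : 0;

print "compact matches\n"  if "2003-10" =~ $compact;
print "readable matches\n" if "2003-10" =~ $readable;
print "unescaped space: $space_ignored, escaped space: $space_escaped\n";
```

This is also why the newlines, indentation and # comments inside $encoding do no harm in the original code: they are interpolated into the pattern and then ignored because of /x.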

    (?:blah) means match 'blah' but don't capture it into $1. This is useful if you want to group using () but not capture. In this case the group wraps the alternation ([\x00-\x7F]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]|[\xA0-\xDF]), i.e. one Shift-JIS character, and the * means 0 or more of them, normally grabbing as many as possible. The ? after the * is a modifier that says don't grab as many as possible, grab as few as possible (non-greedy).
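A minimal sketch of both points, with toy strings that are not from the original question. The variable names are illustrative assumptions.

```perl
use strict;
use warnings;

# Capturing group: the text matched by (foo) is returned as $1.
my ($captured) = "foobar" =~ /(foo)bar/;

# Non-capturing group: (?:foo) only groups, so the first capture is (bar).
my ($first) = "foobar" =~ /(?:foo)(bar)/;

# Greedy * takes as much as it can and backs off to the LAST X;
# non-greedy *? takes as little as it can and stops at the FIRST X.
my $text = "aXbXc";
my ($greedy) = $text =~ /^(.*)X/;
my ($lazy)   = $text =~ /^(.*?)X/;

print "captured: $captured\n";   # foo
print "first:    $first\n";      # bar
print "greedy:   $greedy\n";     # aXb
print "lazy:     $lazy\n";       # a
```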

    The | is alternation, i.e. this|or|that|or|other.

    [A-Z] is a character class that will match the letters A-Z in that case.

    The /s is totally redundant here. It lets the . regex metachar match anything (normally it matches everything except \n), but as there are no . metachars in this RE it does zip.
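For completeness, a tiny sketch of what /s does when a . is actually present. The test string is made up for illustration.

```perl
use strict;
use warnings;

my $two_lines = "one\ntwo";

# Without /s the dot refuses to match the newline between the words.
my $without_s = $two_lines =~ /one.two/  ? 1 : 0;

# With /s the dot matches any character, newline included.
my $with_s    = $two_lines =~ /one.two/s ? 1 : 0;

print "without /s: $without_s\n";   # 0
print "with /s:    $with_s\n";      # 1
```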

    Finally, the /o is a promise to Perl that $encoding and $search will not change during the entire runtime of the script, so Perl can interpolate these scalars, compile the RE once and never recompile it. If you change the interpolated scalars after the first time the RE runs, Perl will not notice with /o.
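A short sketch of that "Perl will not notice" effect. The two subs are hypothetical helpers written for this illustration; each is called first with the pattern "cat" and then with "dog" against the string "cat".

```perl
use strict;
use warnings;

# With /o the pattern is compiled on the first call and then frozen.
sub matches_with_o {
    my ($pat, $str) = @_;
    return $str =~ /$pat/o ? 1 : 0;
}

# Without /o the current value of $pat is used on every call.
sub matches_without_o {
    my ($pat, $str) = @_;
    return $str =~ /$pat/ ? 1 : 0;
}

my @with_o    = map { matches_with_o($_, "cat") }    ("cat", "dog");
my @without_o = map { matches_without_o($_, "cat") } ("cat", "dog");

print "with /o:    @with_o\n";      # 1 1  (second call still uses /cat/)
print "without /o: @without_o\n";   # 1 0
```

In the original script the interpolated scalars never change, so /o is harmless there; the surprise only shows up when the same match is run again with a different value.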

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      10x m8. If I understand correctly, the ?: stands for the "don't capture" thing? That part was confusing me. You don't have to RE if this is so.

Node Type: perlquestion [id://295568]
Approved by Corion