match utf8

glassel has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: match utf8 by tobyink (Canon) on Nov 12, 2012 at 13:54 UTC
Unless you're using an ancient version of Perl, `\w` should match any Unicode word character. According to perlre there are over 100,000 characters it matches. `use 5.010; use strict; use warnings; use utf8::all; my $string = "the café"; say "GOT: $1" if $string =~ /(\w{4})/;` Make sure your strings are being interpreted as character strings rather than byte strings, though. (See perlunicode and utf8.)
Re^2: match utf8 by choroba (Cardinal) on Nov 12, 2012 at 14:05 UTC
As shown here, locale can also influence the behaviour of `qr/\w/`. Using `qr/\w/u` should also help.
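A minimal sketch of what the `/u` modifier changes, assuming Perl 5.14 or later (where `/u` and `/a` are available). The sample string is made up for illustration: it holds é as the single byte `\xE9`, so Perl's default rules treat it as a non-word character unless Unicode rules are requested.

```perl
use strict;
use warnings;

# Hypothetical sample: "café" with é stored as one byte (U+00E9),
# i.e. a string Perl has not flagged as UTF-8 internally.
my $native = "caf\xE9";

my ($default) = $native =~ /(\w+)/;     # default rules: ASCII-only above 127 here
my ($unicode) = $native =~ /(\w+)/u;    # /u forces Unicode rules
my ($ascii)   = $native =~ /(\w+)/a;    # /a restricts \w to ASCII

# Print lengths rather than the strings to sidestep output encoding.
printf "default: %d, /u: %d, /a: %d\n",
    map { length } $default, $unicode, $ascii;
```

Note that the script deliberately avoids `use 5.014;`, because that would enable the `unicode_strings` feature and make the default match behave like the `/u` one.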
Re: match utf8 by gnork (Scribe) on Nov 12, 2012 at 13:54 UTC
`\p{Letter}` is a Unicode-aware character class close to `\w`, though narrower: it matches letters only, while `\w` also matches digits and the underscore.
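To see how far `\p{Letter}` and `\w` overlap, here is a small sketch. The test string is an invented example containing a letter outside ASCII, an underscore, and a non-ASCII decimal digit; `\p{Word}` is the property that perlrecharclass documents as matching the same set as `\w`.

```perl
use strict;
use warnings;

# Hypothetical sample: "café", an underscore, then U+0663
# (ARABIC-INDIC DIGIT THREE). The character above 0xFF makes this
# a character string, so Unicode rules apply to the matches.
my $s = "caf\x{E9}_\x{663}";

my @letters = $s =~ /(\p{Letter}+)/g;   # runs of letters only
my @words   = $s =~ /(\w+)/g;           # letters, digits, underscore, marks

printf "letter runs: %d (first has %d chars)\n", scalar @letters, length $letters[0];
printf "word runs:   %d (first has %d chars)\n", scalar @words,   length $words[0];

# \p{Word} covers the whole string, just as \w does.
my $all_word = $s =~ /^\p{Word}+$/ ? 1 : 0;
```

If letters are all you want, `\p{Letter}` (or its short form `\p{L}`) is the tighter choice; if you want the Unicode counterpart of `\w`, `\p{Word}` is the closer match.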
Re: match utf8 by choroba (Cardinal) on Nov 12, 2012 at 13:43 UTC
Can you give more information? What characters are you trying to match? Are you handling the encoding right?
Re: match utf8 by ikegami (Patriarch) on Nov 13, 2012 at 02:40 UTC
None of them deal with UTF-8. The regex matching engine expects Unicode codepoints. Decode your input (e.g. using Encode's `decode`) first, then `\w` will work.
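The decode-first advice might look roughly like this. It is only a sketch: the byte string below is a hard-coded stand-in for whatever your program actually reads from a file or socket, and the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw input: the UTF-8 bytes for "café"
# ("\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9).
my $bytes = "caf\xC3\xA9";

# Matching the undecoded bytes: \w sees two separate high bytes,
# not one character, so the match does not cover the whole input.
my ($from_bytes) = $bytes =~ /(\w+)/;

# Decode into a character string first, then match.
my $chars = decode('UTF-8', $bytes);
my ($from_chars) = $chars =~ /(\w+)/;

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;

# Encode again when writing the result out.
print encode('UTF-8', "GOT: $from_chars\n");
```

For file handles, an I/O layer such as `open my $fh, '<:encoding(UTF-8)', $path` does the decoding for you, so the explicit `decode` call is mainly needed for data that arrives by other routes.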