glassel has asked for the wisdom of the Perl Monks concerning the following question:

In regular expressions, \w matches ordinary (e.g. ASCII) word characters but not UTF-8 multibyte characters. Is there a way to match the full class of UTF-8 word characters?

Replies are listed 'Best First'.
Re: match utf8
by tobyink (Canon) on Nov 12, 2012 at 13:54 UTC

    Unless you're using an ancient version of Perl, \w should match any Unicode word character. According to perlre there are over 100,000 characters it matches.

    use 5.010;
    use strict;
    use warnings;
    use utf8::all;

    my $string = "the café";
    say "GOT: $1" if $string =~ /(\w{4})/;

    Make sure your strings are being interpreted as character strings rather than byte strings though. (See perlunicode and utf8.)

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      As shown here, locale can also influence the behaviour of qr/\w/. Using qr/\w/u should also help.
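To illustrate the point about the /u modifier, here is a minimal sketch. It assumes Perl 5.14 or later (where /u and /a exist) and that neither `use locale` nor the `unicode_strings` feature is in effect. The variable names are made up for the example.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: keep the old default rules

# "é" as a single native 8-bit character (code point 0xE9). Perl stores
# this literal internally without the UTF-8 flag.
my $e_acute = "\xE9";

# Default rules: for such a string only ASCII characters count as \w.
my $default_match = $e_acute =~ /\w/ ? 1 : 0;

# /u asks for Unicode rules regardless of the internal representation.
my $unicode_match = $e_acute =~ /\w/u ? 1 : 0;

# /a goes the other way and restricts \w to ASCII word characters.
my $ascii_match = $e_acute =~ /\w/a ? 1 : 0;

print "default: $default_match, /u: $unicode_match, /a: $ascii_match\n";
```

So when \w seems to ignore an accented character in a string that was never decoded or upgraded, adding /u is one way to get Unicode semantics.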
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: match utf8
by gnork (Scribe) on Nov 12, 2012 at 13:54 UTC
    \p{Letter} is the corresponding UTF8 aware character class for \w


    cat /dev/world | perl -e "(/(^.*? \?) 42\!/) && (print $1))"
    errors->(c)
Re: match utf8
by choroba (Cardinal) on Nov 12, 2012 at 13:43 UTC
    Can you give more information? What characters are you trying to match? Are you handling the encoding right?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: match utf8
by ikegami (Patriarch) on Nov 13, 2012 at 02:40 UTC
    None of them deal with UTF-8. The regex matching engine expects Unicode codepoints. Decode your input (e.g. using Encode's decode) first, then \w will work.
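A minimal sketch of the decode-first approach, using Encode's decode. The sample bytes and variable names are assumptions for illustration; in real code the bytes would come from a file, socket, or similar.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "café", as they might arrive from a filehandle
# opened without an encoding layer (sample data, assumed for the demo).
my $bytes = "caf\xC3\xA9";

# On the undecoded byte string, \w+ stops before the two bytes that
# encode "é".
my ($from_bytes) = $bytes =~ /(\w+)/;

# Decode the bytes into a string of Unicode code points.
my $chars = decode('UTF-8', $bytes);

# On the decoded string, \w+ matches all four characters.
my ($from_chars) = $chars =~ /(\w+)/;

printf "bytes: matched %d of %d\n", length $from_bytes, length $bytes;
printf "chars: matched %d of %d\n", length $from_chars, length $chars;
```

Remember to encode again (or use an output layer such as `:encoding(UTF-8)`) before printing the decoded text.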