wouldbewarrior has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a very simple question: how does Perl treat accented characters such as the Spanish eñe (ñ) or the German umlauts in regular expressions? If I use \w, will these characters be included in or excluded from the match?
--update--

thanks a lot!
  • Comment on umlauts, special chars in perl regular expressions

Re: umlauts, special chars in perl regular expressions
by kvale (Monsignor) on Apr 21, 2004 at 20:29 UTC
    Perl can handle these and more. Perl uses Unicode when warranted, and the UTF-8 encoding in particular. Check out perlunicode for general concepts, and perlretut and perlre for advice on character classes (of which \w is one) and on how to write such characters.

    -Mark

Re: umlauts, special chars in perl regular expressions
by hardburn (Abbot) on Apr 21, 2004 at 20:29 UTC

    If you're using a 5.8-series Perl, the input will be automatically detected as Unicode and Perl will Do The Right Thing. With a 5.6-series Perl, you need to add use utf8;. Anything older than 5.6 probably won't handle it at all.


Re: umlauts, special chars in perl regular expressions
by borisz (Canon) on Apr 21, 2004 at 22:21 UTC
    It also depends on your locale setting, even if you do not use UTF-8. Try this:
    perl -e'print ((sort grep /\w/, map { chr } 0..255), $/)'
    __END__
    0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
    For a German (de) locale, the output depends on whether your system uses UTF-8 or ISO-8859-15 for de.
    export LC_ALL=de_DE@euro
    perl -e 'use locale; print ((sort grep /\w/, map { chr }0..255), $/);'
    __END__
    0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzŠšªŽµžºŒœŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
    Boris
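
The one-liners above can also be written as a small script that switches the locale itself instead of relying on LC_ALL. This is only a sketch: the helper name word_chars_in_locale is made up for illustration, and whether 'de_DE@euro' exists depends on the locales installed on your system.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: returns a reference to the list of characters in
# 0..255 that match \w under the named locale, or nothing if that locale
# is not installed on this system.
sub word_chars_in_locale {
    my ($locale_name) = @_;
    my $old = setlocale(LC_CTYPE);            # remember the current locale
    defined setlocale(LC_CTYPE, $locale_name) or return;
    my @chars;
    {
        use locale;                           # make \w follow LC_CTYPE in this block
        @chars = grep { /\w/ } map { chr } 0 .. 255;
    }
    setlocale(LC_CTYPE, $old);                # restore the previous locale
    return \@chars;
}

for my $name ('C', 'de_DE@euro') {
    my $chars = word_chars_in_locale($name);
    if ($chars) {
        printf "%s: %d word characters\n", $name, scalar @$chars;
    }
    else {
        print "$name: locale not installed\n";
    }
}
```

The count for the de locale should be larger than for C if the locale is available, which is the same effect the two outputs above show.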
Re: umlauts, special chars in perl regular expressions
by ysth (Canon) on Apr 21, 2004 at 22:11 UTC
    Answer: it depends on how the data is encoded. If it is utf8, \w will use utf8 rules for what is a letter (though this has been hotly debated; you may be better off using \p{Word} instead; see 5.8's perlre).

    If the data contains those characters in the 128-255 range but is not utf8 encoded, you can either make it so (see utf8), or add "use locale;" and have the LANG environment variable set to a suitable locale.
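
To illustrate the difference between encoded bytes and a decoded string, here is a minimal sketch. It assumes the input arrives as UTF-8 bytes, that no locale or unicode_strings feature is in effect, and it uses the word "Straße" purely as sample data.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 encoded bytes for "Straße", as they might arrive from a file or socket.
my $octets = "Stra\xC3\x9Fe";

# Decode the bytes into a Perl character string.
my $text = decode('UTF-8', $octets);

# Raw bytes: \xC3 and \x9F are not ASCII word characters, so this fails.
my $octets_match = ($octets =~ /^\w+$/) ? 1 : 0;

# Decoded string: ß is a letter under Unicode rules, so this succeeds.
my $text_match = ($text =~ /^\w+$/) ? 1 : 0;

printf "octets: length %d, match %d\n", length($octets), $octets_match;
printf "text:   length %d, match %d\n", length($text),   $text_match;
```

The byte string has length 7 and the decoded string length 6, which is a quick way to check which of the two you are holding.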

      How do you make sure a string is UTF-8 encoded?
      I tried this and I don't get what I would expect.
      my $string = 'e1ñe';
      if ($string =~ /^\w+$/) { print "yes"; } else { print "no"; }
      print "\n";
      __OUTPUT__
      yes
        Some of the ways:
        $outstr = $instr;
        utf8::upgrade($outstr);
        # or (needs "use Encode;")
        $outstr = Encode::decode("latin-1", $instr);
        # or add and remove a utf8 character:
        $outstr = $instr . "\x{100}";
        chop $outstr;
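
A small sketch comparing the three approaches, assuming the input holds ñ as the single Latin-1 byte 0xF1 and that neither "use locale" nor the unicode_strings feature is in effect. The variable names and the %matches hash are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# "e1\xF1e" is "e1ñe" with ñ stored as the single Latin-1 byte 0xF1.
my $instr = "e1\xF1e";

# Way 1: upgrade a copy in place.
my $upgraded = $instr;
utf8::upgrade($upgraded);

# Way 2: decode the bytes as Latin-1.
my $decoded = Encode::decode('iso-8859-1', $instr);

# Way 3: append a character above 0xFF, then remove it again.
my $chopped = $instr . "\x{100}";
chop $chopped;

# Record, for each variant, whether it is stored as utf8 internally
# and whether the whole string matches \w+.
my %matches;
for my $pair ([original => $instr], [upgraded => $upgraded],
              [decoded  => $decoded], [chopped => $chopped]) {
    my ($name, $str) = @$pair;
    $matches{$name} = {
        is_utf8 => utf8::is_utf8($str)  ? 1 : 0,
        word    => ($str =~ /^\w+$/)    ? 1 : 0,
    };
    printf "%-8s is_utf8=%d matches \\w+=%d\n",
        $name, $matches{$name}{is_utf8}, $matches{$name}{word};
}
```

Only the untouched original should fail the \w+ test; the three converted copies hold the same characters but are stored as utf8 internally, so ñ is treated as a letter.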