in reply to umlauts, special chars in perl regular expressions

Answer: it depends on how the data is encoded. If it is utf8, \w will use utf8 rules for what counts as a letter (though this has been hotly debated; you may be better off using \p{Word} instead; see 5.8's perlre).

If the data contains characters in the 128-255 range but is not utf8 encoded, you can either upgrade it to utf8 (see utf8), or put "use locale;" in your code and have the LANG environment variable set to a suitable locale.
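To illustrate the difference, here is a minimal sketch. It assumes no "use locale;" and no unicode_strings feature is in effect, and the helper name word_only is made up for this example. It runs the same \w test against a Latin-1 byte string and against an upgraded copy of it:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole string consists of \w characters.
sub word_only {
    my ($str) = @_;
    return $str =~ /^\w+$/ ? 1 : 0;
}

# "e1ñe" with ñ stored as the single Latin-1 byte 0xF1 (not utf8 encoded).
my $bytes = "e1\xF1e";

# Same characters, upgraded to Perl's internal utf8 representation.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

printf "bytes:    %s\n", word_only($bytes)    ? "yes" : "no";
printf "upgraded: %s\n", word_only($upgraded) ? "yes" : "no";
```

Under those assumptions the byte string should print "no" and the upgraded copy "yes", because only the upgraded string is matched with utf8 rules.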

Re: Re: umlauts, special chars in perl regular expressions
by amonroy (Scribe) on Apr 21, 2004 at 23:32 UTC
    How do you make sure a string is utf-8 encoded?
    I tried this and I don't get what I would expect.
    my $string = 'e1ņe';
    if ($string =~ /^\w+$/) {
        print "yes";
    } else {
        print "no";
    }
    print "\n";
    __OUTPUT___
    yes
      Some of the ways:
      $outstr = $instr;
      utf8::upgrade($outstr);
      # or
      $outstr = Encode::decode("latin-1", $instr);
      # or add and remove a utf8 character:
      $outstr = $instr . "\x{100}";
      chop $outstr;
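As a rough check that the three approaches agree, the sketch below applies each one to the same input and inspects the result with utf8::is_utf8. The variable names and the sample input are assumptions for illustration. The input is taken to be Latin-1 bytes, and "iso-8859-1" is used as the encoding name, which should be equivalent to "latin-1":

```perl
use strict;
use warnings;
use Encode ();

# Assumed sample input: "e1ñe" as Latin-1 bytes (ñ is the byte 0xF1).
my $instr = "e1\xF1e";

# 1. Upgrade a copy in place.
my $via_upgrade = $instr;
utf8::upgrade($via_upgrade);

# 2. Decode the bytes as Latin-1.
my $via_decode = Encode::decode("iso-8859-1", $instr);

# 3. Append a character above 255 (forcing utf8 storage), then remove it.
my $via_chop = $instr . "\x{100}";
chop $via_chop;

for my $pair ( [ upgrade => $via_upgrade ],
               [ decode  => $via_decode ],
               [ chop    => $via_chop ] ) {
    my ($name, $str) = @$pair;
    printf "%-8s utf8 flag: %s, all \\w: %s\n",
        $name,
        utf8::is_utf8($str) ? "on"  : "off",
        $str =~ /^\w+$/     ? "yes" : "no";
}
```

utf8::is_utf8 only reports how Perl stores the string internally. It does not validate that external input really was Latin-1, so the choice of source encoding still has to come from knowledge of the data.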