umlauts, special chars in perl regular expressions

wouldbewarrior has asked for the wisdom of the Perl Monks concerning the following question:

Re: umlauts, special chars in perl regular expressions by kvale (Monsignor) on Apr 21, 2004 at 20:29 UTC
Perl can handle these and more. Perl uses Unicode when warranted, and the UTF-8 encoding in particular. Check out perlunicode for general concepts, and perlretut and perlre for advice on character classes (of which \w is implicitly one) and on coding such characters. -Mark
Re: umlauts, special chars in perl regular expressions by hardburn (Abbot) on Apr 21, 2004 at 20:29 UTC
If you're using a 5.8-series Perl, the input will be automatically detected as Unicode and Perl will Do The Right Thing. With a 5.6-series Perl, you need to add `use utf8;`. Anything lower than 5.6 probably won't handle it at all. ---- `: () { :\|:& };:` Note: All code is untested, unless otherwise stated
Re: umlauts, special chars in perl regular expressions by borisz (Canon) on Apr 21, 2004 at 22:21 UTC
It also depends on your locale setting, even if you do not use UTF-8. Try this: `perl -e 'print ((sort grep /\w/, map { chr } 0..255), $/)'` which prints `0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz`. The output for de depends on whether your system uses UTF-8 or ISO-8859-15 for de. First run `export LC_ALL=de_DE@euro`, then `perl -e 'use locale; print ((sort grep /\w/, map { chr } 0..255), $/);'` which prints `0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz??ª?µ?º +???ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ`. Boris
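For anyone who wants to repeat this experiment from a script rather than the shell, here is a minimal sketch. The helper name `word_chars_under_locale` is made up for illustration, and the locale name `de_DE@euro` is simply the one from the post above; it may not be installed on your system, in which case the helper returns nothing instead of a count.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: count how many of the characters 0..255 match \w
# while the given LC_CTYPE locale is active. Returns an empty list / undef
# if the locale cannot be set on this system.
sub word_chars_under_locale {
    my ($locale_name) = @_;

    my $old = setlocale(LC_CTYPE);                 # remember current locale
    my $got = setlocale(LC_CTYPE, $locale_name);   # try to switch
    return unless defined $got;

    my $count;
    {
        use locale;    # regexes compiled in this block follow LC_CTYPE
        $count = grep { /\w/ } map { chr } 0 .. 255;
    }

    setlocale(LC_CTYPE, $old);                     # restore previous locale
    return $count;
}

for my $name ('C', 'de_DE@euro') {
    my $n = word_chars_under_locale($name);
    print "$name: ", (defined $n ? "$n word characters" : "locale not available"), "\n";
}
```

In the plain C locale only the digits, the ASCII letters and the underscore should count as word characters; a Latin-1 or Latin-9 German locale, if available, should report more.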
Re: umlauts, special chars in perl regular expressions by ysth (Canon) on Apr 21, 2004 at 22:11 UTC
Answer: it depends how the data is encoded. If it is utf8, \w will use utf8 rules for what is a letter (though this has been hotly debated; you may be better off using \p{Word} instead; see 5.8's perlre). If there are characters in the 128-255 range but the string is not utf8 encoded, you can either make it so (see utf8), or do "use locale;" and have the LANG environment variable set to a suitable locale.
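To make the encoding dependence concrete, here is a small sketch. The helper `is_word_char` is a made-up name, and the script assumes that neither `use locale` nor the `unicode_strings` feature is in scope, which is the default for a plain script like this one.

```perl
use strict;
use warnings;

# Hypothetical helper: does a one-character string match \w?
sub is_word_char {
    my ($char) = @_;
    return $char =~ /^\w$/ ? 1 : 0;
}

# a-umlaut as a single byte in the 128-255 range, not utf8 encoded
my $bytes = "\xE4";

# The same character, but stored internally as utf8
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Under default rules the byte string is matched with ASCII-only \w,
# while the upgraded copy is matched with Unicode rules.
printf "byte string matches \\w:     %s\n", is_word_char($bytes)    ? 'yes' : 'no';
printf "upgraded string matches \\w: %s\n", is_word_char($upgraded) ? 'yes' : 'no';
```

The two strings compare equal with `eq`; only the internal representation differs, and that is what changes the regex behaviour here.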
Re: Re: umlauts, special chars in perl regular expressions by amonroy (Scribe) on Apr 21, 2004 at 23:32 UTC
How do you make sure a string is utf-8 encoded? I tried this and I don't get what I would expect: `my $string = 'e1ñe'; if ($string =~ /^\w+$/) { print "yes"; } else { print "no"; } print "\n";` Output: `yes`
Re: Re: Re: umlauts, special chars in perl regular expressions by ysth (Canon) on Apr 22, 2004 at 01:50 UTC
Some of the ways: `$outstr = $instr; utf8::upgrade($outstr);` or `$outstr = Encode::decode("latin-1", $instr);` or add and remove a utf8 character: `$outstr = $instr . "\x{100}"; chop $outstr;`
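The three approaches can be tried side by side with a sketch like the one below. The sub names are invented for illustration, and the input is assumed to be amonroy's 'e1ñe' stored as Latin-1 bytes (ñ is `\x{F1}`), so `iso-8859-1` is used as the source encoding for the Encode variant.

```perl
use strict;
use warnings;
use Encode ();

# Way 1: upgrade a copy in place
sub via_upgrade {
    my ($in) = @_;
    my $out = $in;
    utf8::upgrade($out);
    return $out;
}

# Way 2: decode, assuming the bytes are Latin-1
sub via_decode {
    my ($in) = @_;
    return Encode::decode('iso-8859-1', $in);
}

# Way 3: append a character above 0xFF, then chop it off again
sub via_append_chop {
    my ($in) = @_;
    my $out = $in . "\x{100}";
    chop $out;
    return $out;
}

my $instr = "e1\x{F1}e";    # 'e1ñe' as Latin-1 bytes (assumed input)

my %ways = (
    upgrade     => \&via_upgrade,
    decode      => \&via_decode,
    append_chop => \&via_append_chop,
);

for my $name (sort keys %ways) {
    my $out = $ways{$name}->($instr);
    printf "%-12s utf8 flag: %s, matches /^\\w+\$/: %s\n",
        $name,
        utf8::is_utf8($out)  ? 'on'  : 'off',
        $out =~ /^\w+$/      ? 'yes' : 'no';
}
```

`utf8::is_utf8` only reports the internal flag; it says nothing about whether the original bytes really were Latin-1, so the decode variant is the one to prefer when the input encoding is known.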