Re: Regex Matching Unicode and Regex Classes

Re^2: Regex Matching Unicode and Regex Classes by McA (Priest) on Nov 02, 2011 at 14:27 UTC
Hi Moritz, but what then is the difference from the third case? Is the "default Unicode semantic" changed to something different when locale is enabled? Why is "U+00E4 LATIN SMALL LETTER A WITH DIAERESIS" under locale something other than a letter that is part of a word? Best regards, Andreas
Re^3: Regex Matching Unicode and Regex Classes by moritz (Cardinal) on Nov 02, 2011 at 14:41 UTC
Short answer: because Unicode and locales don't mix. Long answer: Perl's support for locales dates from a time before the whole encoding/decoding business and Unicode support. So if locales are active, the locale-sensitive parts expect to act on bytes, not on decoded strings. Since the locale is not ISO-8859-1 but UTF-8, encoding to Latin-1 doesn't fix it for you. If anything, you'd need to encode to UTF-8 to see \w match ä, but even then I don't see it matching. So either my understanding of locales is very wrong, or perl is broken (or a mixture thereof). Perl 6 - second systems done right
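To make the bytes-versus-decoded-string distinction concrete, here is a minimal sketch that deliberately leaves locale out (no "use locale"), since results under locale depend on the system's locale settings. It assumes the input arrives as UTF-8 bytes; the variable names are only illustrative and are not taken from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS),
# written out as two raw bytes so the script does not depend on the
# encoding of the source file.
my $bytes = "\xC3\xA4";

# Decoding turns the two bytes into a single character, U+00E4.
my $chars = decode('UTF-8', $bytes);

# No "use locale" here: the decoded string is matched with Unicode rules.
my $decoded_matches = ($chars =~ /^\w$/) ? 1 : 0;

# The undecoded bytes are matched byte by byte, and neither 0xC3 nor 0xA4
# counts as a word character under the default rules for byte strings.
my $bytes_match = ($bytes =~ /^\w+$/) ? 1 : 0;

printf "length: bytes=%d chars=%d\n", length($bytes), length($chars);
printf "\\w matches decoded string: %d, raw bytes: %d\n",
    $decoded_matches, $bytes_match;
```

Without locale, the decoded string should match \w and the raw bytes should not. What the same matches do once "use locale" is switched on is exactly the open question in this thread, so the sketch makes no claim about that case.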
Re^4: Regex Matching Unicode and Regex Classes by McA (Priest) on Nov 02, 2011 at 14:48 UTC
Hi Moritz, that sounds plausible, but not satisfying. ;-) What, then, is the right approach to finding word boundaries with a regex while locale is enabled? Best regards, Andreas
Re^5: Regex Matching Unicode and Regex Classes by moritz (Cardinal) on Nov 02, 2011 at 14:53 UTC
Re^5: Regex Matching Unicode and Regex Classes by choroba (Cardinal) on Nov 02, 2011 at 14:58 UTC
Re^2: Regex Matching Unicode and Regex Classes by McA (Priest) on Nov 02, 2011 at 15:00 UTC
Moritz, thank you for your answers. Have a nice day. Best regards, Andreas

