
I was hoping to have it work whether the user's (shell) encoding is ISO-8859-1 or UTF-8. Maybe I'm better off forcefully converting all input and output to UTF-8 and having the code itself deal with Unicode only.

I still feel this is a bug in Perl, though.

Is there a way, perhaps a debugging argument, to see what \w applies to?

Re^3: use locale broken? (\w)
by ikegami (Patriarch) on Mar 17, 2011 at 19:12 UTC

    Maybe I'm better off forcefully converting all input and output to UTF-8

    Yes. For many reasons, it is best to decode all inputs, and encode all output.
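As a rough illustration of the decode-on-input, encode-on-output approach, here is a minimal sketch using the core Encode module. The helper names `to_chars` and `to_bytes` are hypothetical, and the sketch assumes the caller already knows which encoding the bytes are in (for example from the user's environment).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw input bytes into a Perl character string.
# $encoding is whatever the input is assumed to be in, e.g. 'UTF-8' or 'ISO-8859-1'.
sub to_chars {
    my ($bytes, $encoding) = @_;
    return decode($encoding, $bytes);
}

# Hypothetical helper: turn a character string back into bytes for output.
sub to_bytes {
    my ($chars, $encoding) = @_;
    return encode($encoding, $chars);
}

# The letter e-acute as it arrives from a UTF-8 shell and from a Latin-1 shell.
my $from_utf8   = to_chars("\xC3\xA9", 'UTF-8');
my $from_latin1 = to_chars("\xE9",     'ISO-8859-1');

# After decoding, both are the same one-character string (U+00E9),
# so the rest of the program no longer cares which encoding the user had.
for my $s ($from_utf8, $from_latin1) {
    printf "length=%d codepoint=U+%04X\n", length($s), ord($s);
}

# Encode once, at the output boundary.
my $out = to_bytes($from_latin1, 'UTF-8');
```

With this layout the regex engine only ever sees decoded character strings, which is the situation the Unicode support is designed for.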

    I still feel this is a bug in Perl, though.

    I believe Perl doesn't support multi-byte locales (e.g. UTF-8).

    Effort is placed on Unicode instead of adding to the locale system.

    Is there a way, perhaps a debugging argument, to see what \w applies to?

    perlre: Match a "word" character (alphanumeric plus "_").

    The following are equivalent:

    ( No, this is wrong )

    /\w/    # When no locale, when not restricted to ASCII
    /\p{Word}/
    /[_\p{Alnum}]/
    /[_\p{Alphabetic}\p{Nd}]/

    Derived property "Alphabetic". (100,520 codepoints in Perl 5.12.2)
    Unicode character category "Nd". (411 codepoints in Perl 5.12.2)

    Actual lists vary by version of Unicode and thus by version of Perl.
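One way to see what \w applies to on a given Perl is simply to test every code point in a range against it. The following is a minimal sketch, not a built-in debugging facility; the helper names `cp_matches`, `count_matching` and `differing` are hypothetical. It can also be used to check whether two character classes really are equivalent on the Perl at hand. Recent versions of perlrecharclass document \w as also matching marks and connector punctuation, which would explain why the equivalence above does not hold, but the result depends on the Perl and Unicode version, so no output is claimed here.

```perl
use strict;
use warnings;

# Hypothetical helper: does the character at code point $cp match $re?
# The string is upgraded so Unicode rules apply even below U+0100.
sub cp_matches {
    my ($re, $cp) = @_;
    my $ch = chr($cp);
    utf8::upgrade($ch);
    return $ch =~ $re ? 1 : 0;
}

# Count the code points in 0..$max that match $re.
# Keep $max below 0xD800 so that surrogates are never generated.
sub count_matching {
    my ($re, $max) = @_;
    return scalar grep { cp_matches($re, $_) } 0 .. $max;
}

# List the code points in 0..$max on which the two patterns disagree.
sub differing {
    my ($re_a, $re_b, $max) = @_;
    return grep { cp_matches($re_a, $_) != cp_matches($re_b, $_) } 0 .. $max;
}

# How many code points up to U+036F does \w match on this Perl?
printf "\\w matches %d code points in U+0000..U+036F\n",
    count_matching(qr/\w/, 0x36F);

# Where do \w and the proposed equivalent disagree in that range?
my @diff = differing(qr/\w/, qr/[_\p{Alphabetic}\p{Nd}]/, 0x36F);
printf "U+%04X\n", $_ for @diff;
```

Raising the upper bound covers more of Unicode at the cost of run time; the range here is kept small so the sketch finishes instantly.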