in reply to Matching/replacing a unicode character only works after decode()
*Why?! Before decoding the utf8 string, how could the string go from input to output unchanged but fail to match the regex?*

Basically, that's because Perl by default treats a binary string as Latin-1 rather than UTF-8. And that's a problem: every byte string, whatever encoding it was actually produced in (UTF-8 or anything else), is also valid Latin-1, so Perl has no way to tell that it should be read differently.
Character \x{b5} is one byte in Latin-1, but two bytes in UTF-8. And \x{3bc} is simply too big to fit in a one-byte encoding at all.
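A minimal sketch (illustrative, not the OP's code) of what the undecoded bytes look like to Perl. The two bytes 0xCE 0xBC are the UTF-8 encoding of \x{3bc}; treated as Latin-1, they are just two unrelated characters:

```perl
use strict;
use warnings;

# UTF-8 encoding of U+03BC (GREEK SMALL LETTER MU) is the byte pair 0xCE 0xBC.
my $bytes = "\xce\xbc";

# Without decoding, Perl sees two Latin-1 characters, neither of which
# is the single code point U+03BC, so the match fails.
printf "length: %d\n", length $bytes;       # prints 2
print "matches\n" if $bytes =~ /\x{3bc}/;   # prints nothing
```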
*Why do I need to decode the utf8 string to match an utf8 character?*

If you have a string in UTF-8 and want to apply regexes to it, get its length in characters, and so on, you always have to decode it first. Why? Backwards compatibility. Perl is old. Other languages (Python, Ruby) broke compatibility to get better Unicode handling; Perl didn't.
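A minimal sketch of the decode-then-match step, using Encode::decode on the same assumed byte string as above; again illustrative rather than the poster's actual code:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xce\xbc";                 # raw UTF-8 bytes for U+03BC
my $chars = decode('UTF-8', $bytes);    # now a string of characters

printf "length: %d\n", length $chars;   # prints 1
$chars =~ s/\x{3bc}/u/;                 # the substitution now works
print "$chars\n";                       # prints "u"
```

Once decoded, length, regexes and substitutions all operate on characters instead of bytes; remember to encode again before writing non-ASCII results back out.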
Re^2: Matching/replacing a unicode character only works after decode()
by Your Mother (Archbishop) on Jul 25, 2014 at 15:32 UTC
by Anonymous Monk on Jul 25, 2014 at 20:15 UTC
by Anonymous Monk on Jul 25, 2014 at 16:45 UTC

Re^2: Matching/replacing a unicode character only works after decode()
by ikegami (Patriarch) on Jul 27, 2014 at 04:03 UTC