in reply to Re^4: performance of length() in utf-8
in thread performance of length() in utf-8
Reading the Unicode Support section of perlguts, as well as perluniintro, perlunitut, and perlunicode, might help clarify things further.
If Perl always used UTF-8 for internal operation, things would be slow (as per the OP). So for strings that are representable in the platform's native eight-bit character set, it does not. Specifically (from perluniintro):
"Internally, Perl currently uses either whatever the native eight-bit character set of the platform (for example Latin-1) is, defaulting to UTF-8, to encode Unicode strings. Specifically, if all code points in the string are 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it uses UTF-8."

So until Perl encounters a reason, it will not flip the UTF8 flag you are seeing via Devel::Peek. In your scenario, your XML parser sees the UTF-8 encoding declaration at the top of the file, and so the flag gets set. Note that if you run

    perl -MDevel::Peek -E "Dump chr 199"

you get something like

    SV = PV(0x15ceba8) at 0x15ed468
      REFCNT = 1
      FLAGS = (PADTMP,POK,READONLY,pPOK)
      PV = 0x16359c0 "\307"\0
      CUR = 1
      LEN = 12

The UTF8 flag is not set, and the same character is being stored according to the local code page.
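The same flag can also be checked from inside a script without Devel::Peek. The following is only a minimal sketch, assuming an ASCII-based platform (not EBCDIC) and a reasonably modern Perl; the variable names are made up for illustration. It uses the core functions utf8::is_utf8 and utf8::upgrade to suggest that the flag describes internal storage, not the logical content of the string.

```perl
use strict;
use warnings;

# utf8::is_utf8 reports the internal UTF8 flag; it is available
# without "use utf8". All variable names here are illustrative.
my $narrow = chr 199;       # code point 0xC7, fits the eight-bit form
my $wide   = chr 0x263A;    # code point above 0xFF, forces UTF-8 storage

my $upgraded = $narrow;     # copy of the narrow string...
utf8::upgrade($upgraded);   # ...switched to UTF-8 storage internally

for my $item ( [ narrow => $narrow ], [ wide => $wide ], [ upgraded => $upgraded ] ) {
    my ( $name, $str ) = @$item;
    printf "%-8s length=%d UTF8 flag=%s\n",
        $name, length($str), ( utf8::is_utf8($str) ? "on" : "off" );
}

# Logically the two strings are still the same single character.
print "upgraded eq narrow: ", ( $upgraded eq $narrow ? "yes" : "no" ), "\n";
```

All three strings should report a length of 1; only the storage differs.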
The thing that seems to be missing from your thinking is serialization. When you feed a string through encode_utf8, you are saying: take this logical object and encode it for communication via a channel that expects UTF-8, much like you might have a channel that expects JSON or a channel that expects little-endian. The result is the encoded byte stream, and as far as Perl is concerned it is just a string of octets: logically, it contains no wide characters, though it may have a number of high-bit characters. You need to decode the stream in order for it to make sense logically. Now, if you print that stream to a UTF-8 terminal, it'll look right, and if you print it to a Windows-1252 terminal, you'll get junk.
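To make the serialization point concrete, here is a small sketch using encode_utf8 and decode_utf8 from the core Encode module. The variable names are illustrative only, and the byte values in the comments assume an ASCII-based platform.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $text  = chr 199;               # one logical character, U+00C7
my $bytes = encode_utf8($text);    # serialized form: octets 0xC3 0x87
my $back  = decode_utf8($bytes);   # decoded back into a logical string

printf "text:  length=%d\n", length($text);
printf "bytes: length=%d hex=%s UTF8 flag=%s\n",
    length($bytes), unpack( "H*", $bytes ),
    ( utf8::is_utf8($bytes) ? "on" : "off" );
printf "back:  length=%d, same as text: %s\n",
    length($back), ( $back eq $text ? "yes" : "no" );
```

The encoded string is two high-bit octets with no UTF8 flag, so length() counts 2; only after decoding does it behave as the single character again.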
Hopefully this helps?
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Re^6: performance of length() in utf-8
by hippo (Archbishop) on Mar 11, 2016 at 23:21 UTC