in reply to Re^5: Seeking Perl docs about how UTF8 flag propagates
in thread Seeking Perl docs about how UTF8 flag propagates
I believe you are thinking in terms of particular encodings, which familiarity with the Unicode bug wrongly tempts one to do. I'm not talking about taking an internal representation as a sequence of bytes and then flipping the UTF8 flag on it; I'm talking about the actual strings that the internal representation stands for.
A string in Perl is a sequence of characters, not the sequence of bytes (or octets) that represents those characters in a particular encoding. If length() gives a different answer on two strings, then they are not the same sequence of characters.
In a Unicode world there is one string "fu\x{df}" consisting of three characters. Internally Perl might encode that in one of two different ways, resulting in different byte sequences and a different setting of the UTF8 flag, but it is the same string whichever encoding is used.
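To make that concrete, here is a minimal sketch (the variable names are mine, and the `bytes` pragma is used only to peek at the internal representation, which is not something ordinary code should do). It forces the same three-character string into each internal form and compares them:

```perl
use strict;
use warnings;

# The same three characters: f, u, U+00DF.
my $down = "fu\x{df}";
my $up   = $down;

utf8::downgrade($down);    # one byte per character, UTF8 flag off
utf8::upgrade($up);        # UTF-8 encoded internally, UTF8 flag on

# Character-level view: identical in every respect.
my $same_string = ($down eq $up)                 ? 1 : 0;
my $same_length = (length($down) == length($up)) ? 1 : 0;

# Internal view: the flag and the underlying byte counts differ.
my $flag_down  = utf8::is_utf8($down) ? 1 : 0;
my $flag_up    = utf8::is_utf8($up)   ? 1 : 0;
my $bytes_down = do { use bytes; length($down) };
my $bytes_up   = do { use bytes; length($up) };

print "eq: $same_string, same length: $same_length\n";
print "flag (downgraded/upgraded): $flag_down/$flag_up\n";
print "internal bytes (downgraded/upgraded): $bytes_down/$bytes_up\n";
```

The two scalars compare equal with `eq` and have the same `length`, even though the flag and the internal byte counts are different.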
So in the code below, I would expect verify_upgraded_length and verify_downgraded_length to return a TRUE value for every string input (if they return at all).
    use utf8 ();

    sub verify_upgraded_length {
        my($s) = @_;
        # wrong
        # my $u = utf8::upgrade($s);
        my $u = $s;
        utf8::upgrade($u);
        return length($s) == length($u);
    }

    sub verify_downgraded_length {
        my($s) = @_;
        # wrong
        # my $d = utf8::downgrade($s); # dies if downgrade not possible
        my $d = $s;
        utf8::downgrade($d); # dies if downgrade not possible
        return length($s) == length($d);
    }
[Updated: corrected code, thanks haj++]
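As a rough illustration of the "if they return at all" caveat, the sketch below repeats the two functions and runs them over three sample strings of my own choosing. The `eval` wrappers and the `@results` bookkeeping are additions for the demonstration, not part of the claim itself:

```perl
use strict;
use warnings;

sub verify_upgraded_length {
    my ($s) = @_;
    my $u = $s;
    utf8::upgrade($u);
    return length($s) == length($u);
}

sub verify_downgraded_length {
    my ($s) = @_;
    my $d = $s;
    utf8::downgrade($d);    # dies if downgrade not possible
    return length($s) == length($d);
}

# Sample inputs: pure ASCII, a Latin-1 character, and a character
# above U+00FF that cannot be downgraded.
my @results;
for my $s ("abc", "fu\x{df}", "\x{263a}") {
    my $up   = eval { verify_upgraded_length($s) };
    my $down = eval { verify_downgraded_length($s) };

    # undef here means the function died instead of returning.
    my $up_text   = defined $up   ? ($up   ? "true" : "false") : "died";
    my $down_text = defined $down ? ($down ? "true" : "false") : "died";
    push @results, [$up_text, $down_text];

    # Print code points rather than the raw string to avoid
    # "Wide character" warnings on output.
    my $label = join " ", map { sprintf "U+%04X", ord } split //, $s;
    print "$label: upgraded=$up_text downgraded=$down_text\n";
}
```

For the first two strings both checks return true. For the third, the upgrade check returns true and the downgrade check dies, which is the one case where it does not return at all.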