$ perl -le'
$_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
utf8::encode($_);
utf8::downgrade($_);
print length;
print do { use bytes; length };
'
39
39
Upgrading the string changes only how Perl stores it internally, not its content, yet use bytes; now gives the wrong answer:
$ perl -le'
$_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
utf8::encode($_);
utf8::upgrade($_);
print length;
print do { use bytes; length };
'
39
41 XXX Should be 39
If the XML hasn't been encoded and the desired encoding is UTF-8, use bytes; can give you the right result, but only when Perl happens to be storing the string in its upgraded format, so it's unreliable:
$ perl -le'
$_ = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";
print do { use bytes; length };
'
38 XXX Should be 39
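To show why this is unreliable, here is a minimal sketch (the variable names are mine, not from the original post) that stores the same unencoded string in each of Perl's two internal formats. The two copies compare equal as strings, yet use bytes; reports different lengths for them:

```perl
use strict;
use warnings;

my $xml = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";

# Same 38 characters, stored in each of Perl's two internal formats.
my $downgraded = $xml;  utf8::downgrade($downgraded);
my $upgraded   = $xml;  utf8::upgrade($upgraded);

# Byte length of the internal buffer, which is what use bytes; looks at.
my $len_downgraded = do { use bytes; length($downgraded) };
my $len_upgraded   = do { use bytes; length($upgraded) };

print "equal\n" if $downgraded eq $upgraded;   # prints "equal"
print "$len_downgraded\n";                     # 38
print "$len_upgraded\n";                       # 39
```

Only the upgraded copy happens to match the UTF-8 encoded length, and nothing in the string's value tells you which copy you have.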
In no case is use bytes; the appropriate answer.
Perl has two different formats for storing strings. use bytes; causes opcodes to look directly at the internal buffer of the string, no matter which format was used. Since Perl is free to change how it stores a string internally at any time, use bytes; is quite useless unless you first check which format Perl used for that string.
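As a sketch of an approach that doesn't depend on the internal format, you can encode the string explicitly and take its ordinary length. This assumes the string holds decoded text and uses encode from the core Encode module. The helper name byte_length_as is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not from the original post): the number of bytes
# the decoded text in $text occupies once encoded using $encoding.
sub byte_length_as {
    my ($encoding, $text) = @_;
    return length(encode($encoding, $text));
}

my $xml = "<?xml version=\"1.0\"?><root>\x{C9}ric</root>";

# Same characters, stored in each of the two internal formats.
my $downgraded = $xml;  utf8::downgrade($downgraded);
my $upgraded   = $xml;  utf8::upgrade($upgraded);

print byte_length_as('UTF-8', $downgraded), "\n";   # 39
print byte_length_as('UTF-8', $upgraded), "\n";     # 39
```

Both calls give the same answer because encode works on the string's characters, not on whatever buffer Perl is using to hold them.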
Update: Rephrased for clarity.