utf8::decode vs. Encode::decode with regard to the length function

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.
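The difference in failure behaviour described above can be seen in isolation. This is a minimal sketch, assuming a Perl with the Encode module installed; the variable names are illustrative and not taken from the scripts in this post. It feeds both functions a string that already contains a character above 0xFF.

```perl
use strict;
use warnings;
use Encode ();

# A string that already holds a character above 0xFF (a "wide" character).
my $wide = "\x{8ACB}";

# Encode::decode dies on such input, so trap the exception with eval.
my $decoded      = eval { Encode::decode('utf8', $wide) };
my $encode_error = $@;

# utf8::decode works in place and reports failure through its return value.
my $copy = $wide;
my $ok   = utf8::decode($copy);

print "Encode::decode died: $encode_error" if $encode_error;
print "utf8::decode returned ", ($ok ? "true" : "false"), "\n";
```

Because utf8::decode signals failure only through its return value, a caller that ignores that value will not notice that nothing was decoded.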

What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" utf8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+008b, U+000a. The first three are the bytes of the UTF-8 encoding of U+8acb (e8 ab 8b), each treated as a character of its own, followed by a newline; in other words, they are the double encoding of U+8acb. The file was saved to disk as UTF-8. I then ran the following Perl script:

#!/usr/bin/perl
use strict;
use warnings;
use Encode ();    # the utf8:: functions are built in and need no import

# Read the first line of the file as raw bytes (no decoding layer).
open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
my @lines = <$fh>;
close($fh);
my $test = $lines[0];

# State before any decoding.
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# First decode.
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# Second decode.
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

This gives the following output:

Length: 7
utf8 flag: 
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
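The 7, 4, 2 progression of lengths above can also be reproduced in memory, without test.txt. This is a minimal sketch, assuming the Encode module is available; the variable names are illustrative and not from the post. Each Encode::decode call returns a new string, so the three lengths come from three separate scalars.

```perl
use strict;
use warnings;
use Encode ();

# The seven bytes shown in the first "Hex:" line of the output.
my $bytes = "\xC3\xA8\xC2\xAB\xC2\x8B\x0A";

# Each call returns a new string; the input is left untouched.
my $once  = Encode::decode('UTF-8', $bytes);   # U+00E8 U+00AB U+008B U+000A
my $twice = Encode::decode('UTF-8', $once);    # U+8ACB U+000A

printf "%d %d %d\n", length($bytes), length($once), length($twice);  # 7 4 2
printf "U+%04X\n", ord($twice);                                      # U+8ACB
```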

This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of characters that are utf8-encoded (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing is when I modify the code to call utf8::decode instead:

#!/usr/bin/perl
use strict;
use warnings;
use Encode ();    # the utf8:: functions are built in and need no import

# Read the first line of the file as raw bytes (no decoding layer).
open(my $fh, '<', 'test.txt') or die "Cannot open test.txt: $!";
my @lines = <$fh>;
close($fh);
my $test = $lines[0];

# State before any decoding.
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# First decode, in place.
utf8::decode($test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# Second decode, in place.
utf8::decode($test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

This gives almost identical output; only the final result of length differs:

Length: 7
utf8 flag: 
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a

It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?

Thanks,
Matt
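For anyone who wants to reproduce the post's test.txt without a text editor, here is a minimal sketch. It assumes the file should contain exactly the seven bytes shown in the first "Hex:" line of the output; the variable names are illustrative and not from the post.

```perl
use strict;
use warnings;

my $file = 'test.txt';    # file name taken from the post

# Start from the intended text: U+8ACB followed by a newline.
my $data = "\x{8ACB}\n";

utf8::encode($data);      # correct UTF-8:  e8 ab 8b 0a
utf8::encode($data);      # encoded again:  c3 a8 c2 ab c2 8b 0a

# Write the bytes unchanged; :raw keeps any I/O layer from altering them.
open(my $out, '>:raw', $file) or die "Cannot write $file: $!";
print {$out} $data;
close($out) or die "Cannot close $file: $!";

print unpack('H*', $data), "\n";
```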

Replies are listed 'Best First'.
Re: utf8::decode vs. Encode::decode with regard to the length function by ikegami (Patriarch) on Dec 02, 2010 at 20:10 UTC
use strict;
use warnings;
use feature qw( say );
use Encode qw( );

my $orig = "\xE8\xAB\x86\x0A";
utf8::encode( my $enc_once = $orig );
utf8::encode( my $enc_twice = $enc_once );

say('length = ', length($orig));

{
    say("Using utf8::decode");
    utf8::decode( my $dec_once = $enc_twice );
    utf8::decode( my $dec_twice = $dec_once );
    say('length = ', length($dec_twice));
    say($orig eq $dec_twice ? 'ok' : 'not ok');
}

{
    say("Using Encode::decode");
    my $dec_once = Encode::decode('UTF-8', $enc_twice);
    my $dec_twice = Encode::decode('UTF-8', $dec_once);
    say('length = ', length($dec_twice));
    say($orig eq $dec_twice ? 'ok' : 'not ok');
}

length = 4
Using utf8::decode
length = 4
ok
Using Encode::decode
length = 4
ok

Works fine for me, both ways. I'll take your word for it that you are experiencing a problem, but I'm not going to comb through the hundreds of lines you posted to find what it is. If this doesn't help, please post a minimal demonstration of the problem.
Re^2: utf8::decode vs. Encode::decode with regard to the length function by Anonymous Monk on Dec 03, 2010 at 00:01 UTC
Thanks for the response. Here's a more succinct example that demonstrates this issue in a different way.

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( );

my $orig = "\xE8\xAB\x86\x0A";
utf8::encode( my $enc_once = $orig );
utf8::decode( $enc_once );
print('length after first decode= ', length($enc_once), "\n");
utf8::decode( $enc_once );
print('length after second decode= ', length($enc_once), "\n");

# do it again but don't check the intermediate length
utf8::encode( $enc_once = $orig );
utf8::decode( $enc_once );
utf8::decode( $enc_once );
print('length after second decode= ', length($enc_once), "\n");

Here's the output:

length after first decode= 4
length after second decode= 4
length after second decode= 2

Apparently checking the length before decoding again changes the result of calling length again, which I don't understand.
Re^3: utf8::decode vs. Encode::decode with regard to the length function by dave_the_m (Monsignor) on Dec 03, 2010 at 12:32 UTC
That looks like a length-caching bug, and it's still present in blead. Can you perlbug this please? Dave.	[reply]
Re^4: utf8::decode vs. Encode::decode with regard to the length function by ikegami (Patriarch) on Dec 03, 2010 at 18:46 UTC
Re^3: utf8::decode vs. Encode::decode with regard to the length function by ikegami (Patriarch) on Dec 03, 2010 at 16:59 UTC
Yeah, it's a length-caching bug.

As a test script:

#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 8;

{
    # Baseline.
    my $s = "\xE8\xAB\x86\x0A";
    utf8::downgrade($s);
    is(length($s), 4);
    is($s, "\xE8\xAB\x86\x0A");
    utf8::decode($s);
    is(length($s), 2);
    is($s, "\x{8AC6}\n");
}

{
    # Check for length-caching bug.
    my $s = "\xE8\xAB\x86\x0A";
    utf8::upgrade($s);
    is(length($s), 4);
    is($s, "\xE8\xAB\x86\x0A");
    utf8::decode($s);
    is(length($s), 2);
    is($s, "\x{8AC6}\n");
}

1;

1..8
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
not ok 7
# Failed test at a.pl line 15.
# got: '4'
# expected: '2'
ok 8
# Looks like you failed 1 test of 8.
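The two blocks in the test script above differ only in utf8::downgrade versus utf8::upgrade. As a minimal sketch of why that difference should be invisible to length (variable names are illustrative, not from the thread): both calls change only Perl's internal storage of the string, not the characters it holds, so before any decode the two strings compare equal and have the same length.

```perl
use strict;
use warnings;

my $down = "\xE8\xAB\x86\x0A";
my $up   = $down;

utf8::downgrade($down);   # one byte per character internally, flag off
utf8::upgrade($up);       # UTF-8 internally, flag on

my $same_value  = ($down eq $up) ? 1 : 0;
my $same_length = (length($down) == length($up)) ? 1 : 0;
my @flags       = map { utf8::is_utf8($_) ? 1 : 0 } $down, $up;

# Prints: same value: 1, same length: 1, flags: 0 1
print "same value: $same_value, same length: $same_length, flags: @flags\n";
```

Since the strings are equal at this point, a later utf8::decode ought to leave both with the same length as well, which is what test 7 checks.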