in reply to utf8::decode vs. Encode::decode with regard to the length function

use strict;
use warnings;
use feature qw( say );

use Encode qw( );

my $orig = "\xE8\xAB\x86\x0A";
utf8::encode( my $enc_once  = $orig );
utf8::encode( my $enc_twice = $enc_once );
say('length = ', length($orig));

{
   say("Using utf8::decode");
   utf8::decode( my $dec_once  = $enc_twice );
   utf8::decode( my $dec_twice = $dec_once );
   say('length = ', length($dec_twice));
   say($orig eq $dec_twice ? 'ok' : 'not ok');
}

{
   say("Using Encode::decode");
   my $dec_once  = Encode::decode('UTF-8', $enc_twice);
   my $dec_twice = Encode::decode('UTF-8', $dec_once);
   say('length = ', length($dec_twice));
   say($orig eq $dec_twice ? 'ok' : 'not ok');
}

length = 4
Using utf8::decode
length = 4
ok
Using Encode::decode
length = 4
ok

Works fine for me, both ways.

I'll take your word for it that you are experiencing a problem, but I'm not going to comb through the hundreds of lines you posted to find what it is. If this doesn't help, please post a minimal demonstration of the problem.

Re^2: utf8::decode vs. Encode::decode with regard to the length function
by Anonymous Monk on Dec 03, 2010 at 00:01 UTC

    Thanks for the response. Here's a more succinct example that demonstrates this issue in a different way.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Encode qw( );

    my $orig = "\xE8\xAB\x86\x0A";

    utf8::encode( my $enc_once = $orig );
    utf8::decode( $enc_once );
    print('length after first decode= ', length($enc_once), "\n");
    utf8::decode( $enc_once );
    print('length after second decode= ', length($enc_once), "\n");

    # do it again but don't check the intermediate length
    utf8::encode( $enc_once = $orig );
    utf8::decode( $enc_once );
    utf8::decode( $enc_once );
    print('length after second decode= ', length($enc_once), "\n");

    Here's the output:

    length after first decode= 4
    length after second decode= 4
    length after second decode= 2

    Apparently, checking the length between the two decodes changes what length returns after the second decode, which I don't understand.

      That looks like a length-caching bug, and it's still present in blead. Can you perlbug this please?

      Dave.

      Yeah, it's a length-caching bug.

      As a test script:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Test::More tests => 8;

      {   # Baseline.
          my $s = "\xE8\xAB\x86\x0A";
          utf8::downgrade($s);  is(length($s), 4);  is($s, "\xE8\xAB\x86\x0A");
          utf8::decode($s);     is(length($s), 2);  is($s, "\x{8AC6}\n");
      }

      {   # Check for length-caching bug.
          my $s = "\xE8\xAB\x86\x0A";
          utf8::upgrade($s);    is(length($s), 4);  is($s, "\xE8\xAB\x86\x0A");
          utf8::decode($s);     is(length($s), 2);  is($s, "\x{8AC6}\n");
      }

      1;

      1..8
      ok 1
      ok 2
      ok 3
      ok 4
      ok 5
      ok 6
      not ok 7
      #   Failed test at a.pl line 15.
      #          got: '4'
      #     expected: '2'
      ok 8
      # Looks like you failed 1 test of 8.
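
For anyone wondering why 2 is the expected length after the second decode, here is a rough sketch that looks at the code points instead of calling length. The helper name code_points is made up for this illustration and is not from the thread. It assumes a perl where the string contents after utf8::decode are correct, which the passing tests 6 and 8 above suggest.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the code points
# of a string as space-separated hex, without calling length().
sub code_points {
    my ($s) = @_;
    return join ' ', map { sprintf '%X', ord($_) } split //, $s;
}

my $s = "\xE8\xAB\x86\x0A";    # four characters, all below 0x100
utf8::encode($s);              # seven bytes: C3 A8 C2 AB C2 86 0A
utf8::decode($s);              # back to the four original characters
my $after_first = code_points($s);

# The four characters E8 AB 86 0A happen to form valid UTF-8 when
# read as bytes, so a second decode turns them into U+8AC6 plus a newline.
utf8::decode($s);
my $after_second = code_points($s);

print "after first decode:  $after_first\n";
print "after second decode: $after_second\n";
```

The second decode really does produce a two-character string, so the stale 4 reported by length is the wrong value, and 2 is the right one.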