because you cannot "decode()" a string into perl-internal utf8 if it is already flagged as being perl-internal utf8.

Not true. It would be a bug if the internal representation mattered (by definition of internal), and the following demonstrates that it doesn't.

use strict;
use warnings;

use Encode qw( encode decode );

sub test {
   my ($enc, $orig) = @_;

   # Encode the text, then store the same bytes in both internal formats.
   my $bin = encode($enc, $orig);

   utf8::downgrade  my $bin_dn = $bin;  # UTF8=0
   utf8::upgrade    my $bin_up = $bin;  # UTF8=1

   my $txt_dn = decode($enc, $bin_dn);
   my $txt_up = decode($enc, $bin_up);

   # Print 1 for each copy that decodes back to the original text.
   printf("%-11s %d %d\n",
      "$enc:",
      $txt_dn eq $orig ? 1 : 0,
      $txt_up eq $orig ? 1 : 0,
   );
}

test('iso-8859-1', "A\x{E2}");
test('UTF-8',      "A\x{E2}\x{2660}");
test('UTF-16le',   "A\x{E2}\x{2660}");

iso-8859-1: 1 1
UTF-8:      1 1
UTF-16le:   1 1

I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.

Re^3: question about Encode::decode('iso-8859-1', ...)
by graff (Chancellor) on Mar 08, 2009 at 04:05 UTC
    I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.

    Based on having seen the result of this snippet:

    perl -MEncode -e '$x="\x{0432}"; $y=decode("utf8",$x)'
    my intention was to say that when you pass that sort of string to Encode::decode(), you get a run-time error. I had assumed that "that sort of string" was most easily understood as one whose utf8 flag was already on.

    You've shown that things are actually deeper and more complicated -- I added "is_utf8()" to your script, and confirmed that Encode::decode was working without croaking, with the input string's utf8 flag on as well as off.
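The distinction can be sketched as follows. This is a minimal illustration rather than anything from the thread: the variable names are made up, and the exact error text differs between Encode versions, so the script only reports whether decode() died.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The UTF-8 bytes of U+0432, stored in the upgraded internal format.
# The UTF8 flag is on, but every character is still in the range 0-255.
my $bytes_up = "\xD0\xB2";
utf8::upgrade($bytes_up);
my $flag = utf8::is_utf8($bytes_up) ? 1 : 0;

# A string that already holds a character above 255 (already decoded text).
my $wide = "\x{0432}";

# decode() accepts the flagged byte string...
my $ok = eval { decode('UTF-8', $bytes_up) };

# ...but dies on the wide character, whatever the message wording is.
my $err = eval { decode('UTF-8', $wide); 1 } ? '' : ($@ || 'unknown error');

printf "upgraded bytes: flag=%d, decoded to U+%04X\n", $flag, ord $ok;
print  "wide string:    ", ($err ne '' ? "decode died" : "no error"), "\n";
```

What trips decode() here is the content of the string (a character that cannot be a byte), not the state of the UTF8 flag.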

    This is a surprising effect of the utf8::upgrade/downgrade functions, and I'm glad to know about it, though it goes a bit beyond the scope of the OP (and most applications that involve encoding issues).

      This is a surprising effect of the utf8::upgrade/downgrade functions

      What surprising effect? Their purpose is to convert a scalar's internal encoding, and I used them for that purpose.

      If it helps clear up some confusion, change

      utf8::downgrade  my $bin_dn = $bin;  # UTF8=0
      utf8::upgrade    my $bin_up = $bin;  # UTF8=1

      to

      my $bin_dn = $bin;                   # UTF8=0
      chop my $bin_up = $bin . "\x{2660}"; # UTF8=1

      Practical use for utf8::upgrade: Ensure "Unicode semantics" are used in regex matches. (But note that work is being done to remove such dependencies on this internal information.)
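A small sketch of that dependency, assuming a script that enables neither the unicode_strings feature nor a locale (so the regex engine uses its default rules), and a perl new enough to accept the /u modifier (5.14 or later). The variable names are illustrative only.

```perl
use strict;
use warnings;

# The same one-character string (U+00E9) in both internal formats.
my $dn = "\xE9"; utf8::downgrade($dn);   # UTF8=0
my $up = "\xE9"; utf8::upgrade($up);     # UTF8=1

# The two scalars compare equal as strings...
my $same = $dn eq $up ? 1 : 0;

# ...yet \w can give different answers under the default regex rules.
my $word_dn = $dn =~ /\w/ ? 1 : 0;
my $word_up = $up =~ /\w/ ? 1 : 0;

# The /u modifier asks for Unicode rules regardless of the internal format.
my $word_dn_u = $dn =~ /\w/u ? 1 : 0;

printf "equal: %d\n", $same;
printf "\\w matches: downgraded=%d upgraded=%d downgraded with /u=%d\n",
   $word_dn, $word_up, $word_dn_u;
```

Under those assumptions the downgraded copy is expected not to match \w while the upgraded copy does. Code that enables unicode_strings or uses /u should see the same answer for both, which is the direction the note above refers to.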

      Practical use for utf8::downgrade: Ensure a string is a string of bytes (contains only chars 0-255), as expected by functions such as Encode::decode and Net::SFTP::Foreign::write.
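One way to use utf8::downgrade as such a check is sketched below. The helper name is_byte_string is hypothetical. It relies on the documented second argument of utf8::downgrade (FAIL_OK), which makes the function return false instead of dying when the string contains a character above 255.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains only chars 0-255,
# whichever internal format it is currently stored in.
sub is_byte_string {
   my ($s) = @_;                          # copy, so the caller's scalar is untouched
   return utf8::downgrade($s, 1) ? 1 : 0; # 1 = FAIL_OK: return false rather than die
}

my $bytes = "A\xE2";
utf8::upgrade($bytes);                    # UTF8=1, but still only chars 0-255

my $text = "A\xE2\x{2660}";               # contains a char above 255

printf "upgraded bytes: %d\n", is_byte_string($bytes);
printf "wide text:      %d\n", is_byte_string($text);
```

Called without the second argument, utf8::downgrade dies on the wide string instead, which may be the better behaviour just before handing data to something that must receive bytes.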