Re^2: question about Encode::decode('iso-8859-1', ...)

because you cannot "decode()" a string into perl-internal utf8 if it is already flagged as being perl-internal utf8.

Not true. It would be a bug if the internal representation mattered (by definition of internal), and the following demonstrates that it doesn't.

use strict;
use warnings;

use Encode qw( encode decode );

sub test {
   my ($enc, $orig) = @_;

   my $bin = encode($enc, $orig);

   utf8::downgrade my $bin_dn = $bin;  # UTF8=0
   utf8::upgrade   my $bin_up = $bin;  # UTF8=1

   my $txt_dn = decode($enc, $bin_dn);
   my $txt_up = decode($enc, $bin_up);

   printf("%-11s %d %d\n",
      "$enc:",
      $txt_dn eq $orig ? 1 : 0,
      $txt_up eq $orig ? 1 : 0,
   );
}

test('iso-8859-1', "A\x{E2}");
test('UTF-8',      "A\x{E2}\x{2660}");
test('UTF-16le',   "A\x{E2}\x{2660}");

iso-8859-1: 1 1
UTF-8:      1 1
UTF-16le:   1 1

I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.

You can have a binary/encoded string with UTF8=0
You can have a binary/encoded string with UTF8=1
You can have a text/decoded string with UTF8=0
You can have a text/decoded string with UTF8=1
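
The four combinations above can be observed directly with `utf8::is_utf8`. This is only a minimal sketch, assuming core Perl plus Encode; the helper name `flag` is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

use Encode qw( encode );

# Hypothetical helper: report the internal UTF8 flag as 0 or 1.
sub flag { utf8::is_utf8($_[0]) ? 1 : 0 }

my $txt = "A\x{E2}";               # text/decoded: two characters
my $bin = encode('UTF-8', $txt);   # binary/encoded: three bytes

# Copy each string into both internal representations.
utf8::downgrade(my $bin_dn = $bin);
utf8::upgrade(  my $bin_up = $bin);
utf8::downgrade(my $txt_dn = $txt);
utf8::upgrade(  my $txt_up = $txt);

# Print the flag and the length (in characters) of each copy.
printf("%-7s UTF8=%d length=%d\n", @$_)
   for [ 'bin_dn', flag($bin_dn), length($bin_dn) ],
       [ 'bin_up', flag($bin_up), length($bin_up) ],
       [ 'txt_dn', flag($txt_dn), length($txt_dn) ],
       [ 'txt_up', flag($txt_up), length($txt_up) ];

# The flag differs, the string value does not.
print "bin copies equal\n" if $bin_dn eq $bin_up;
print "txt copies equal\n" if $txt_dn eq $txt_up;
```

The flag changes between the `_dn` and `_up` copies, but `length` and `eq` see the same string either way, which is why the flag says nothing about whether a string is encoded or decoded.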


Re^3: question about Encode::decode('iso-8859-1', ...) by graff (Chancellor) on Mar 08, 2009 at 04:05 UTC
I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.

Based on having seen the result of this snippet:

perl -MEncode -e '$x="\x{0432}"; $y=decode("utf8",$x)'

my intention was to say that when you pass that sort of string to Encode::decode(), you get a run-time error. I had assumed that "that sort of string" was most easily understood as one whose utf8 flag was already on.

You've shown that things are actually deeper and more complicated -- I added "is_utf8()" to your script, and confirmed that Encode::decode was working without croaking, with the input string's utf8 flag on as well as off. This is a surprising effect of the utf8::upgrade/downgrade functions, and I'm glad to know about it, though it goes a bit beyond the scope of the OP (and most applications that involve encoding issues).
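
To separate the two cases discussed here, the following sketch compares a string containing a character above 255 with a string of UTF-8 bytes that merely happens to be stored with the UTF8 flag on. It assumes only core Perl plus Encode; the helper name `can_decode` is made up for illustration, and the exact error text from `decode` is deliberately not checked since it may vary between Encode versions.

```perl
use strict;
use warnings;

use Encode qw( decode );

# Hypothetical helper: returns 1 if decode() succeeds, 0 if it dies.
sub can_decode {
   my ($enc, $str) = @_;
   return eval { decode($enc, $str); 1 } ? 1 : 0;
}

# Already-decoded text: one character above 255 (UTF8=1).
my $wide = "\x{0432}";

# The UTF-8 encoding of U+0432 as bytes, stored with UTF8=1.
utf8::upgrade(my $bytes_up = "\xD0\xB2");

printf("%-9s UTF8=%d decode %s\n", @$_)
   for [ 'wide',     utf8::is_utf8($wide)     ? 1 : 0,
         can_decode('UTF-8', $wide)     ? 'ok' : 'died' ],
       [ 'bytes_up', utf8::is_utf8($bytes_up) ? 1 : 0,
         can_decode('UTF-8', $bytes_up) ? 'ok' : 'died' ];
```

Both inputs have the flag on, yet only the one holding a character that cannot be a byte makes `decode` die, so the flag alone does not seem to be what triggers the error.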
Re^4: question about Encode::decode('iso-8859-1', ...) by ikegami (Patriarch) on Mar 08, 2009 at 04:53 UTC
This is a surprising effect of the utf8::upgrade/downgrade functions

What surprising effect? Their purpose is to convert a scalar's internal encoding, and I used them for that purpose. If it helps clear up some confusion, change

utf8::downgrade my $bin_dn = $bin;  # UTF8=0
utf8::upgrade   my $bin_up = $bin;  # UTF8=1

to

my $bin_dn = $bin;                         # UTF8=0
chop my $bin_up = $bin . "\x{2660}";       # UTF8=1

Practical use for `utf8::upgrade`: Ensure "Unicode semantics" are used in regex matches. (But note that work is being done to remove such dependencies on this internal information.)

Practical use for `utf8::downgrade`: Ensure a string is a string of bytes (only contains chars 0-255), such as in `Encode::decode` and in `Net::SFTP::Foreign::write`.
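
As a rough illustration of the `utf8::downgrade` use mentioned above, the sketch below uses its optional second argument (FAIL_OK) to test whether a string contains only characters 0-255, and shows that the `chop` trick yields the same string with the flag on. The helper name `is_byte_string` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character is in the range 0-255.
# It downgrades a copy, so the caller's scalar is left untouched.
sub is_byte_string {
   my ($str) = @_;
   # With a true second argument, utf8::downgrade returns false
   # instead of dying when the string cannot be downgraded.
   return utf8::downgrade($str, 1) ? 1 : 0;
}

my $bytes = "A\x{E2}";

# Appending a wide character forces the UTF8=1 representation;
# chop then removes that character again, leaving the flag on.
chop(my $bytes_up = $bytes . "\x{2660}");

printf("UTF8=%d same=%d bytes=%d\n",
   utf8::is_utf8($bytes_up) ? 1 : 0,
   $bytes_up eq $bytes      ? 1 : 0,
   is_byte_string($bytes_up),
);

printf("wide: bytes=%d\n", is_byte_string("A\x{2660}"));
```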