in reply to question about Encode::decode('iso-8859-1', ...)

If by "a simple scalar (unblessed, etc.)" you mean "any numeric value, or any string value that does not have the utf8 flag set", then no, there is no value of $x for which test($x) returns 0.

But as hinted by massa above, any scalar with the utf8 flag turned on will cause the script to die with a run-time error:

Wide character in subroutine entry at /.../Encode.pm line ...
because you cannot "decode()" a string into perl-internal utf8 if it is already flagged as being perl-internal utf8.

There does seem to be some suggestion of discrepancy between the Encode man page and the behavior of "eq" and "ne"; the man page says:

...to convert ISO-8859-1 data to a string in Perl’s internal format:

$string = decode("iso-8859-1", $octets);

CAVEAT: When you run "$string = decode("utf8", $octets)", then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).

(Update: thanks to almut for catching/explaining how I misread this point.)

But the following script (when run with perl 5.8.8 on darwin) shows that the flag setting seems to have no effect on "eq" for the characters in question (the "high table" portion of 8859-1) -- every output line says "(flag diff...) decoding ... makes no difference":

#!/usr/bin/perl
use Encode qw/encode decode is_utf8/;

for my $scalar ( map { encode( 'iso-8859-1', chr( $_ )) } 0xa0 .. 0xff ) {
    printf( "decoding %s makes %s difference\n",
            $scalar, ( test( $scalar ) ? "no" : "some sort of" ));
}

sub test {
    my $x = shift;
    my $y = Encode::decode('iso-8859-1', $x);
    print "(flag diff...) " if ( is_utf8( $x ) ne is_utf8( $y ));
    if ($x eq $y) {
        return 1;
    } else {
        return 0;
    }
}

So I wonder whether there are any perl versions or installations where the caveat actually applies to "eq" and "ne", or whether there is some other comparison operator on my version/installation that would catch the difference in the flag setting.

Re^2: question about Encode::decode('iso-8859-1', ...)
by almut (Canon) on Mar 07, 2009 at 16:58 UTC

    Whenever you mix a non-decoded (binary, i.e. octets) string with a text string (utf8 flag on), Perl will silently upgrade the binary string, assuming it's in ISO-8859-1 encoding. "Mixing" here refers to actions such as comparing (as with eq), regex matches, concatenation, etc.

    Thus, even though the strings $x and $y are different here with respect to their internal representation (as can be shown with Devel::Peek::Dump() — e.g. a0 ($x, binary) vs. c2 a0 ($y, utf8) ), this difference does not show up in the comparison result, because the binary string ($x) is implicitly upgraded for the comparison.
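A minimal sketch of that implicit upgrade, assuming a stock Encode installation (the variable names are only illustrative). It checks the internal flag of each string with utf8::is_utf8 and then compares the two with eq:

```perl
use strict;
use warnings;
use Encode qw( decode );

# $x is a one-byte binary string; $y is the decoded text string.
my $x = "\xA0";
my $y = decode('iso-8859-1', $x);

# The internal flag differs between the two scalars ...
my $flag_x = utf8::is_utf8($x) ? 1 : 0;
my $flag_y = utf8::is_utf8($y) ? 1 : 0;

# ... but eq compares characters, not the internal byte buffers.
my $same = ($x eq $y) ? 1 : 0;

printf "flag(x)=%d flag(y)=%d eq=%d\n", $flag_x, $flag_y, $same;
```

Replacing the printf with Devel::Peek::Dump($x) and Devel::Peek::Dump($y) should show the a0 vs. c2 a0 buffers mentioned above.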

    Also, the caveat is talking about decode("utf8",...) (not decode("iso-8859-1",...)), so it doesn't really apply here anyway...

Re^2: question about Encode::decode('iso-8859-1', ...)
by ikegami (Patriarch) on Mar 07, 2009 at 22:45 UTC

    because you cannot "decode()" a string into perl-internal utf8 if it is already flagged as being perl-internal utf8.

    Not true. It would be a bug if the internal representation mattered (by definition of internal), and the following demonstrates that it doesn't.

    use strict;
    use warnings;
    use Encode qw( encode decode );

    sub test {
        my ($enc, $orig) = @_;
        my $bin = encode($enc, $orig);
        utf8::downgrade my $bin_dn = $bin;  # UTF8=0
        utf8::upgrade   my $bin_up = $bin;  # UTF8=1
        my $txt_dn = decode($enc, $bin_dn);
        my $txt_up = decode($enc, $bin_up);
        printf("%-11s %d %d\n",
            "$enc:",
            $txt_dn eq $orig ? 1 : 0,
            $txt_up eq $orig ? 1 : 0,
        );
    }

    test('iso-8859-1', "A\x{E2}");
    test('UTF-8',      "A\x{E2}\x{2660}");
    test('UTF-16le',   "A\x{E2}\x{2660}");

    iso-8859-1: 1 1
    UTF-8:      1 1
    UTF-16le:   1 1

    I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.

    • You can have a binary/encoded string with UTF8=0
    • You can have a binary/encoded string with UTF8=1
    • You can have a text/decoded string with UTF8=0
    • You can have a text/decoded string with UTF8=1
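The four combinations in the list above can be sketched as follows. This is only an illustration with made-up variable names: it stores one text string and one encoded string in both internal formats and checks that neither the length nor the eq result depends on the flag.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $text = "A\x{E2}";                  # text: two characters
my $bin  = encode('UTF-8', $text);     # binary: three octets

# Same text, both internal storage formats.
my $text_dn = $text; utf8::downgrade($text_dn);   # text,   UTF8=0
my $text_up = $text; utf8::upgrade($text_up);     # text,   UTF8=1

# Same octets, both internal storage formats.
my $bin_dn = $bin; utf8::downgrade($bin_dn);      # binary, UTF8=0
my $bin_up = $bin; utf8::upgrade($bin_up);        # binary, UTF8=1

for my $row ( [ 'text', $text_dn, $text_up ], [ 'binary', $bin_dn, $bin_up ] ) {
    my ($label, $dn, $up) = @$row;
    printf "%-6s flags=%d/%d length=%d/%d equal=%d\n",
        $label,
        utf8::is_utf8($dn) ? 1 : 0, utf8::is_utf8($up) ? 1 : 0,
        length($dn), length($up),
        $dn eq $up ? 1 : 0;
}
```

Whether a string is text or binary is a matter of what the program means by it; the flag only records how Perl happens to store it.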
      I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.

      Based on having seen the result of this snippet:

      perl -MEncode -e '$x="\x{0432}"; $y=decode("utf8",$x)'
      my intention was to say that when you pass that sort of string to Encode::decode(), you get a run-time error. I had assumed that "that sort of string" was most easily understood as one whose utf8 flag was already on.
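For anyone wanting to reproduce that without the script dying, here is a small sketch that traps the error with eval. The exact wording of the message is not assumed, since it may differ between Encode versions:

```perl
use strict;
use warnings;
use Encode qw( decode );

# U+0432 does not fit in one byte, so this cannot be a string of octets.
my $wide = "\x{0432}";

# decode() is expected to croak here; eval traps the error so the
# script can carry on and report it.
my $ok   = eval { decode('utf8', $wide); 1 };
my $died = $ok ? 0 : 1;

print $died ? "decode died: $@" : "decode returned normally\n";
```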

      You've shown that things are actually deeper and more complicated -- I added "is_utf8()" to your script, and confirmed that Encode::decode was working without croaking, with the input string's utf8 flag on as well as off.

      This is a surprising effect of the utf8::upgrade/downgrade functions, and I'm glad to know about it, though it goes a bit beyond the scope of the OP (and most applications that involve encoding issues).

        This is a surprising effect of the utf8::upgrade/downgrade functions

        What surprising effect? Their purpose is to convert a scalar's internal encoding, and I used them for that purpose.

        If it helps clear up some confusion, change

        utf8::downgrade my $bin_dn = $bin;  # UTF8=0
        utf8::upgrade   my $bin_up = $bin;  # UTF8=1

        to

        my $bin_dn = $bin;                      # UTF8=0
        chop my $bin_up = $bin . "\x{2660}";    # UTF8=1

        Practical use for utf8::upgrade: Ensure "Unicode semantics" are used in regex matches. (But note that work is being done to remove such dependencies on this internal information.)
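A rough sketch of the regex dependency, assuming a perl run without "use feature 'unicode_strings'" and without the /u or /a modifiers that later releases provide. The two strings compare equal, yet \w may treat them differently:

```perl
use strict;
use warnings;

# No unicode_strings feature and no /u or /a modifier, so the regex
# falls back on the string's internal format to choose its rules.
my $native   = "\xE9";        # e-acute, UTF8=0
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, UTF8=1

my $native_is_word   = ($native   =~ /\w/) ? 1 : 0;
my $upgraded_is_word = ($upgraded =~ /\w/) ? 1 : 0;

printf "equal=%d native=%d upgraded=%d\n",
    ($native eq $upgraded ? 1 : 0), $native_is_word, $upgraded_is_word;
```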

        Practical use for utf8::downgrade: Ensure a string is a string of bytes (only contains chars 0-255), such as in Encode::decode and in Net::SFTP::Foreign::write.
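As an illustration of the downgrade case, here is a sketch built around a hypothetical helper, as_bytes (not part of any module). It relies on the optional second argument of utf8::downgrade (FAIL_OK), which makes the call return false instead of dying when the string holds a character above 255:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a UTF8=0 copy of $s, or undef if $s
# contains a character above 255 and so cannot be a string of bytes.
sub as_bytes {
    my ($s) = @_;
    # FAIL_OK (second argument) makes downgrade return false rather than die.
    return utf8::downgrade($s, 1) ? $s : undef;
}

my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);             # UTF8=1, all chars still <= 255

my $bytes = as_bytes($upgraded);      # can be downgraded
my $wide  = as_bytes("\x{2660}");     # cannot be downgraded

printf "bytes: flag=%d length=%d\n", (utf8::is_utf8($bytes) ? 1 : 0), length $bytes;
print  "wide:  ", (defined $wide ? "ok" : "not a byte string"), "\n";
```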