Re: question about Encode::decode('iso-8859-1', ...)
by leslie (Pilgrim) on Mar 07, 2009 at 05:16 UTC
$octets = encode("iso-8859-1", $string);
When you run $octets = encode("utf8", $string), then $octets may not be equal to $string. Though they both contain the same data, the UTF8 flag for $octets is always off. When you encode anything, the UTF8 flag of the result is always off, even when it contains a completely valid utf8 string.
$string = decode("iso-8859-1", $octets);
When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).
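The flag behaviour described in these two caveats can be checked with a short script. This is only a sketch, assuming a Perl with the core Encode module; the helper flag_report and the sample string are made up for illustration and are not part of the Encode documentation.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# Hypothetical helper: report the UTF8 flag and the length of a scalar.
sub flag_report {
    my ($value) = @_;
    return sprintf "flag=%d length=%d", (is_utf8($value) ? 1 : 0), length($value);
}

my $string = "caf\x{E9}";               # text with one non-ASCII character
my $octets = encode("utf8", $string);   # encoded octets: the UTF8 flag is off
my $text   = decode("utf8", $octets);   # decoded text: the UTF8 flag is on

print "octets: ", flag_report($octets), "\n";
print "text:   ", flag_report($text),   "\n";
print "round trip equals original: ", ($text eq $string ? "yes" : "no"), "\n";
```

The length changes along with the flag: the encoded form is counted in octets, the decoded form in characters.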
Re: question about Encode::decode('iso-8859-1', ...)
by massa (Hermit) on Mar 07, 2009 at 03:24 UTC
Yes. If $x is a string with any non-latin1 character ("ς", for instance)... then test() will croak 'wide char something something'...
[]s, HTH, Massa (κς,πμ,πλ)
Re: question about Encode::decode('iso-8859-1', ...)
by borisz (Canon) on Mar 07, 2009 at 02:08 UTC
No.
But test($x) may return 1 if you use 'utf8' instead of 'iso-8859-1'.
| [reply] |
Thanks, but do you mean "may return 0 ..." instead of "may return 1 ..."?
Re: question about Encode::decode('iso-8859-1', ...)
by graff (Chancellor) on Mar 07, 2009 at 15:52 UTC
If by "a simple scalar (unblessed, etc.)" you mean "any numeric value, or any string value that does not have the utf8 flag set", then no, there is no value of $x for which test($x) returns 0.
But as hinted by massa above, any scalar with the utf8 flag turned on will cause the script to die with a run-time error:
Wide character in subroutine entry at /.../Encode.pm line ...
because you cannot "decode()" a string into perl-internal utf8 if it is already flagged as being perl-internal utf8.
There does seem to be some suggestion of discrepancy between the Encode man page and the behavior of "eq" and "ne"; the man page says:
...to convert ISO-8859-1 data to a string in Perl’s internal format:
$string = decode("iso-8859-1", $octets);
CAVEAT: When you run "$string = decode("utf8", $octets)", then
$string may not be equal to $octets. Though they both contain the
same data, the utf8 flag for $string is on unless $octets entirely
consists of ASCII data (or EBCDIC on EBCDIC machines).
(Update: thanks to almut for catching/explaining how I misread this point.)
But the following script (when run with perl 5.8.8 on darwin) shows that the flag setting seems to have no effect on "eq" for the characters in question (the "high table" portion of 8859-1) -- every output line says "(flag diff...) decoding ... makes no difference":
#!/usr/bin/perl
use Encode qw/encode decode is_utf8/;

# Try every character in the "high table" portion of ISO-8859-1.
for my $scalar ( map { encode( 'iso-8859-1', chr( $_ )) } 0xa0 .. 0xff ) {
    printf( "decoding %s makes %s difference\n", $scalar,
            ( test( $scalar ) ? "no" : "some sort of" ));
}

sub test {
    my $x = shift;
    my $y = Encode::decode('iso-8859-1', $x);
    # Report when the utf8 flag differs between the input and the decoded output.
    print "(flag diff...) " if ( is_utf8( $x ) ne is_utf8( $y ));
    if ($x eq $y) {
        return 1;
    } else {
        return 0;
    }
}
So I wonder whether there are any perl versions or installations where the caveat actually applies to "eq" and "ne", or whether there is some other comparison operator on my version/installation that would catch the difference in the flag setting. | [reply] [d/l] [select] |
Whenever you mix a non-decoded (binary, i.e. octets) string with a text string (utf8 flag on), Perl will silently upgrade the binary string, assuming it is in ISO-8859-1 encoding. "Mixing" here refers to actions such as comparing (as with eq), regex matches, concatenation, etc.
Thus, even though the strings $x and $y are different here with respect to their internal representation (as can be shown with Devel::Peek::Dump(), e.g. a0 for $x (binary) vs. c2 a0 for $y (utf8)), this difference does not show up in the comparison result, because the binary string ($x) is implicitly upgraded for the comparison.
Also, the caveat is talking about decode("utf8",...) (not decode("iso-8859-1",...)), so it doesn't really apply here anyway...
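The implicit upgrade can be seen without Encode at all. The following is a rough sketch rather than a definitive demonstration: the helper upgraded_copy and the sample values are invented for illustration, and it relies only on the built-in utf8::upgrade and utf8::is_utf8 functions.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string stored with the UTF8 flag on.
sub upgraded_copy {
    my ($value) = @_;
    utf8::upgrade($value);
    return $value;
}

my $x = "\xA0";               # binary string, UTF8 flag off
my $y = upgraded_copy($x);    # same character, UTF8 flag on

printf "flags: x=%d y=%d\n", (utf8::is_utf8($x) ? 1 : 0), (utf8::is_utf8($y) ? 1 : 0);
printf "x eq y: %s\n", ($x eq $y ? "true" : "false");

# Concatenation upgrades as well: these two octets are the UTF-8 encoding of
# U+00E9, but Perl treats them as two ISO-8859-1 characters when mixing them
# with a flagged text string.
my $octets = "\xC3\xA9";
my $mixed  = $octets . "\x{263A}";
printf "mixed length: %d, first char: U+%04X\n", length($mixed), ord($mixed);
```

The comparison is true despite the differing flags, and the concatenation yields three characters, which is how double-encoded output usually arises.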
| [reply] [d/l] [select] |
use strict;
use warnings;
use Encode qw( encode decode );

sub test {
    my ($enc, $orig) = @_;
    my $bin = encode($enc, $orig);
    utf8::downgrade( my $bin_dn = $bin );  # UTF8=0
    utf8::upgrade(   my $bin_up = $bin );  # UTF8=1
    my $txt_dn = decode($enc, $bin_dn);
    my $txt_up = decode($enc, $bin_up);
    printf("%-11s %d %d\n",
        "$enc:",
        $txt_dn eq $orig ? 1 : 0,
        $txt_up eq $orig ? 1 : 0,
    );
}

test('iso-8859-1', "A\x{E2}");
test('UTF-8',      "A\x{E2}\x{2660}");
test('UTF-16le',   "A\x{E2}\x{2660}");

Output:
iso-8859-1: 1 1
UTF-8:      1 1
UTF-16le:   1 1
I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.
- You can have a binary/encoded string with UTF8=0
- You can have a binary/encoded string with UTF8=1
- You can have a text/decoded string with UTF8=0
- You can have a text/decoded string with UTF8=1
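The four combinations listed above can be produced directly. This is a minimal sketch, assuming the text contains only characters below 0x100 (otherwise the UTF8=0 form cannot exist); the helper with_flag is a made-up name for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: copy a string and force its internal storage format.
sub with_flag {
    my ($value, $want_flag) = @_;
    if ($want_flag) { utf8::upgrade($value) } else { utf8::downgrade($value) }
    return $value;
}

my $text   = "A\x{E2}";                # decoded text, every character below 0x100
my $binary = encode('UTF-8', $text);   # encoded octets

for my $case ( [ 'binary', $binary ], [ 'text', $text ] ) {
    my ($label, $orig) = @$case;
    for my $flag (0, 1) {
        my $copy = with_flag($orig, $flag);
        printf "%-6s UTF8=%d same content: %s\n",
            $label, (utf8::is_utf8($copy) ? 1 : 0), ($copy eq $orig ? "yes" : "no");
    }
}
```

In every case the copy compares equal to the original, so the flag says how the string is stored, not whether it is encoded or decoded.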
| [reply] [d/l] [select] |
perl -MEncode -e '$x="\x{0432}"; $y=decode("utf8",$x)'
my intention was to say that when you pass that sort of string to Encode::decode(), you get a run-time error. I had assumed that "that sort of string" was most easily understood as one whose utf8 flag was already on.
You've shown that things are actually deeper and more complicated -- I added "is_utf8()" to your script, and confirmed that Encode::decode was working without croaking, with the input string's utf8 flag on as well as off.
This is a surprising effect of the utf8::upgrade/downgrade functions, and I'm glad to know about it, though it goes a bit beyond the scope of the OP (and most applications that involve encoding issues).
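The distinction reached in this exchange can be sketched as follows: a flagged string whose characters all fit in one octet still decodes, while a string holding a character above 0xFF makes decode() die. The helper decode_survives is a hypothetical name, and the exact error text varies between Encode versions, so the sketch only records whether the call died.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: 1 if decode() succeeds on the input, 0 if it dies.
sub decode_survives {
    my ($encoding, $input) = @_;
    my $ok = eval { decode($encoding, $input); 1 };
    return $ok ? 1 : 0;
}

my $upgraded = "\xD0\xB2";    # UTF-8 octets for U+0432
utf8::upgrade($upgraded);     # UTF8 flag on, still two characters below 0x100
my $wide = "\x{0432}";        # one character above 0xFF: cannot be octets

printf "upgraded octets: %s\n", decode_survives('utf8', $upgraded) ? "decoded" : "died";
printf "wide character:  %s\n", decode_survives('utf8', $wide)     ? "decoded" : "died";
```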