question about Encode::decode('iso-8859-1', ...)

perl5ever has asked for the wisdom of the Perl Monks concerning the following question:

Re: question about Encode::decode('iso-8859-1', ...) by leslie (Pilgrim) on Mar 07, 2009 at 05:16 UTC
Please go through the linked Encode documentation. On encoding it says: `$octets = encode("iso-8859-1", $string);` When you run $octets = encode("utf8", $string), then $octets may not be equal to $string. Though they both contain the same data, the UTF8 flag for $octets is always off. When you encode anything, the UTF8 flag of the result is always off, even when it contains a completely valid utf8 string.

On decoding it says: `$string = decode("iso-8859-1", $octets);` When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).
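A minimal sketch of what the quoted documentation describes, assuming a sample string `"caf\x{E9}"` chosen here purely for illustration. It shows that encoded octets carry no UTF8 flag and that UTF-8 octets need not compare equal to the original string, while a decode of the same octets round-trips.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical sample text: "cafe" with e-acute (U+00E9).
my $string = "caf\x{E9}";

# Encoding yields octets; the UTF8 flag on the result is off.
my $latin1 = encode('iso-8859-1', $string);   # 4 octets
my $utf8   = encode('UTF-8',      $string);   # 5 octets (E9 becomes C3 A9)

printf "latin1 octets: %d, flag: %d\n", length $latin1, utf8::is_utf8($latin1) ? 1 : 0;
printf "utf8 octets:   %d, flag: %d\n", length $utf8,   utf8::is_utf8($utf8)   ? 1 : 0;

# The UTF-8 octets are a different character sequence from the text.
print "utf8 octets eq string? ", ( $utf8 eq $string ? "yes" : "no" ), "\n";

# Decoding with the matching encoding gives the text back.
my $back = decode('UTF-8', $utf8);
print "round trip ok? ", ( $back eq $string ? "yes" : "no" ), "\n";
```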
Re: question about Encode::decode('iso-8859-1', ...) by massa (Hermit) on Mar 07, 2009 at 03:24 UTC
Yes. If `$x` is a string with any non-latin1 character ("`ς`", for instance)... then `test()` will croak 'wide char something something'... []s, HTH, Massa (κς,πμ,πλ)
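A small sketch of the failure massa describes, assuming the final sigma from the post (U+03C2) as input. The `eval` wrapper is only there so the error can be inspected; the exact wording of the message may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw( decode );

# "\x{3C2}" is the Greek final sigma: a character above 0xFF,
# so it cannot be a single ISO-8859-1 octet.
my $x = "\x{3C2}";

# decode() expects octets; trap the error instead of letting the script die.
my $y   = eval { decode('iso-8859-1', $x) };
my $err = $@;

print defined $y ? "decoded without error\n" : "decode died: $err";
```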
Re: question about Encode::decode('iso-8859-1', ...) by borisz (Canon) on Mar 07, 2009 at 02:08 UTC
No. But test($x) may return 1 if you use 'utf8' instead of 'iso-8859-1'. Boris
Re^2: question about Encode::decode('iso-8859-1', ...) by perl5ever (Pilgrim) on Mar 07, 2009 at 02:23 UTC
Thanks, but do you mean "may return 0 ..." instead of "may return 1 ..."?
Re^3: question about Encode::decode('iso-8859-1', ...) by ikegami (Patriarch) on Mar 07, 2009 at 05:42 UTC
Well, it can return both :) It can return zero for any encoding other than US-ASCII and iso-8859-1.
Re^4: question about Encode::decode('iso-8859-1', ...) by Anonymous Monk on Mar 07, 2009 at 11:30 UTC
Re^5: question about Encode::decode('iso-8859-1', ...) by ikegami (Patriarch) on Mar 07, 2009 at 12:36 UTC
Re: question about Encode::decode('iso-8859-1', ...) by graff (Chancellor) on Mar 07, 2009 at 15:52 UTC
If by "a simple scalar (unblessed, etc.)" you mean "any numeric value, or any string value that does not have the utf8 flag set", then no, there is no value of $x for which test($x) returns 0. But as hinted by massa above, any scalar with the utf8 flag turned on will cause the script to die with a run-time error:

`Wide character in subroutine entry at /.../Encode.pm line ...`

because you cannot "decode()" a string into perl-internal utf8 if it is already flagged as being perl-internal utf8.

~~There does seem to be some suggestion of discrepancy between the Encode man page and the behavior of "eq" and "ne";~~ the man page says:

...to convert ISO-8859-1 data to a string in Perl’s internal format: $string = decode("iso-8859-1", $octets); CAVEAT: When you run "$string = decode("utf8", $octets)", then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).

(Update: thanks to almut for catching/explaining how I misread this point.)

But the following script (when run with perl 5.8.8 on darwin) shows that the flag setting seems to have no effect on "eq" for the characters in question (the "high table" portion of 8859-1) -- every output line says "(flag diff...) decoding ... makes no difference":

`#!/usr/bin/perl
use Encode qw/encode decode is_utf8/;
for my $scalar ( map { encode( 'iso-8859-1', chr( $_ )) } 0xa0 .. 0xff ) {
    printf( "decoding %s makes %s difference\n", $scalar, ( test( $scalar ) ? "no" : "some sort of" ));
}
sub test {
    my $x = shift;
    my $y = Encode::decode('iso-8859-1', $x);
    print "(flag diff...) " if ( is_utf8( $x ) ne is_utf8( $y ));
    if ($x eq $y) { return 1; } else { return 0; }
}`

So I wonder whether there are any perl versions or installations where the caveat actually applies to "eq" and "ne", or whether there is some other comparison operator on my version/installation that would catch the difference in the flag setting.
Re^2: question about Encode::decode('iso-8859-1', ...) by almut (Canon) on Mar 07, 2009 at 16:58 UTC
Whenever you mix a non-decoded (binary, i.e. octets) string with a text string (utf8 flag on), Perl will silently upgrade the binary string, assuming it's in ISO-8859-1 encoding. "Mixing" here refers to actions such as comparing (as with `eq`), regex matches, concatenation, etc.

Thus, even though the strings `$x` and `$y` are different here with respect to their internal representation (as can be shown with Devel::Peek::Dump(), e.g. `a0` (`$x`, binary) vs. `c2 a0` (`$y`, utf8)), this difference does not show up in the comparison result, because the binary string (`$x`) is implicitly upgraded for the comparison.

Also, the caveat is talking about `decode("utf8",...)` (not `decode("iso-8859-1",...)`), so it doesn't really apply here anyway...
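The implicit upgrade almut describes can be sketched as follows, assuming the single octet `"\xA0"` from the post as input. The variable `$z` is a hypothetical extra copy added here to show the same effect with an explicit `utf8::upgrade`.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $x = "\xA0";                      # one octet, stored with the UTF8 flag off
my $y = decode('iso-8859-1', $x);    # the same character as decoded text

# The two scalars may be stored differently, yet eq compares characters,
# so the storage difference does not affect the result.
printf "flags: x=%d y=%d\n",
    utf8::is_utf8($x) ? 1 : 0,
    utf8::is_utf8($y) ? 1 : 0;
print "x eq y: ", ( $x eq $y ? "yes" : "no" ), "\n";

# An explicitly upgraded copy behaves the same way.
my $z = $x;
utf8::upgrade($z);                   # storage becomes c2 a0, character unchanged
print "x eq z: ", ( $x eq $z ? "yes" : "no" ), "\n";
```

Running `Devel::Peek::Dump($x)` and `Devel::Peek::Dump($z)` is one way to inspect the two buffers directly.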
Re^2: question about Encode::decode('iso-8859-1', ...) by ikegami (Patriarch) on Mar 07, 2009 at 22:45 UTC
"because you cannot "decode()" a string into perl-internal utf8 if it is already flagged as being perl-internal utf8."

Not true. It would be a bug if the internal representation mattered (by definition of internal), and the following demonstrates that it doesn't.

`use strict;
use warnings;
use Encode qw( encode decode );
sub test {
    my ($enc, $orig) = @_;
    my $bin = encode($enc, $orig);
    utf8::downgrade my $bin_dn = $bin; # UTF8=0
    utf8::upgrade my $bin_up = $bin; # UTF8=1
    my $txt_dn = decode($enc, $bin_dn);
    my $txt_up = decode($enc, $bin_up);
    printf("%-11s %d %d\n", "$enc:",
        $txt_dn eq $orig ? 1 : 0,
        $txt_up eq $orig ? 1 : 0,
    );
}
test('iso-8859-1', "A\x{E2}");
test('UTF-8', "A\x{E2}\x{2660}");
test('UTF-16le', "A\x{E2}\x{2660}");`

`iso-8859-1: 1 1
UTF-8:      1 1
UTF-16le:   1 1`

I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not.

- You can have a binary/encoded string with UTF8=0
- You can have a binary/encoded string with UTF8=1
- You can have a text/decoded string with UTF8=0
- You can have a text/decoded string with UTF8=1
Re^3: question about Encode::decode('iso-8859-1', ...) by graff (Chancellor) on Mar 08, 2009 at 04:05 UTC
"I think you were trying to say that it makes no sense to decode something that's already been decoded, but that's got nothing to do with whether it's a "perl-internal utf8" buffer or not."

Based on having seen the result of this snippet: `perl -MEncode -e '$x="\x{0432}"; $y=decode("utf8",$x)'` my intention was to say that when you pass that sort of string to Encode::decode(), you get a run-time error. I had assumed that "that sort of string" was most easily understood as one whose utf8 flag was already on.

You've shown that things are actually deeper and more complicated -- I added "is_utf8()" to your script, and confirmed that Encode::decode was working without croaking, with the input string's utf8 flag on as well as off. This is a surprising effect of the utf8::upgrade/downgrade functions, and I'm glad to know about it, though it goes a bit beyond the scope of the OP (and most applications that involve encoding issues).
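A short sketch of the check graff mentions, assuming the `"A\x{E2}"` sample from ikegami's script. It suggests that an input with the utf8 flag on still decodes, as long as every character fits in one octet; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $octets = "A\xE2";        # every character is below 0x100
utf8::upgrade($octets);      # same characters, but the UTF8 flag is now on

# Record the flag before decoding, then trap any error from decode().
my $flag_on = utf8::is_utf8($octets) ? 1 : 0;
my $text    = eval { decode('iso-8859-1', $octets) };

printf "input flag on: %d, decoded: %s\n",
    $flag_on, defined $text ? "yes" : "no";
```

The contrast with `"\x{0432}"` is that the latter holds a character above 0xFF, which cannot be treated as an octet regardless of the flag.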
Re^4: question about Encode::decode('iso-8859-1', ...) by ikegami (Patriarch) on Mar 08, 2009 at 04:53 UTC