If you are experiencing a failure when reading a file opened with :encoding(UTF-8) that you are not getting with the same file opened with :utf8, then what is happening is that the file does not meet the strict requirements of UTF‑8. Currently, :utf8 is quite lax in what it will pass through. It doesn’t check for various naughtinesses. Because of that, you should never use it for input from untrusted sources, only for output.
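For input you do not trust, one option is to read raw bytes and decode them strictly yourself. The following is only a minimal sketch: the helper name `decode_strict` and the sample byte strings are made up for illustration, and it works at the Encode level rather than through a PerlIO layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not well-formed strict UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;
}

my $good = decode_strict("caf\xC3\xA9");      # well-formed UTF-8
my $bad  = decode_strict("caf\xE9 au lait");  # lone Latin-1 byte

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

Returning undef keeps the caller in charge of what to do with bad input; whether to skip, log, or abort is a design choice this sketch leaves open.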
There are some details of this toward the end of the Encode(3) manpage in the section “UTF‑8 vs. utf8 vs. UTF8”, which I provide in its entirety here below, with minor edits:
UTF‑8 vs. utf8 vs. UTF8
....We now view strings not as sequences of bytes, but as sequences
of numbers in the range 0 .. 2**32-1 (or in the case of 64‑bit
computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.
That has historically been Perl’s notion of UTF‑8, as that is how UTF‑8 was first conceived by Ken Thompson when he invented it. However, thanks to later revisions to the applicable standards, official UTF‑8 is now rather stricter than that. For example, its range is much narrower (0 .. 0x10_FFFF, covering only a meagre 21 bits instead of 32 or 64 bits) and some sequences are not allowed (e.g., those used in surrogate pairs, the 32 non‐character code points 0xFDD0 .. 0xFDEF, the last two code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non‐shortest encodings, etc.).
Now that is overruled by Larry Wall himself.
From: Larry Wall <larry@wall.org>
Date: December 04, 2004 11:51:58 JST
To: perl-unicode@perl.org
Subject: Re: Make Encode.pm support the real UTF-8
Message-Id: <20041204025158.GA28754@wall.org>
On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
: I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
: but "UTF-8" is the name of the standard and should give the
: corresponding behaviour.
For what it's worth, that's how I've always kept them straight in my
head.
Also for what it's worth, Perl 6 will mostly default to strict but
make it easy to switch back to lax.
Larry
Do you copy? As of Perl 5.8.7, UTF‑8 means the strict,
official UTF‑8, whereas utf8 means the liberal, lax version
thereof. And Encode version 2.10 or later thus groks the
difference between “UTF‑8” and “utf8”.
encode("utf8", "\x{FFFF_FFFF}", 1); # okay
encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
“UTF‑8” in the Encode module is actually a canonical name for
“utf-8-strict”. Yes, the hyphen between “UTF” and “8” is
important, because without it, Encode goes “liberal”:
find_encoding("UTF-8")->name # is 'utf-8-strict'
find_encoding("utf-8")->name # ditto. names are case insensitive
find_encoding("utf_8")->name # ditto. "_" are treated as "-"
find_encoding("UTF8")->name # is 'utf8'.
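The quoted snippets can be pulled together into something runnable. This is a sketch under a couple of assumptions: it uses U+110000 (the first code point past the official range) rather than the manpage's `\x{FFFF_FFFF}`, and it passes `FB_CROAK | LEAVE_SRC` explicitly instead of the bare `1`.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# U+110000 is one past the last code point that official UTF-8 allows.
my $beyond = chr(0x110000);
my $check  = Encode::FB_CROAK | Encode::LEAVE_SRC;

# The lax encoder accepts it and emits Perl's extended UTF-8.
my $lax_bytes = eval { encode('utf8', $beyond, $check) };
print "utf8:  ", (defined $lax_bytes ? "okay" : "croaks"), "\n";

# The strict encoder refuses, so the eval returns undef.
my $strict_bytes = eval { encode('UTF-8', $beyond, $check) };
print "UTF-8: ", (defined $strict_bytes ? "okay" : "croaks"), "\n";

# Name lookup shows which implementation each spelling selects.
for my $name ('UTF-8', 'utf-8', 'utf_8', 'UTF8') {
    printf "%-6s => %s\n", $name, find_encoding($name)->name;
}
```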
Does that help?
Thank you all, it helps indeed.
I must admit I did not read all the documentation pages related to encodings, and I was not aware of these subtle differences between ':utf8' and ':encoding(utf8)', or between 'utf8' and 'utf-8'. Now I understand why this difference appears. Conclusion: using ':encoding(utf-8)' is safer.
:utf8 and :encoding(UTF-8) do indeed do different things: the former does less strict checking, which is why the latter is recommended. See also: UTF8 related proof of concept exploit released at T-DOSE.
Update: to encourage the better practice of using :encoding(UTF-8) by default, the Perl 5 Porters have changed the core documentation in nearly every place, except where :utf8 itself is explained.
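For reference, a minimal sketch of the recommended spelling on both the write and the read side. The temporary file and the sample text are assumptions made so the example is self-contained; substitute your own path.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample text (an assumption for this sketch): six Cyrillic letters
# followed by a newline.
my $text = "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}\n";

# Write the characters out as UTF-8 bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out or die "close failed: $!";

# Read them back through the strict layer.
open my $in, '<:encoding(UTF-8)', $path or die "cannot open $path: $!";
my $line = <$in>;
close $in;

print "round trip ok\n" if $line eq $text;
printf "%d characters read\n", length $line;
```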
If you use binmode, the problem goes away. For example:
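#!/usr/bin/perl
use strict;
use warnings;

# Encode output so the text prints without "Wide character" warnings.
binmode STDOUT, ':encoding(utf8)';

my $file = '/root/Desktop/russian';

# First pass: read with the lax :utf8 layer.
open my $in1, '<:utf8', $file or die "cannot open $file: $!";
my @data1 = <$in1>;
close $in1;

# Second pass: let the open pragma supply :encoding(utf8) as the default layer.
use open ':encoding(utf8)';
open my $in2, '<', $file or die "cannot open $file: $!";
my @data2 = <$in2>;
close $in2;

# Both reads should give the same lines.
die "different size" if @data1 != @data2;
while (@data1) {
    my $s1 = shift @data1;
    my $s2 = shift @data2;
    print "1: $s1\n2: $s2\n";
    die "different data" if $s1 ne $s2;
}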
erwan:
What's the difference between @data1 & @data2? Perhaps comparing the hexdump of the values may yield a clue or two.
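A minimal sketch of that kind of comparison, assuming the two strings are already in `$s1` and `$s2`. The helper name `dump_chars` and the sample values are made up for illustration and are not taken from the original file.

```perl
use strict;
use warnings;

# Hypothetical helper: list each character of a string as U+XXXX.
sub dump_chars {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

# Made-up sample data: the second string has a stray BOM in front.
my $s1 = "\x{434}\x{430}\n";
my $s2 = "\x{FEFF}\x{434}\x{430}\n";

print "1: ", dump_chars($s1), "\n";
print "2: ", dump_chars($s2), "\n";
```

Dumping code points rather than printing the strings makes invisible differences (a BOM, a trailing carriage return, a replacement character) show up directly.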
...roboticus
When your only tool is a hammer, all problems look like your thumb.