erwan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I don't know what to think about the following observation: I was tracking a bug that turned out to be caused by some invalid UTF-8 characters in my data. More precisely, it happens because the files were not opened in the same way (see below) in two distinct modules. I have solved my bug, but I wonder whether this difference is intended, and where to report it if it is not.

Here is the (simplified) code:

my $file="simple.txt";
open(FILE, "<:utf8", $file) or die "can not open $file";
my @data1 = <FILE>;
close(FILE);
use open ':encoding(utf8)';
open(FILE, $file) or die "can not open $file";
my @data2= <FILE>;
close(FILE);
die "different size" if (scalar(@data1) != scalar(@data2));
while (@data1) {
    my $s1 = shift(@data1);
    my $s2 = shift(@data2);
#   print "1: $s1\n2: $s2\n";
    die "different data" if ($s1 ne $s2);
}

and here is the output with my invalid UTF8 data:

utf8 "\xD0" does not map to Unicode at ./essai.pl line 8, <FILE> line 1.
different data at ./essai.pl line 21.

(I disabled the print because I can not write the russian chars here - anyway the faulty character is not visible)

Thanks!

Replies are listed 'Best First'.
Re: different utf8 method = different behaviour?
by tchrist (Pilgrim) on May 01, 2011 at 15:12 UTC
    If you are experiencing a failure when reading a file opened with :encoding(UTF-8) that you are not getting with the same file opened with :utf8, then what is happening is that the file does not meet the strict requirements of UTF‑8.

    Currently, :utf8 is quite lax in what it will pass through. It doesn’t check for various naughtinesses. Because of that, you should never use it for input from untrusted sources, only for output.
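    One way to check untrusted input before relying on it is to decode it strictly and treat any failure as a rejection. The following is only a minimal sketch: is_strict_utf8 is a hypothetical helper name, not something provided by Encode, and it assumes the input is a byte string already held in memory.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper (not part of Encode): returns 1 if $bytes is
# well-formed under the strict "UTF-8" encoding, 0 otherwise.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

# "caf\xC3\xA9" is the UTF-8 byte sequence for "café";
# in "a\xD0b" the lead byte \xD0 is not followed by a continuation byte.
print is_strict_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";   # valid
print is_strict_utf8("a\xD0b")      ? "valid\n" : "invalid\n";   # invalid
```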

    There are some details of this toward the end of the Encode(3) manpage in the section “UTF‑8 vs. utf8 vs. UTF8”, which I provide in its entirety here below, with minor edits:


    UTF‑8 vs. utf8 vs. UTF8

    ....We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64‑bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.
    That has historically been Perl’s notion of UTF‑8, as that is how UTF‑8 was first conceived by Ken Thompson when he invented it. However, thanks to later revisions to the applicable standards, official UTF‑8 is now rather stricter than that. For example, its range is much narrower (0 .. 0x10_FFFF to cover only a meagre 21 bits instead of 32 or 64 bits) and some sequences are not allowed (e.g., those used in surrogate pairs, the 32 non‐character code points 0xFDD0 .. 0xFDEF, the last two code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non‐shortest encodings, etc.).

    Now that is overruled by Larry Wall himself.

    From: Larry Wall <larry@wall.org>
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my
    head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

    Do you copy? As of Perl 5.8.7, UTF‑8 means the strict, official UTF‐8, whereas utf8 means the liberal, lax version thereof. And Encode version 2.10 or later thus groks the difference between “UTF‑8” and “utf8”.
    encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
    “UTF‑8” in the Encode module is actually a canonical name for “utf-8-strict”. Yes, the hyphen between “UTF” and “8” is important, because without it, Encode goes “liberal”:
    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf_8")->name # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.
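
    The name lookup and the lax/strict difference can be tried side by side. This is a sketch under stated assumptions: encodes_ok is a hypothetical helper, and chr(0x110000) is chosen here simply because it lies one past the Unicode maximum of 0x10_FFFF mentioned above.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence warnings about the non-Unicode code point
use Encode qw(encode find_encoding FB_CROAK);

# Hypothetical helper (not part of Encode): returns 1 if $string can be
# encoded under the encoding called $name, 0 if Encode croaks.
sub encodes_ok {
    my ($name, $string) = @_;
    my $copy = $string;    # encode() with a CHECK value may modify its input
    return eval { encode($name, $copy, FB_CROAK); 1 } ? 1 : 0;
}

my $beyond_unicode = chr(0x110000);    # one past U+10FFFF

for my $name ('utf8', 'UTF-8') {
    printf "%-6s resolves to %-13s encodes chr(0x110000): %s\n",
        $name,
        find_encoding($name)->name,
        encodes_ok($name, $beyond_unicode) ? 'yes' : 'no';
}
```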

    Does that help?

      Thank you all, it helps indeed. I must admit I did not read all the documentation pages related to encodings, and I was not aware of these subtle differences between ':utf8' and ':encoding(utf8)', as well as between 'utf8' and 'utf-8'. Now I understand why this difference appears. Conclusion: using ':encoding(utf-8)' is safer.
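
      That conclusion might look like this in practice. It is only a sketch: read_utf8_lines is a hypothetical helper name, and the scratch file created with File::Temp stands in for the real data file, which is not shown in this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read all lines of $path through the strict
# :encoding(UTF-8) layer, using a lexical filehandle and three-argument open.
sub read_utf8_lines {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "cannot open $path: $!";
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# Demo: write known-good UTF-8 bytes to a scratch file and read them back.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "caf\xC3\xA9\n";    # the bytes of "café" plus a newline
close($tmp);

my @lines = read_utf8_lines($path);
printf "%d line(s), first line has %d characters\n",
    scalar @lines, length $lines[0];
```

      The two bytes \xC3\xA9 come back as a single character, so the line is five characters long, newline included.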

Re: different utf8 method = different behaviour?
by moritz (Cardinal) on May 01, 2011 at 15:06 UTC
Re: different utf8 method = different behaviour?
by Khen1950fx (Canon) on May 01, 2011 at 14:29 UTC
    If you use binmode, the problem goes away. For example:
    #!/usr/bin/perl
    use strict;
    use warnings;
    binmode STDOUT, ':encoding(utf8)';
    my $file = '/root/Desktop/russian';
    # Parentheses make "or die" apply to open() rather than to $file.
    open(FILE, "<:utf8", $file) or die $!;
    my (@data1) = <FILE>;
    close(FILE);
    use open ':encoding(utf8)';
    open( FILE, $file ) or die "can not open $file";
    my (@data2) = <FILE>;
    close(FILE);
    die "different size" if scalar @data1 != scalar @data2;
    while (@data1) {
        my $s1 = shift @data1;
        my $s2 = shift @data2;
        print "1: $s1\n2: $s2\n";
        die "different data" if $s1 ne $s2;
    }

      Actually... no. With my data I still get the same output (the program still dies with the "different data" message).

        erwan:

        What's the difference between @data1 & @data2? Perhaps comparing the hexdump of the values may yield a clue or two.
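
        A comparison along those lines could be sketched as follows. code_points is a hypothetical helper name, and the two sample strings are made up for illustration; they are not the actual contents of @data1 and @data2.

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as one U+XXXX token per character,
# so that two strings which print alike can be compared character by character.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# Made-up sample values standing in for one element of each array.
my $s1 = "a\x{434}b";
my $s2 = "a\x{435}b";

print "1: ", code_points($s1), "\n";    # 1: U+0061 U+0434 U+0062
print "2: ", code_points($s2), "\n";    # 2: U+0061 U+0435 U+0062
```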

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.