If reading a file fails when it is opened with :encoding(UTF-8) but not when the same file is opened with :utf8, then the file does not meet the strict requirements of UTF‑8.

Currently, :utf8 is quite lax in what it will pass through. It doesn’t check for various naughtinesses. Because of that, you should never use it for input from untrusted sources, only for output.
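
One way to apply that advice to untrusted input is to read the raw bytes and decode them yourself with the strict "UTF-8" encoding. The following is only a minimal sketch: the helper name read_strict_utf8 is hypothetical, and it uses Encode's decode() with FB_CROAK rather than a PerlIO layer so that a malformed file produces an exception you can catch.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: slurp a file as raw bytes, then decode it with
# the strict "UTF-8" encoding. Dies if the bytes are not valid UTF-8.
sub read_strict_utf8 {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close($fh);
    $bytes = '' unless defined $bytes;    # empty file

    # FB_CROAK makes decode() die on the first malformed sequence;
    # LEAVE_SRC keeps $bytes unmodified.
    return decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC);
}
```

A caller would typically wrap the call in eval { ... } and treat a failure as "this file is not strict UTF-8" rather than falling back to the lax :utf8 layer.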

There are some details of this toward the end of the Encode(3) manpage in the section “UTF‑8 vs. utf8 vs. UTF8”, which I provide in its entirety here below, with minor edits:


UTF‑8 vs. utf8 vs. UTF8

....We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32−1 (or in the case of 64‑bit computers, 0 .. 2**64−1) -- Programming Perl, 3rd ed.

That has historically been Perl’s notion of UTF‑8, as that is how UTF‑8 was first conceived by Ken Thompson when he invented it. However, thanks to later revisions to the applicable standards, official UTF‑8 is now rather stricter than that. For example, its range is much narrower (0 .. 0x10_FFFF to cover only a meagre 21 bits instead of 32 or 64 bits) and some sequences are not allowed (e.g., those used in surrogate pairs, the 32 non‐character code points 0xFDD0 .. 0xFDEF, the last two code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non‐shortest encodings, etc.).

Now that is overruled by Larry Wall himself.

    From: Larry Wall <larry@wall.org>
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

Do you copy? As of Perl 5.8.7, UTF‑8 means the strict, official UTF‐8, whereas utf8 means the liberal, lax version thereof. And Encode version 2.10 or later thus groks the difference between “UTF‑8” and “utf8”.

    encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

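
To see the same difference without the program dying, the croak can be trapped with eval. This is a hedged sketch rather than anything from the manpage: try_encode is a hypothetical helper, and it uses the code point 0x110000 (one past the Unicode maximum) instead of 0xFFFF_FFFF as the out-of-range value.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: try to encode $string with the named encoding.
# Returns the octets on success, or undef if Encode croaks.
sub try_encode {
    my ($name, $string) = @_;
    my $copy = $string;    # work on a copy, defensively
    return eval { encode($name, $copy, FB_CROAK | LEAVE_SRC) };
}

my $beyond = chr(0x110000);    # one past the Unicode maximum 0x10FFFF

my $lax    = try_encode('utf8',  $beyond);    # lax encoder
my $strict = try_encode('UTF-8', $beyond);    # strict encoder

print defined $lax    ? "utf8: encoded\n"  : "utf8: croaked\n";
print defined $strict ? "UTF-8: encoded\n" : "UTF-8: croaked\n";
```
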
“UTF‑8” in the Encode module is actually a canonical name for “utf-8-strict”. Yes, the hyphen between “UTF” and “8” is important, because without it, Encode goes “liberal”:

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf_8")->name # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.
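
If you are unsure which variant a particular label will give you, you can ask Encode directly. A small sketch, assuming only find_encoding() and its name method as shown above; canonical_name is a hypothetical helper:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: map a label to Encode's canonical encoding name,
# or undef if Encode does not recognise the label.
sub canonical_name {
    my ($label) = @_;
    my $enc = find_encoding($label);
    return $enc ? $enc->name : undef;
}

# Print the canonical name for each spelling of the label.
for my $label ('UTF-8', 'utf-8', 'utf_8', 'UTF8', 'utf8') {
    my $name = canonical_name($label);
    printf "%-6s => %s\n", $label, defined $name ? $name : '(unknown)';
}
```

Checking the canonical name this way before opening a handle or calling decode() makes it obvious whether you are about to get the strict or the lax behaviour.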

Does that help?


In reply to Re: different utf8 method = different behaviour? by tchrist
in thread different utf8 method = different behaviour? by erwan
