Currently, :utf8 is quite lax in what it will pass through. It doesn’t check for various naughtinesses. Because of that, you should never use it for input from untrusted sources, only for output.
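One way to act on that advice, as a minimal sketch: read untrusted data as raw bytes and decode it explicitly with the strict "UTF-8" encoding, asking Encode to croak on anything ill-formed. The helper name `decode_untrusted` and the sample byte strings are hypothetical, made up for illustration. The `:encoding(UTF-8)` I/O layer goes through the same strict encoding, but its error handling is configured separately, so the explicit decode is shown here.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the post): strictly decode bytes that came
# from an untrusted source, returning undef instead of dying on bad input.
sub decode_untrusted {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;    # undef if the bytes were not well-formed UTF-8
}

# Well-formed input: "caf" followed by U+00E9 encoded as C3 A9.
my $good = decode_untrusted("caf\xC3\xA9");

# Ill-formed input: ED A0 80 would be the surrogate U+D800, which
# official UTF-8 forbids.
my $bad = decode_untrusted("\xED\xA0\x80");

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

Returning undef keeps the caller in charge of what to do with rejected input; letting the croak propagate would be just as reasonable.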
There are some details of this toward the end of the Encode(3) manpage in the section “UTF‑8 vs. utf8 vs. UTF8”, which I provide in its entirety here below, with minor edits:
UTF‑8 vs. utf8 vs. UTF8
    ....We now view strings not as sequences of bytes, but as sequences
    of numbers in the range 0 .. 2³²−1 (or in the case of 64‑bit
    computers, 0 .. 2⁶⁴−1) -- Programming Perl, 3rd ed.

That has historically been Perl’s notion of UTF‑8, as that is how UTF‑8 was first conceived by Ken Thompson when he invented it. However, thanks to later revisions to the applicable standards, official UTF‑8 is now rather stricter than that. For example, its range is much narrower (0 .. 0x10_FFFF to cover only a meagre 21 bits instead of 32 or 64 bits) and some sequences are not allowed (e.g., those used in surrogate pairs, the 32 non‐character code points 0xFDD0 .. 0xFDEF, the last two code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non‐shortest encodings, etc.).

Now that is overruled by Larry Wall himself.

    From: Larry Wall <larry@wall.org>
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

Do you copy? As of Perl 5.8.7, UTF‑8 means the strict, official UTF‐8, whereas utf8 means the liberal, lax version thereof. And Encode version 2.10 or later thus groks the difference between “UTF-8” and “utf8”.

    encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

“UTF-8” in the Encode module is actually a canonical name for “utf-8-strict”. Yes, the hyphen between “UTF” and “8” is important, because without it, Encode goes “liberal”:

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf_8")->name # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.
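To try the distinction for yourself, here is a small runnable sketch along the same lines as the manpage's examples. It assumes a reasonably recent Encode. The test value 0x110000 (one past the official maximum) is my choice rather than the manpage's, and the croak is trapped with `eval` so the script keeps going.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the test value is deliberately not Unicode
use Encode qw(encode find_encoding);

# One past the largest official code point (0x10_FFFF).
my $too_big = chr(0x110000);
my $check   = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Lax "utf8" encodes it anyway, using the natural 4-byte extension.
my $lax = encode('utf8', $too_big, $check);
printf "lax bytes: %s\n", uc unpack('H*', $lax);

# Strict "UTF-8" croaks, so trap the exception with eval.
my $strict = eval { encode('UTF-8', $too_big, $check) };
print defined $strict ? "strict: encoded\n" : "strict: croaked\n";

# The canonical names show which variant each spelling resolves to.
printf "%-6s => %s\n", $_, find_encoding($_)->name for qw(UTF-8 utf8);
```

The same check flags work with `decode`, which is the direction that matters for untrusted input.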
Does that help?
In reply to Re: different utf8 method = different behaviour?
by tchrist
in thread different utf8 method = different behaviour?
by erwan