Also keep in mind that a perl "string" does not need to be in a single encoding for all of its content.

Think of XML and CSV, where some parts can be raw binary and other parts can be encoded text.

Upgrading/downgrading the complete string before processing (either in pure perl or in XS) will cause data corruption.
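To make the mixed-content point concrete, here is a minimal sketch. The record layout (a UTF-8 text field, a comma, a raw binary field) and all variable names are made up for illustration, and it uses Encode's decode on the whole record rather than utf8::upgrade, as a related way to show what goes wrong when the complete string is treated as one encoding:

```perl
#!/usr/bin/perl

use 5.18.2;
use warnings;

use Encode qw( decode encode );

# Hypothetical record: a UTF-8 encoded text field, a separator, and a
# raw binary field that has to stay byte-for-byte intact.
my $text_bytes = encode ("utf-8", "caf\x{e9}");   # "caf\xc3\xa9"
my $blob       = "\xff\xfe\x00\x80";              # not valid UTF-8
my $record     = join "," => $text_bytes, $blob;

# Wrong: treat the complete record as a single encoding. The invalid
# bytes are replaced, so the round trip no longer matches the input.
my $all   = decode ("utf-8", $record);
my $lossy = encode ("utf-8", $all) ne $record;

# Better: split on the byte level first, then decode only the text part
my ($txt, $bin) = split m/,/ => $record, 2;
my $text = decode ("utf-8", $txt);

say "whole-record round trip lossy: ", $lossy        ? "yes" : "no";
say "text length in characters    : ", length $text;
say "blob intact                  : ", $bin eq $blob ? "yes" : "no";
```

The split only works here because neither field contains a comma byte; a real parser would follow the container format's own quoting rules.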

One more thing to keep in mind with codepoints is that Unicode allows a lot.

E.g. the glyph for U+1E2F (LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE) can reach you in UTF-8 as e1 b8 af (precomposed), or as c3 af cc 81, c3 ad cc 88, 69 cc 81 cc 88, or 69 cc 88 cc 81 (partly or fully decomposed), all rendering as the same glyph. At the time of writing, perl does not alter any of that, but once Unicode normalization rules are applied at a semantic level, your world view changes:

(update: I have since learned that the order of the diacritics can be meaningful, which is why two of the examples below do not normalize to U+1E2F)

#!/usr/bin/perl

use 5.18.2;
use warnings;

use Data::Peek;
use Unicode::Normalize qw( normalize );
use Encode             qw( encode decode );
use charnames          qw(:full);

sub dp {
    my ($tag, $dta) = @_;
    my $dp = DPeek ($dta);
    printf "%-6s: %-52s", $tag, $dp =~ s{^(\S+)\K}{" " x (26 - length $1)}er;
    utf8::is_utf8 ($dta) and
        print join " + " => map { charnames::viacode (ord) } split // => $dta;
    say "";
    } # dp

$| = 1;
foreach my $bytes (
        "\xe1\xb8\xaf",
        "\xc3\xaf\xcc\x81",
        "\xc3\xad\xcc\x88",
        "\x69\xcc\x81\xcc\x88",
        "\x69\xcc\x88\xcc\x81",
        ) {
    my $u = decode ("utf-8", $bytes);
    dp ("Bytes", $bytes);
    dp ("UTF-8", $u);
    dp ("NF$_", normalize ($_, $u)) for qw( D C KD KC );
    say "";
    }

Bytes : PV("\341\270\257"\0)
UTF-8 : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\257\314\201"\0)
UTF-8 : PV("\303\257\314\201"\0)   [UTF8 "\x{ef}\x{301}"]   LATIN SMALL LETTER I WITH DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\255\314\210"\0)
UTF-8 : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\201\314\210"\0)
UTF-8 : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\210\314\201"\0)
UTF-8 : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
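What this means for comparisons can be sketched as follows. This is only an illustration, not part of the script above: the variable names and the choice of three sample byte sequences are mine, and getCombinClass is the combining-class lookup that Unicode::Normalize exports on request. Both combining marks are in canonical combining class 230, which is why normalization is not allowed to swap them:

```perl
#!/usr/bin/perl

use 5.18.2;
use warnings;

use Encode             qw( decode );
use Unicode::Normalize qw( NFC getCombinClass );

my $ref = "\x{1e2f}";   # LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

# Marks in the same combining class keep their relative order
printf "U+%04X has combining class %d\n", $_, getCombinClass ($_)
    for 0x308, 0x301;

my %same;   # hex dump of the bytes => equal to $ref after NFC?
foreach my $bytes (
        "\xc3\xaf\xcc\x81",         # i-diaeresis + acute
        "\x69\xcc\x88\xcc\x81",     # i + diaeresis + acute
        "\x69\xcc\x81\xcc\x88",     # i + acute + diaeresis
        ) {
    my $u   = decode ("utf-8", $bytes);
    my $hex = join " " => map { sprintf "%02x", ord } split // => $bytes;
    $same{$hex} = NFC ($u) eq NFC ($ref) ? 1 : 0;
    printf "%-14s raw eq: %-3s NFC eq: %s\n", $hex,
        $u          eq $ref ? "yes" : "no",
        $same{$hex}         ? "yes" : "no";
    }
```

None of the three compares equal to U+1E2F with a plain eq. After NFC the two diaeresis-first sequences do, and the acute-first one still does not, matching the output above.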

Enjoy, Have FUN! H.Merijn

In reply to Re: What does utf8::upgrade actually do. by Tux
in thread What does utf8::upgrade actually do. by syphilis
