
Perl has three internal storage formats for numbers: signed integer, unsigned integer and floating point number.

Similarly, Perl has two internal storage formats for strings (described below).

utf8::is_utf8 reports which format a string is currently stored in, and utf8::upgrade and utf8::downgrade switch a string from one format to the other without changing the string itself.

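use feature qw( say );       # `say` isn't available without this (or `use v5.10;`)
use Devel::Peek qw( Dump );  # Dump writes to STDERR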

my $s = chr(0xE9);

say length($s);              # 1
say $s eq "\xE9" ?1:0;       # 1
say utf8::is_utf8($s) ?1:0;  # 0
Dump($s);                    # PV contains E9

utf8::upgrade($s);

say length($s);              # 1 The string hasn't changed
say $s eq "\xE9" ?1:0;       # 1 
say utf8::is_utf8($s) ?1:0;  # 1 But it's now stored differently.
Dump($s);                    # PV contains C3 A9

utf8::downgrade($s);

say length($s);              # 1
say $s eq "\xE9" ?1:0;       # 1
say utf8::is_utf8($s) ?1:0;  # 0
Dump($s);                    # PV contains E9

"Downgraded" format

Identified by the SVf_UTF8 flag (returned by utf8::is_utf8($sv) in Perl and SvUTF8(sv) in C) being clear.

Each character (string element) is capable of storing an 8-bit value.

Great for bytes. Not so good for text.

Each character is stored as a single byte. This allows very efficient access of arbitrary characters and very efficient access of the length of the string (both O(1)).
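
As a rough sketch of that 8-bit limit (the variable names are mine, purely for illustration): utf8::downgrade takes an optional second argument that makes it return false instead of dying when the string can't be stored in the downgraded format.

```perl
use strict;
use warnings;

# Every character is <= 0xFF, so either format can hold this string.
my $small = "\xE9\xFF";
utf8::upgrade($small);
my $small_ok = utf8::downgrade($small, 1);  # true value as 2nd arg: fail quietly

# U+2660 doesn't fit in 8 bits, so only the upgraded format can hold it.
my $wide = "\x{2660}";
my $wide_ok = utf8::downgrade($wide, 1);    # fails; $wide is left untouched

printf "small: downgraded=%d is_utf8=%d\n",
    $small_ok ? 1 : 0, utf8::is_utf8($small) ? 1 : 0;
printf "wide:  downgraded=%d is_utf8=%d\n",
    $wide_ok ? 1 : 0, utf8::is_utf8($wide) ? 1 : 0;
```

Without the second argument, the failed downgrade would throw an exception instead.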

"Upgraded" format

Identified by the SVf_UTF8 flag (returned by utf8::is_utf8($sv) in Perl and SvUTF8(sv) in C) being set.

Each character (string element) is capable of storing a 72-bit value (in theory), a 64-bit value (on builds with uvsize of 8) or a 32-bit value (on builds with uvsize of 4).

This is more than enough to store any Unicode Code Point.

Each character is stored as its utf8 encoding. utf8 is a proprietary extension of UTF-8. Because it's a variable-length encoding, both accessing arbitrary characters and accessing the length of the string are very inefficient (O(N)), though Perl does attach the length of the string to the scalar once it becomes known, and it even attaches some character positions in some situations.
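
To see the variable-length storage without reading Dump output, here is a small sketch using bytes::length, which counts the bytes of the internal buffer. The variable names are mine, and bytes::length is meant as a debugging aid only, not something to use in real code.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

my $s = "\xE9\x{2660}";              # two characters: U+00E9 and U+2660

my $chars    = length($s);           # number of characters
my $internal = bytes::length($s);    # number of bytes in the internal buffer

# The same effect on a string that fits in either format.
my $latin = "\xE9\xE8";
my $latin_before = bytes::length($latin);   # downgraded: one byte per character
utf8::upgrade($latin);
my $latin_after  = bytes::length($latin);   # upgraded: two bytes per character here

print "characters: $chars, internal bytes: $internal\n";
print "latin: $latin_before bytes before upgrade, $latin_after after, ",
      length($latin), " characters either way\n";
```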

The Unicode Bug

Notice how I didn't say format X is used to store Y. That's because Perl imparts no semantics on the choice of storage format. Just like three stored as a signed integer and three stored as a floating point number both refer to the same number, strings consisting of the same characters but stored in different formats are still considered the same string (i.e. eq will return true).
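
A quick sketch of that equivalence, going a little beyond eq (the variable names are illustrative): two copies of the same string stored in different formats also act as the same hash key.

```perl
use strict;
use warnings;

my $down = "\xE9";       # stored downgraded
my $up   = "\xE9";
utf8::upgrade($up);      # same characters, stored upgraded

my $same_string = $down eq $up ? 1 : 0;

# Both scalars address the same hash entry.
my %seen = ( $down => 'first' );
$seen{$up} = 'second';
my $key_count = keys %seen;

print "eq: $same_string, keys: $key_count, value: $seen{$down}\n";
```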

However, some code (particularly XS modules, but even some builtin operators) intentionally or inadvertently imparts meaning on the choice of internal storage format of strings. Code that does this is said to be suffering from The Unicode Bug. utf8::upgrade and utf8::downgrade are useful when working with such buggy code.
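
Here is a minimal sketch of the bug in builtin operators, assuming an ASCII platform, no locale pragma, and a Perl without the unicode_strings feature enabled in the outer scope (so no `use v5.12;` or later). The variable names are mine.

```perl
use strict;
use warnings;

my $down = "\xE9";       # U+00E9, stored downgraded
my $up   = "\xE9";
utf8::upgrade($up);      # the same string, stored upgraded

# Without unicode_strings, these builtins are swayed by the storage format.
my $down_is_word = $down =~ /\w/ ? 1 : 0;
my $up_is_word   = $up   =~ /\w/ ? 1 : 0;
my $uc_differs   = uc($down) ne uc($up) ? 1 : 0;

my ($fixed_down, $fixed_up);
{
    # Available since 5.12; it covers regex matching since 5.14.
    use feature 'unicode_strings';
    $fixed_down = $down =~ /\w/ ? 1 : 0;
    $fixed_up   = $up   =~ /\w/ ? 1 : 0;
}

print "\\w matches: downgraded=$down_is_word upgraded=$up_is_word\n";
print "uc results differ: $uc_differs\n";
print "with unicode_strings: downgraded=$fixed_down upgraded=$fixed_up\n";
```

The two scalars compare equal with eq throughout; only the operators that peek at the storage format disagree.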

Rmpz_import is such a function. Without knowing the details, switching to SvPVbyte* is a sensible solution. (This would mean you can't receive strings with characters larger than 255, though.) Other options include upgrading the string (SvPVutf8*) and handling both formats (by checking SvUTF8(sv)).
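
Until the XS code is fixed, a caller can work around it from the Perl side. The sketch below is only an illustration: buggy_first_byte is a hypothetical stand-in I made up to mimic XS code that reads the string buffer without checking the flag (it is not Rmpz_import), and safe_first_byte is an equally hypothetical wrapper.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an XS function that reads the raw buffer
# without checking SvUTF8. "use bytes" makes substr and ord operate on
# the internal buffer, which mimics that mistake.
sub buggy_first_byte {
    my ($buf) = @_;
    use bytes;
    return ord(substr($buf, 0, 1));
}

# Caller-side workaround: hand the function a downgraded copy.
sub safe_first_byte {
    my ($buf) = @_;                       # a copy; the caller's scalar is untouched
    utf8::downgrade($buf, 1)
        or die "String contains characters above 255\n";
    return buggy_first_byte($buf);
}

my $s = "\xE9";
utf8::upgrade($s);

my $buggy = buggy_first_byte($s);   # sees the first byte of the utf8 encoding
my $safe  = safe_first_byte($s);    # sees the character's value

printf "buggy: %02X, safe: %02X\n", $buggy, $safe;
```

This has the same limitation as SvPVbyte*: strings with characters above 255 are rejected rather than passed through.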

Seeking work! You can reach me at ikegami@adaelis.com

In reply to Re: What does utf8::upgrade actually do. by ikegami
in thread What does utf8::upgrade actually do. by syphilis
