
These are terms I use to refer to certain elements or properties of Perl strings. I'm posting this as a reference, not to advocate them.

Basics

Character, or string element

An element of a string, as in what substr($s,$i,1) returns. It's a 72-bit value in theory, but it's limited to the size of a UV in practice.

"Character" is the term used to document Perl functions that work on arbitrary strings (substr, index, reverse, chr, ord, etc.), and it's the term used in Wikipedia's definition of "string".
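
A minimal sketch of the definition above. The sample string is an arbitrary choice for illustration; it mixes an element below 256 with one above it to show that substr and ord treat both simply as string elements.

```perl
use strict;
use warnings;

# Hypothetical sample string: three elements with values 0xE9, 0x2660 and 0x41.
my $s = "\xE9\x{2660}A";

# Each element is what substr($s, $i, 1) returns; ord gives its numeric value.
my @values = map { ord(substr($s, $_, 1)) } 0 .. length($s) - 1;

# Prints: 3 elements: 0xE9 0x2660 0x41
printf "%d elements: %s\n",
   scalar(@values),
   join(" ", map { sprintf "0x%X", $_ } @values);
```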

String value

The sequence of string elements in a string (irrespective of the choice of storage format used for that string).

String element semantics

Byte: A string element whose value is understood/expected to be in [0, 255] (irrespective of the choice of storage format used for that string).
Code point, or Unicode code point: A string element whose value is understood/expected to be a Unicode code point (irrespective of the choice of storage format used for that string).

String semantics

Bytes, or string of bytes: A string whose value is understood/expected to be a sequence of values in [0, 255] (irrespective of the choice of storage format used for that string).
Text, or decoded text: A string whose value is understood/expected to be a sequence of Unicode code points (irrespective of the choice of storage format used for that string).
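
As an illustration of the two kinds of strings, here is a sketch that uses the core Encode module to go from text to bytes and back. The choice of UTF-8 as the encoding and the sample code points are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Text: each element is understood to be a Unicode code point (U+00E9, U+2660).
my $text = "\x{E9}\x{2660}";

# Bytes: each element is in [0, 255]. Here, the UTF-8 encoding of $text.
my $bytes = encode('UTF-8', $text);

printf "text:  %d elements\n", length($text);    # 2
printf "bytes: %d elements\n", length($bytes);   # 5 (C3 A9 E2 99 A0)

# Decoding the bytes gives back the original text.
my $round_trip = decode('UTF-8', $bytes);
print $round_trip eq $text ? "round trip ok\n" : "round trip failed\n";
```

Note that nothing in either string records which of the two it is. The distinction lives in what the code expects, which is why the definitions say "understood/expected".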

String storage formats

UTF8=0 storage format

The format of the PV in a string whose UTF8 flag is clear (0).

It's unambiguous, but it's quite a mouthful. Some use "byte string", but those who do tend to also use it for what I call "string of bytes".

UTF8=1 storage format

The format of the PV in a string whose UTF8 flag is set (1).

It's unambiguous, but it's quite a mouthful. Some use "character string", but that's incorrect because all strings are made of characters by definition.
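
The storage format can be inspected with utf8::is_utf8 and changed with utf8::upgrade and utf8::downgrade. The following is only a sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $s = "\xE9";                                  # every element is <= 255
my $flag_before = utf8::is_utf8($s) ? 1 : 0;     # 0: UTF8=0 storage format

utf8::upgrade($s);                               # switch the storage format only
my $flag_after  = utf8::is_utf8($s) ? 1 : 0;     # 1: UTF8=1 storage format
my $same_value  = $s eq "\xE9" ? 1 : 0;          # 1: the string value is unchanged

# A string with an element above 255 can only use the UTF8=1 storage format.
# The second argument (FAIL_OK) makes downgrade return false instead of dying.
my $wide = "\x{2660}";
my $downgraded = utf8::downgrade($wide, 1) ? 1 : 0;   # 0

print "UTF8 flag before/after upgrade: $flag_before/$flag_after\n";
print "value preserved: $same_value, wide string downgraded: $downgraded\n";
```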

Other

The Unicode Bug

If changing the internal storage format of a string changes how a piece of code behaves, that code suffers from The Unicode Bug.

For example, the following code suffers from The Unicode Bug.

use feature qw( say );

use Inline C => <<'__EOI__';

   STRLEN mylength(SV* sv) {
      STRLEN len;
      (void)SvPV(sv, len);
      return len;
   }

__EOI__

my $x = "\xE9"; utf8::downgrade($x);     # UTF8=0 storage format
my $y = "\xE9"; utf8::upgrade($y);       # UTF8=1 storage format

say $x eq $y ? "equal" : "not equal";    # equal
say mylength($x);                        # 1
say mylength($y);                        # 2
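
The same kind of problem can be shown without XS. This sketch relies on regex matching without the unicode_strings feature and without the /u modifier, where \w depends on the storage format for elements in the range 128 to 255. It is a different instance of the bug from the one above, not a restatement of it.

```perl
use strict;
use warnings;
# Deliberately NOT enabled: use feature 'unicode_strings';

my $x = "\xE9"; utf8::downgrade($x);     # UTF8=0 storage format
my $y = "\xE9"; utf8::upgrade($y);       # UTF8=1 storage format

# Same string value, yet the match result differs.
my $x_word = $x =~ /\w/ ? 1 : 0;         # 0
my $y_word = $y =~ /\w/ ? 1 : 0;         # 1

# With /u (Perl 5.14+), the result no longer depends on the storage format.
my $x_word_u = $x =~ /\w/u ? 1 : 0;      # 1
my $y_word_u = $y =~ /\w/u ? 1 : 0;      # 1

print $x eq $y ? "equal\n" : "not equal\n";     # equal
print "without /u: $x_word $y_word\n";          # without /u: 0 1
print "with /u:    $x_word_u $y_word_u\n";      # with /u:    1 1
```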

Other related terms I've seen used

Byte string: This usually refers to the UTF8=0 storage format, but it could also refer to a string of bytes.
Character string: This usually refers to the UTF8=1 storage format. The term is incorrect since all strings are made of characters by definition.
Byte semantics: This usually refers to how code behaves when given a string in the UTF8=0 storage format, in distinction to how it behaves when given a string in the UTF8=1 storage format. Code that makes such a distinction suffers from The Unicode Bug.
Character semantics: This usually refers to how code behaves when given a string in the UTF8=1 storage format, in distinction to how it behaves when given a string in the UTF8=0 storage format. Code that makes such a distinction suffers from The Unicode Bug.

Update: Changed "regardless of the value of its UTF8 flag" to something clearer in response to JavaFan's and wrog's comments.
Update: By request, added end tags for DT, DD and P elements even though they are optional.

In reply to Jargon relating to Perl strings by ikegami
