These are terms I use to refer to certain elements or properties of Perl strings. I'm posting this as a reference, not to advocate them.
An element of a string, as in what substr($s,$i,1) returns. It's a 72-bit value in theory, but it's limited to the size of a UV in practice.
"Character" is the term used to document Perl functions that work on arbitrary strings (substr, index, reverse, chr, ord, etc.), and it's the term used in Wikipedia's definition of "string".
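A minimal sketch of what "string element" means in practice: each position in a string holds one number, and that number isn't limited to 255. The sample string and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Build a string of three elements: 0x41, 0xE9 and 0x1F600.
my $s = "A" . chr(0xE9) . chr(0x1F600);

# length() counts string elements, not bytes of storage.
my $count = length($s);

# substr($s, $i, 1) returns one element; ord() gives its numeric value.
my @values = map { ord(substr($s, $_, 1)) } 0 .. $count - 1;

printf "%d elements: %s\n", $count, join(" ", map { sprintf "0x%X", $_ } @values);
# prints "3 elements: 0x41 0xE9 0x1F600"
```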
The sequence of string elements in a string (irrespective of the choice of storage format used for that string).
A string element whose value is understood/expected to be in [0, 255] (irrespective of the choice of storage format used for that string).
A string element whose value is understood/expected to be a Unicode code point (irrespective of the choice of storage format used for that string).
A string whose value is understood/expected to be a sequence of values in [0, 255] (irrespective of the choice of storage format used for that string).
A string whose value is understood/expected to be a sequence of Unicode code points (irrespective of the choice of storage format used for that string).
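As a rough illustration of the difference between a string understood as code points and one understood as bytes, the sketch below uses the core Encode module. Both variables hold ordinary Perl strings; what differs is how their elements are meant to be interpreted. The sample text is an assumption chosen for the example.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Elements understood as Unicode code points: c, a, f, U+00E9.
my $text = "caf\x{E9}";

# Elements understood as bytes: the UTF-8 encoding 63 61 66 C3 A9.
my $bytes = encode('UTF-8', $text);

my $text_len  = length($text);     # 4 elements
my $bytes_len = length($bytes);    # 5 elements, each in [0, 255]

# Decoding the bytes recovers the original sequence of code points.
my $roundtrip = decode('UTF-8', $bytes);

print "code points: $text_len, bytes: $bytes_len\n";
print "round trip ", ($roundtrip eq $text ? "matches" : "differs"), "\n";
```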
The format of the PV in a string whose UTF8 flag is clear (0).
It's unambiguous, but it's quite a mouthful. Some use "byte string", but those who do tend to also use it for what I call "string of bytes".
The format of the PV in a string whose UTF8 flag is set (1).
It's unambiguous, but it's quite a mouthful. Some use "character string", but that's incorrect because all strings are made of characters by definition.
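A small sketch of the two storage formats, using the always-available utf8:: functions. The same string is stored both ways; the flag differs, but the string (its sequence of elements) does not.

```perl
use strict;
use warnings;

my $down = "\xE9";
utf8::downgrade($down);    # stored with the UTF8 flag clear

my $up = "\xE9";
utf8::upgrade($up);        # stored with the UTF8 flag set

# utf8::is_utf8() reports the flag, i.e. the storage format in use.
my $down_flag = utf8::is_utf8($down) ? 1 : 0;
my $up_flag   = utf8::is_utf8($up)   ? 1 : 0;

# The strings themselves are identical.
my $same_string = $down eq $up                 ? 1 : 0;
my $same_length = length($down) == length($up) ? 1 : 0;

print "flags: $down_flag and $up_flag\n";                       # flags: 0 and 1
print "same string: $same_string, same length: $same_length\n"; # both 1
```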
If changing the internal storage format of a string changes how a piece of code behaves, that code suffers from The Unicode Bug.
For example, the following code suffers from The Unicode Bug.
use feature qw( say );
use Inline C => <<'__EOI__';
STRLEN mylength(SV* sv) {
    STRLEN len;
    (void)SvPV(sv, len);
    return len;
}
__EOI__

$x="\xE9"; utf8::downgrade($x);
$y="\xE9"; utf8::upgrade($y);
say $x eq $y ? "equal" : "not equal";  # equal
say mylength($x);  # 1
say mylength($y);  # 2
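The same kind of problem can be shown without a C compiler. The following is a sketch in pure Perl, assuming the unicode_strings feature is not enabled (so no "use v5.12" or later) and no locale is in effect: the two strings compare equal, yet a regex match and uc treat them differently.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" (or later):
# either would change the behavior shown here.

my $x = "\xE9"; utf8::downgrade($x);   # UTF8=0 storage format
my $y = "\xE9"; utf8::upgrade($y);     # UTF8=1 storage format

my $equal  = $x eq $y         ? 1 : 0; # the strings are identical
my $x_word = $x =~ /\w/       ? 1 : 0; # does element 0xE9 match \w?
my $y_word = $y =~ /\w/       ? 1 : 0;
my $uc_eq  = uc($x) eq uc($y) ? 1 : 0; # does uc give the same result?

print "equal: $equal\n";               # equal: 1
print "\\w matches x: $x_word\n";      # \w matches x: 0
print "\\w matches y: $y_word\n";      # \w matches y: 1
print "uc results equal: $uc_eq\n";    # uc results equal: 0
```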
This usually refers to the UTF8=0 storage format, but it could also refer to a string of bytes.
This usually refers to the UTF8=1 storage format. The term is incorrect since all strings are made of characters by definition.
This usually refers to how code behaves when given a string in the UTF8=0 storage format, in distinction to how it behaves when given a string in the UTF8=1 storage format. Code that makes such a distinction suffers from The Unicode Bug.
This usually refers to how code behaves when given a string in the UTF8=1 storage format, in distinction to how it behaves when given a string in the UTF8=0 storage format. Code that makes such a distinction suffers from The Unicode Bug.
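For contrast, here is a sketch of code that makes no such distinction. It assumes Perl 5.12 or later and relies on the unicode_strings feature, which is one way (not the only one, and not something the definitions above depend on) to get uc and lc to treat both storage formats alike.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # requires Perl 5.12 or later

my $x = "\xE9"; utf8::downgrade($x);   # UTF8=0 storage format
my $y = "\xE9"; utf8::upgrade($y);     # UTF8=1 storage format

# With the feature enabled, case mapping no longer depends on the storage format.
my $uc_x    = uc($x);
my $uc_y    = uc($y);
my $uc_same = $uc_x eq $uc_y ? 1 : 0;

printf "uc gives 0x%X and 0x%X\n", ord($uc_x), ord($uc_y);  # uc gives 0xC9 and 0xC9
print "same result: $uc_same\n";                            # same result: 1
```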
Update: Changed "regardless of the value of its UTF8 flag" to something clearer in response to JavaFan's and wrog's comments.
Update: By request, added end tags for DT, DD and P elements even though they are optional.
In reply to Jargon relating to Perl strings by ikegami