How does the built-in function length work?

PerlOnTheWay has asked for the wisdom of the Perl Monks concerning the following question:

Re: How does the built-in function length work? by moritz (Cardinal) on Dec 02, 2011 at 14:42 UTC
There are two ways that perl 5 stores strings. If the UTF8 flag is set, length returns the number of characters, not bytes. If the flag is not set, perl assumes that the encoding is ISO-8859-1, and in that encoding the number of bytes is equal to the number of characters.
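A minimal sketch of the storage point, not taken from the thread: the same one-character string is held in both internal formats and `length` is checked for each. `utf8::upgrade` and `utf8::is_utf8` are core functions; the variable names are invented for illustration.

```perl
use strict;
use warnings;

# "\xE4" is a single character with ordinal 0xE4. Perl can store it
# internally either as one byte or as two bytes of UTF-8.
my $native   = "\xE4";
my $upgraded = "\xE4";
utf8::upgrade($upgraded);    # change the internal storage only; the string value stays the same

my $len_native   = length $native;
my $len_upgraded = length $upgraded;
my $same_string  = $native eq $upgraded ? 1 : 0;

printf "UTF8 flag %s: length %d\n", (utf8::is_utf8($native)   ? 'on' : 'off'), $len_native;
printf "UTF8 flag %s: length %d\n", (utf8::is_utf8($upgraded) ? 'on' : 'off'), $len_upgraded;
printf "strings compare equal: %s\n", $same_string ? 'yes' : 'no';
```

Both lengths should come out as 1, which suggests the flag describes storage rather than what `length` counts.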
Re^2: How does the built-in function length work? by ikegami (Patriarch) on Dec 02, 2011 at 19:30 UTC
I usually agree 100% with your posts, but not today. Perl operators that deal with text (regex and uc and the like) expect Unicode code points. (Formerly: Expected Unicode code points or ASCII depending on UTF8 flag.) Perl operators that deal with file names (open, stat, etc.) expect the file names to be bytes. Perl never assumes or expects iso-8859-1.
Re^3: How does the built-in function length work? by moritz (Cardinal) on Dec 02, 2011 at 21:02 UTC
"(Formerly: Expected Unicode code points or ASCII depending on UTF8 flag.)"
I guess that what you call "Unicode code point" is what I call "ISO-8859-1". ISO-8859-1 is simply the encoding that maps the byte values from 0 to 255 to the Unicode codepoints from 0 to 255, in that order.
"Perl never assumes or expects iso-8859-1."
`$ echo -e "\xE4" | perl -wE 'say <> ~~ /\w/'` prints `1` (this is perl 5.14.1).
Since no decoding step happened here, `<>` is a binary operation, and the regex match is a text operation, perl has to assume a character encoding. And that happens to be ISO-8859-1. Or what do you think it is, if not ISO-8859-1?
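For anyone who wants to try the `\w` question without a shell pipe, here is a hedged sketch using the explicit regex modifiers `/u` and `/a` (available from perl 5.14). It does not settle which name for the behaviour is right; it only shows that the ordinal 0xE4 is treated as a letter under Unicode rules and not under ASCII-only rules.

```perl
use 5.014;    # /u and /a need perl 5.14 or later; this also enables strict and say
use warnings;

my $char = "\xE4";    # one character with ordinal 0xE4; no decoding step involved

# /u: ordinals are treated as Unicode code points, and U+00E4 is a letter.
my $word_unicode = $char =~ /\w/u ? 1 : 0;

# /a: \w is restricted to ASCII word characters, so 0xE4 does not match.
my $word_ascii = $char =~ /\w/a ? 1 : 0;

say "matches \\w under /u: $word_unicode";    # expected: 1
say "matches \\w under /a: $word_ascii";      # expected: 0
```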
Re^4: How does the built-in function length work? by JavaFan (Canon) on Dec 02, 2011 at 21:27 UTC
Re^4: How does the built-in function length work? by ikegami (Patriarch) on Dec 02, 2011 at 21:48 UTC
Re^5: How does the built-in function length work? by moritz (Cardinal) on Dec 03, 2011 at 06:22 UTC
Re: How does the built-in function length work? by Eliya (Vicar) on Dec 02, 2011 at 14:06 UTC
No, it works with decoded strings (i.e. it doesn't have to guess encodings).
Re^2: How does the built-in function length work? by PerlOnTheWay (Monk) on Dec 02, 2011 at 14:12 UTC
But I don't have to decode the string before using it.
Re^3: How does the built-in function length work? by Corion (Patriarch) on Dec 02, 2011 at 14:21 UTC
Maybe in your string then, the number of octets and the number of characters are the same? The following shows that Perl does not guess the encoding of strings but assumes it:
Perl assumes it's a Latin-1 string: `perl -MEncode -wle print+length(qq(\x{c3}\x{a4}))` prints `2`.
Perl gets told to decode the string from UTF-8: `perl -MEncode -wle print+length(decode('UTF-8',qq(\x{c3}\x{a4})))` prints `1`.
My terminal is Latin-1, which happens to match Perl's default assumption: `perl -MEncode -wle print(length(decode('Latin-1',qq(ä))))` prints `1`.
Update: choroba pointed out that I mispasted the second example; now corrected.
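The first two one-liners can also be written as a short script, which avoids shell quoting differences. This is only a sketch of the same idea; the variable names are invented here.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xC3\xA4";                  # the two bytes that UTF-8 uses for "ä"
my $chars  = decode('UTF-8', $octets);    # one character, U+00E4

my $octet_length = length $octets;
my $char_length  = length $chars;

print "undecoded: $octet_length\n";    # expected: 2
print "decoded:   $char_length\n";     # expected: 1
```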
Re^3: How does the built-in function length work? by Eliya (Vicar) on Dec 02, 2011 at 14:21 UTC
It depends... sometimes you do have to decode them, sometimes you don't, because Perl (or some module etc.) has already done it for you. In any case, for Perl to be able to work with character strings (as opposed to byte/octet strings), the string must have been decoded somehow into Perl's internal Unicode representation.
Re^3: How does the built-in function length work? by Anonymous Monk on Dec 02, 2011 at 14:19 UTC
It also works on encoded strings, in which case it counts octets. Be aware of what you are feeding to the length function: you must keep track of the state of encoding yourself, because Perl won't.
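One way to keep track of that state is to put it in the variable name. The sketch below assumes a hypothetical `_chars` / `_utf8` naming convention (not something the thread prescribes) and shows `length` giving different answers for the same text before and after encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Naming convention assumed for this example only:
#   *_chars holds decoded text, *_utf8 holds UTF-8 encoded octets.
my $text_chars = "\x{E4}\x{20AC}";                # "ä" followed by the euro sign
my $text_utf8  = encode('UTF-8', $text_chars);    # C3 A4 E2 82 AC

my $char_count  = length $text_chars;
my $octet_count = length $text_utf8;
my $round_trip  = decode('UTF-8', $text_utf8) eq $text_chars ? 1 : 0;

print "characters: $char_count\n";     # expected: 2
print "octets:     $octet_count\n";    # expected: 5
print "round trip ok: $round_trip\n";
```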
Re: How does the built-in function length work? by ikegami (Patriarch) on Dec 02, 2011 at 19:04 UTC
`length` doesn't know anything about encodings. It counts the characters in the string, whether those characters happen to be bytes, Unicode code points or something entirely different. If you pass encoded text to it (bytes), it will count the bytes. If you pass decoded text to it (Unicode code points), it will count the Unicode code points.
Bytes `"\xC9\x72\x69\x63"`: String: `C9 72 69 63`, Length: 4
Unicode code points `"\N{LATIN CAPITAL LETTER E WITH ACUTE}ric"`: String: `C9 72 69 63`, Length: 4
Unicode code points `"\N{LATIN CAPITAL LETTER E WITH ACUTE}ric\N{RIGHT SINGLE QUOTATION MARK}s"`: String: `C9 72 69 63 2019 73`, Length: 6
There are many ways of creating each of the above strings. I just listed one as an example. It doesn't matter how the string is created.
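The String/Length listing above can be reproduced with something like the following sketch. `dump_string` is a hypothetical helper written for this example, not the code used to produce the original listing.

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{NAME} on perls before 5.16

# Hypothetical helper: the ordinal of each character, in hex, space separated.
sub dump_string {
    my ($s) = @_;
    return join ' ', map { sprintf '%X', ord } split //, $s;
}

my @examples = (
    "\xC9\x72\x69\x63",
    "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric",
    "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric\N{RIGHT SINGLE QUOTATION MARK}s",
);

for my $s (@examples) {
    printf "String: %s\nLength: %d\n\n", dump_string($s), length $s;
}
```

The first two entries are built differently but dump identically, which matches the point that it doesn't matter how the string is created.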
Re^2: How does the built-in function length work? by PerlOnTheWay (Monk) on Feb 10, 2012 at 01:37 UTC
I tried `print length("\N{LATIN CAPITAL LETTER E WITH ACUTE}ric");` and it's reporting a syntax error.
Re^3: How does the built-in function length work? by chromatic (Archbishop) on Feb 10, 2012 at 02:23 UTC
You need something like this before you can use named Unicode characters: `use charnames ':full';`
Re^3: How does the built-in function length work? by ikegami (Patriarch) on Feb 12, 2012 at 04:04 UTC
By syntax error, do you mean the following?
`Constant(\N{LATIN CAPITAL LETTER E WITH ACUTE}ric) unknown: (possibly a missing "use charnames ...") at - line 1, within string`
`Execution of - aborted due to compilation errors.`
If so, like the message says, it's because you need to add `use charnames ':full';`. If not, could you be more specific? Maybe your version of Perl predates `\N{}`? PS: charnames will be loaded automatically when needed in 5.16.
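A short sketch of the fix, plus two spellings that need no `charnames` at all. The variable names are invented for this example, and the comparison is only there to show that all three produce the same string.

```perl
use strict;
use warnings;
use charnames ':full';    # required for \N{NAME} before perl 5.16

my $by_name = "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric";
my $by_hex  = "\x{C9}ric";           # same character, written by ordinal
my $by_chr  = chr(0xC9) . 'ric';     # same again, built at run time

my $all_equal = ($by_name eq $by_hex && $by_hex eq $by_chr) ? 1 : 0;

print "length: ", length($by_name), "\n";    # expected: 4
print "all three equal: $all_equal\n";       # expected: 1
```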
Re: How does the built-in function length work? by Anonymous Monk on Dec 02, 2011 at 14:53 UTC
See length and bytes, perlunitut: Unicode in Perl, What if I don't decode?
Re^2: How does the built-in function length work? by ww (Archbishop) on Dec 02, 2011 at 15:47 UTC
Last link above appears miscoded; 404. Intended target: http://www.perlmonks.com/index.pl?node_id=551676#what_if_i_don_t_decode or, in another form: ?node_id=551676#what_if_i_don_t_decode
It appears that the form `[href://551676#what_if_i_don_t_decode|What if I don't decode?]` was used in Re: How does the built-in function length work?. It fails because of the missing question mark after the node number.