Re: How does the built-in function length work?
by moritz (Cardinal) on Dec 02, 2011 at 14:42 UTC
|
There are two ways that perl 5 stores strings. If the UTF8 flag is set, length returns the number of characters, not bytes.
If the flag is not set, perl assumes that the encoding is ISO-8859-1, and there the number of bytes is equal to the number of characters.
| [reply] |
|
|
I usually agree 100% with your posts, but not today.
Perl operators that deal with text (regex and uc and the like) expect Unicode code points. (Formerly: Expected Unicode code points or ASCII depending on UTF8 flag.)
Perl operators that deal with file names (open, stat, etc) expect the file names to be bytes.
Perl never assumes or expects iso-8859-1.
| [reply] |
|
|
$ echo -e "\xE4"|perl -wE 'say <> ~~ /\w/'
1
$ # this is perl 5.14.1
Since no decoding step happened here, and <> is a binary operation, and the regex match a text operation, perl has to assume a character encoding. And that happens to be ISO-8859-1. Or what do you think it is, if not ISO-8859-1?
Re: How does the built-in function length work?
by Eliya (Vicar) on Dec 02, 2011 at 14:06 UTC
|
No, it works with decoded strings (i.e. it doesn't have to guess encodings).
| [reply] |
Maybe in your string, then, the number of octets and the number of characters are the same?
The following shows that Perl does not guess the encoding of strings but assumes it:
# Perl assumes it's a Latin-1 string
> perl -MEncode -wle print+length(qq(\x{c3}\x{a4}))
2
# Perl gets told to decode the string from UTF-8
> perl -MEncode -wle print+length(decode('UTF-8',qq(\x{c3}\x{a4})))
1
# My terminal is Latin-1, which happens to match Perl's default assumption
> perl -MEncode -wle print(length(decode('Latin-1',qq(ä))))
1
Update: choroba pointed out that I mispasted the second example - now corrected.
|
|
It depends... sometimes you do have to decode them, sometimes you don't, because Perl (or some module etc.) has already done it for you.
In any case, for Perl to be able to work with character strings (as opposed to byte/octet strings), the string must have been decoded somehow into Perl's internal Unicode representation.
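To make the byte string versus character string distinction concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative, and it assumes the input really is UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two octets: the UTF-8 encoding of U+00E4 (a with diaeresis).
my $bytes = "\xC3\xA4";

# Decoding turns the octets into a one-character string.
my $chars = decode('UTF-8', $bytes);

printf "octets: %d, characters: %d\n", length($bytes), length($chars);

# Encoding goes the other way and yields the original octets again.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $bytes;
```

length reports 2 for the undecoded string and 1 for the decoded one, which matches the one-liners earlier in the thread.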
Re: How does the built-in function length work?
by ikegami (Patriarch) on Dec 02, 2011 at 19:04 UTC
|
length doesn't know anything about encodings. It counts the characters in the string, whether those characters happen to be bytes, Unicode code points or something entirely different.
If you pass encoded text to it (bytes), it will count the bytes.
If you pass decoded text to it (Unicode code points), it will count the Unicode code points.
Bytes
"\xC9\x72\x69\x63"
String: C9 72 69 63
Length: 4
Unicode code points
"\N{LATIN CAPITAL LETTER E WITH ACUTE}ric"
String: C9 72 69 63
Length: 4
Unicode code points
"\N{LATIN CAPITAL LETTER E WITH ACUTE}ric\N{RIGHT SINGLE QUOTATION MARK}s"
String: C9 72 69 63 2019 73
Length: 6
There are many ways of creating each of the above strings. I just listed one as an example. It doesn't matter how the string is created.
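For anyone who wants to check the three listings above, a small script along these lines should reproduce them. dump_chars is a hypothetical helper written for this sketch, not something from core Perl, and the explicit use charnames is only needed on perls older than 5.16:

```perl
use strict;
use warnings;
use charnames ':full';    # required for \N{...} before Perl 5.16

# Hypothetical helper: hex ordinal of each character in the string.
sub dump_chars {
    my ($s) = @_;
    return join ' ', map { sprintf('%X', ord($_)) } split(//, $s);
}

my @strings = (
    "\xC9\x72\x69\x63",
    "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric",
    "\N{LATIN CAPITAL LETTER E WITH ACUTE}ric\N{RIGHT SINGLE QUOTATION MARK}s",
);

# length counts characters, whatever those characters represent.
my @lengths = map { length($_) } @strings;

for my $i (0 .. $#strings) {
    printf "String: %s\nLength: %d\n", dump_chars($strings[$i]), $lengths[$i];
}
```

The first two strings come out identical (C9 72 69 63, length 4) even though one was written as bytes and the other as a named character.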
| [reply] [d/l] [select] |
|
|
print length("\N{LATIN CAPITAL LETTER E WITH ACUTE}ric");
and it reports a syntax error.
|
|
| [reply] [d/l] |
|
|
Constant(\N{LATIN CAPITAL LETTER E WITH ACUTE}ric) unknown: (possibly a missing "use charnames ...") at - line 1, within string
Execution of - aborted due to compilation errors.
If so, like the message says, it's because you need to add use charnames ':full';.
If not, could you be more specific? Maybe your version of Perl predates \N{}?
PS — charnames will be loaded automatically when needed in 5.16.
| [reply] [d/l] [select] |
Re: How does the built-in function length work?
by Anonymous Monk on Dec 02, 2011 at 14:53 UTC