The important point is that a Unicode code point is an abstract concept. It's just a number assigned to a character. A concrete instantiation of a Unicode code point requires that the number be expressed in some encoding.
Perl's internal representation of character strings (as opposed to byte strings) uses UTF-8 (with some minor differences which I will ignore here).
Therefore, if you want to match a regular expression against a character string, both the string and the regular expression must be expressed in a way that allows Perl to represent them in UTF-8. When this is done, Perl will be able to recognise characters even when they are more than one byte long.
An example may help. Consider this variable assignment:
my $currency = "\x{20AC}"; # The Euro symbol
The $currency variable now contains a UTF-8 representation of Unicode character U+20AC. Although length($currency) will return 1 (since the string is 1 character long), the string is actually stored as three bytes: E2 82 AC
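To make the character/byte distinction visible, here is a minimal sketch using the core Encode module. The variable names $chars, $bytes and $hex are just for illustration. Note that it shows the UTF-8 encoded form of the string rather than peeking at Perl's internals directly, on the assumption (as stated above) that the two are the same for this character.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $currency = "\x{20AC}";    # one character: the Euro symbol

# On a character string, length() counts characters.
my $chars = length $currency;

# encode() returns a byte string holding the UTF-8 encoding,
# so length() on the result counts bytes.
my $bytes = encode('UTF-8', $currency);

# Render each byte as two hex digits for display.
my $hex = join ' ', map { sprintf('%02X', ord $_) } split //, $bytes;

print "characters: $chars\n";
print "bytes:      ", length($bytes), " ($hex)\n";
```

Run as a script, this should report 1 character and 3 bytes (E2 82 AC).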
So this will match:
perl -le '"\x{20AC}" =~ /\x{20AC}/ && print "Match!"'
and this will not:
perl -le '"\xE2\x82\xAC" =~ /\x{20AC}/ && print "Match!"'
even though the two strings hold the same three bytes. In the first case it's a character string (one character long) and in the second case it's a byte string (three bytes long).
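If what you have is the byte string (for example, data read from a file or socket without an encoding layer), one way to get a match is to decode it into a character string first. The following is only a sketch, assuming the bytes really are valid UTF-8; decode() comes from the core Encode module, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE2\x82\xAC";    # three bytes: the UTF-8 encoding of U+20AC

# Matching against the raw byte string fails: Perl sees three
# separate characters (E2, 82, AC), none of which is U+20AC.
my $before = $octets =~ /\x{20AC}/ ? 'match' : 'no match';

# decode() turns the bytes into a one-character string.
my $chars = decode('UTF-8', $octets);
my $after = $chars =~ /\x{20AC}/ ? 'match' : 'no match';

print "byte string:      $before\n";
print "character string: $after\n";
```

The regular expression is the same in both tests; only the string's interpretation (bytes versus characters) changes.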
In reply to Re: perl unicode docs by grantm, in thread perl unicode docs by 7stud