The important point is that a Unicode code point is an abstract concept. It's just a mapping between a character and a number. A concrete instantiation of a Unicode code point requires that the number be expressed in some encoding.

Perl's internal representation of character strings (as opposed to byte strings) uses UTF-8 (with some minor differences which I will ignore here).

Therefore, if you want to match a regular expression against a character string, both the string and the regular expression must be expressed in a way that allows Perl to represent them in UTF-8. When this is done, Perl will be able to recognise characters even when they are more than one byte long.

An example may help. Consider this variable assignment:

  my $currency = "\x{20AC}";   # The Euro symbol
The $currency variable now contains a UTF-8 representation of Unicode character U+20AC. Although length($currency) will return 1 (since the string is one character long), internally the string is stored as three bytes: E2 82 AC
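To see the character count and the byte count side by side, here is a minimal sketch. It assumes the core Encode module is used to produce the UTF-8 bytes; the $bytes variable is only there for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $currency = "\x{20AC}";                  # one character: the Euro symbol
my $bytes    = encode('UTF-8', $currency);  # the same character as UTF-8 bytes

print length($currency), "\n";              # counts characters
print length($bytes), "\n";                 # counts bytes

# Hex dump of the encoded bytes, one pair of hex digits per byte
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

On a typical Perl 5 installation this should print 1, then 3, then E2 82 AC.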

So this will match:

perl -le '"\x{20AC}" =~ /\x{20AC}/ && print "Match!"'

and this will not:

perl -le '"\xE2\x82\xAC" =~ /\x{20AC}/ && print "Match!"'

even though the two strings hold the same three bytes. The difference is that the first is a character string (one character), while the second is a byte string (three characters, one per byte).
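If the data arrives as bytes (from a file or socket, say), one way to get a match is to decode it into a character string first. This is only a sketch, assuming the input really is UTF-8 and using decode from the core Encode module; $bytes and $chars are illustrative names.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE2\x82\xAC";            # byte string: three bytes
my $chars = decode('UTF-8', $bytes);   # character string: one character, U+20AC

print "bytes: Match!\n" if $bytes =~ /\x{20AC}/;   # not printed
print "chars: Match!\n" if $chars =~ /\x{20AC}/;   # printed
```

Only the second line of output appears, because only $chars contains the character U+20AC.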

In reply to Re: perl unicode docs by grantm
in thread perl unicode docs by 7stud
