Re: perl unicode docs

A lot of Unicode beginners believe that Unicode characters and UTF-8 characters are the same thing. They are not.

Unicode is a character set. It has more than a hundred thousand characters, far too many for every character to be stored in a single byte. UTF-8 is one way of storing Unicode characters as sequences of bytes, which is needed because the system deals in bytes.
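To make the distinction concrete, here's a minimal sketch using the core Encode module. The variable names are just for illustration. It takes single characters and shows how many bytes each one needs once it is stored as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two single Unicode characters, written as code points.
my $e_acute = "\x{E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $euro    = "\x{20AC}";   # U+20AC EURO SIGN

# The same characters stored as UTF-8 byte strings.
my $e_acute_bytes = encode('UTF-8', $e_acute);
my $euro_bytes    = encode('UTF-8', $euro);

# Render a byte string as space-separated hex for display.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

printf "U+00E9: %d character, %d bytes (%s)\n",
    length($e_acute), length($e_acute_bytes), hex_bytes($e_acute_bytes);
printf "U+20AC: %d character, %d bytes (%s)\n",
    length($euro), length($euro_bytes), hex_bytes($euro_bytes);
```

In both cases length reports one character for the decoded string, while the encoded form is two and three bytes respectively.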

Some important background:

Perl's documentation regarding Unicode has historically been poor, stemming from Perl 5.6's failed attempt at Unicode support.
The paragraph in question attempts to describe a bug in Perl, so confusion is natural.

The usual Perl lingo for "Unicode character scheme" is "Unicode semantics". It refers to the state in which /\w/ matches "é" and the other accented letters of iso-8859-1, and in which /\s/ matches NBSP. The regex engine enters that state based on how the string happens to be stored internally rather than on what the string contains, hence the bug. uc and similar functions are also affected.
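Here's a small sketch of the bug in action. It assumes no unicode_strings feature, no locale and no /u flag are in effect, and the variable names are mine. The two strings compare equal, yet they behave differently depending on their internal storage.

```perl
use strict;
use warnings;

# Deliberately no "use 5.012" or "use feature 'unicode_strings'" here,
# since either one would hide the behaviour being demonstrated.

my $down = "\x{E9}";        # U+00E9 (e with acute)
utf8::downgrade($down);     # force single-byte internal storage

my $up = "\x{E9}";
utf8::upgrade($up);         # force UTF-8 internal storage

my $same         = $down eq $up  ? 1 : 0;
my $down_matches = $down =~ /\w/ ? 1 : 0;
my $up_matches   = $up   =~ /\w/ ? 1 : 0;

print "strings are equal:     ", ($same         ? "yes" : "no"), "\n";
print "downgraded matches \\w: ", ($down_matches ? "yes" : "no"), "\n";
print "upgraded matches \\w:   ", ($up_matches   ? "yes" : "no"), "\n";
```

Only the upgraded copy matches /\w/, even though eq says the two strings are identical.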

Unfortunately, we're stuck with the bug. Some people expect \w to match only ASCII letters, others expect it to match any Unicode letter, and it usually works for both sets of people. Fixing the bug outright would mean \w would always do one or the other, breaking things for one group. The bug was therefore fixed via a pragma. If your program has use 5.012; or use feature 'unicode_strings';, Unicode semantics will always be on, and the paragraph becomes:

The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and works regardless of the internal encoding of the data.
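A sketch of the same check with the pragma turned on. One assumption to flag: as far as I know the feature covered uc and friends from 5.12 but only started to affect regex matching in 5.14, so this assumes a perl at least that new. The variable names are again just for illustration.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

my $down = "\x{E9}";        # U+00E9 (e with acute)
utf8::downgrade($down);     # single-byte internal storage

my $up = "\x{E9}";
utf8::upgrade($up);         # UTF-8 internal storage

# With the feature on, the internal storage should no longer matter.
my $down_matches = $down =~ /\w/ ? 1 : 0;
my $up_matches   = $up   =~ /\w/ ? 1 : 0;
my $down_uc_ok   = uc($down) eq "\x{C9}" ? 1 : 0;   # U+00C9 is the uppercase form
my $up_uc_ok     = uc($up)   eq "\x{C9}" ? 1 : 0;

print "downgraded matches \\w: ", ($down_matches ? "yes" : "no"), "\n";
print "upgraded matches \\w:   ", ($up_matches   ? "yes" : "no"), "\n";
print "uc works on downgraded: ", ($down_uc_ok  ? "yes" : "no"), "\n";
print "uc works on upgraded:   ", ($up_uc_ok    ? "yes" : "no"), "\n";
```

Both copies now behave the same way, which is what the rewritten paragraph describes.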
