A lot of Unicode beginners believe that Unicode characters and UTF-8 characters are the same thing.

Unicode is a character set. It has many thousands of characters, far too many for each one to fit in a single byte. UTF-8 is a way of storing Unicode characters as bytes, since bytes are what the system deals with.
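To make the character/byte distinction concrete, here is a minimal sketch using the core Encode module. The variable names are only for illustration; the character chosen is U+00E9 ("é").

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One Unicode character: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
my $char = "\x{E9}";

# Encoding produces the bytes that UTF-8 uses to store that character.
my $bytes = encode('UTF-8', $char);

my $char_count = length($char);     # counts characters
my $byte_count = length($bytes);    # counts bytes
my $hex        = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# Decoding the bytes gives the original character back.
my $round_trip = decode('UTF-8', $bytes);

print "characters: $char_count, bytes: $byte_count ($hex)\n";
# prints "characters: 1, bytes: 2 (C3 A9)"
```

The same character is one unit as far as Unicode is concerned and two units once it has been stored as UTF-8.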

Some important background:

The usual Perl lingo for "Unicode character scheme" is "Unicode semantics". It refers to the state in which /\w/ matches "é" and the other accented iso-8859-1 letters, and in which /\s/ matches NBSP. The regex engine behaves that way in response to an internal state of the string (how it happens to be stored) rather than its content, hence the bug. uc and similar functions are also affected.
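The following sketch shows the behaviour described above. It assumes no use 5.012 (or later), no use feature 'unicode_strings', no use locale and no /u or /a modifier is in effect. utf8::upgrade changes only how the string is stored, not which characters it contains.

```perl
use strict;
use warnings;

# The same one-character string, stored two different ways internally.
my $downgraded = "\x{E9}";          # "é", stored as a single byte
my $upgraded   = "\x{E9}";
utf8::upgrade($upgraded);           # same character, now stored as UTF-8

# The two strings compare equal...
my $same = $downgraded eq $upgraded ? 1 : 0;

# ...yet \w and uc treat them differently.
my $match_downgraded = $downgraded =~ /\w/ ? 1 : 0;
my $match_upgraded   = $upgraded   =~ /\w/ ? 1 : 0;
my $uc_downgraded    = sprintf '%02X', ord uc $downgraded;
my $uc_upgraded      = sprintf '%02X', ord uc $upgraded;

print "equal: $same\n";
print "\\w matches: downgraded=$match_downgraded upgraded=$match_upgraded\n";
print "uc gives: downgraded=$uc_downgraded upgraded=$uc_upgraded\n";
```

Under those assumptions the downgraded copy should fail to match \w and come back unchanged from uc, while the upgraded copy matches and is uppercased to U+00C9.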

Unfortunately, we're stuck with the bug. Some people expect \w to match only ASCII letters, other people expect it to match any Unicode letter, and it usually works for both sets of people. Fixing the bug outright would mean it would always do one or the other. The bug was therefore fixed via a pragma. If your program has use 5.012; or use feature 'unicode_strings';, Unicode semantics will always be on, and the paragraph becomes:

The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and works regardless of the internal encoding of the data.
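As a rough illustration of the pragma's effect, the sketch below enables the feature and checks a string in both internal formats. It assumes Perl 5.14 or later, since as far as I know the feature was extended to cover regex matching in that release. The variable names are made up for the example.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# The same character in both internal formats.
my $downgraded = "\x{E9}";
my $upgraded   = "\x{E9}";
utf8::upgrade($upgraded);

# With the feature on, the storage format should no longer matter.
my @matches = map { /\w/ ? 1 : 0 } ($downgraded, $upgraded);
my @upper   = map { sprintf '%02X', ord uc } ($downgraded, $upgraded);

print "\\w matches: @matches\n";    # prints "\w matches: 1 1"
print "uc gives: @upper\n";         # prints "uc gives: C9 C9"
```

Both copies now behave identically, which is the "works regardless of the internal encoding" behaviour the quoted paragraph describes.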


In reply to Re: perl unicode docs by ikegami
in thread perl unicode docs by 7stud
