I'm sure that anyone getting started with Unicode in Perl will find your explanation useful -- nice post. But I think this part is a bit misleading:
Note, it is not so important which encoding is used by the "internal form". It can be any. Important is only that it is "internal", so it shouldn't be passed to external entities.
First, it actually is important that the "internal form" is (very much like) UTF-8-encoded Unicode. This means that ASCII characters are still stored as single-byte ASCII, while every character, ASCII or not, is treated as a Unicode code point (*), so that:
- the Unicode character properties work as expected in regular expressions
- Unicode code point numerics (e.g. "\x{abcd}") can be used in regexes or double-quoted strings
- character normalization works according to Unicode specifications (cf. Unicode::Normalize)
- normal string sorting works according to the established Unicode code-point order
- other collations (e.g. character sort ordering for particular languages) implement Unicode-based specifications (see various Unicode::Collate modules on CPAN).
All that stuff tends to make multi-language string processing a lot easier.
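To make a few of the points in that list concrete, here is a minimal sketch. The sample strings and variable names are my own, chosen only for illustration; it assumes a reasonably modern Perl with the core Unicode::Normalize module, and it covers properties, code point escapes, normalization and default sorting (not Unicode::Collate).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

binmode(STDOUT, ':encoding(UTF-8)');

# Illustrative sample: "naive" with i-diaeresis, then Greek alpha beta gamma.
my $word = "na\x{ef}ve \x{3b1}\x{3b2}\x{3b3}";

# Unicode character properties work in regular expressions.
my @greek = $word =~ /(\p{Greek})/g;
printf "Greek letters found: %d\n", scalar @greek;

# Code point escapes work in regexes and double-quoted strings alike.
print "contains alpha\n" if $word =~ /\x{3b1}/;

# Normalization: e-acute decomposes to "e" plus a combining acute accent.
my $decomposed = NFD("\x{e9}");
printf "NFD length: %d\n", length $decomposed;
print "NFC restores the single character\n" if NFC($decomposed) eq "\x{e9}";

# Plain sort (no locale, no collation module) orders by code point.
my @unsorted = ("\x{3b2}", "z", "\x{e9}", "a");
my @sorted   = sort @unsorted;
print join(' ', map { sprintf 'U+%04X', ord } @sorted), "\n";
```

Note that the plain sort puts "z" ahead of e-acute, which is exactly why the language-specific collations in the last bullet exist.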
Second, as for passing "internal format" strings to "external entities", this isn't necessarily a problem. A "perl-internal" utf8 string can be passed for insertion into a database table via DBI without further ado, or printed directly to a file handle if the file was opened for output with the ":utf8" IO layer.
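For the file handle case, something like the following sketch shows the idea. It uses a temporary file from the core File::Temp module, and it uses the ":encoding(UTF-8)" layer, which is the stricter, validating relative of ":utf8"; the sample text and variable names are just assumptions for the demo. I have left DBI out because the details there depend on the driver in use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Illustrative text: "cafe" with e-acute, a space, and a smiley (6 characters).
my $text = "caf\x{e9} \x{263a}";

# Scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write the character string as-is; the layer does the encoding.
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

# Read the file back as raw bytes to see what actually landed on disk.
open(my $raw, '<:raw', $path) or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the same layer to get characters again.
open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
my $round_trip = do { local $/; <$in> };
close $in;

printf "characters: %d, bytes on disk: %d\n", length $text, length $bytes;
print "round trip ok\n" if $round_trip eq $text;
```

The program never calls an encode or decode function itself; the IO layer is the only place where the internal form meets the outside world.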
(* Update: well, the characters in the range U+0080 - U+00FF have some "special behaviors", but they really can be treated just like any other non-ASCII character.)
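As one example of those "special behaviors", here is a small sketch of how a character in that range can match differently depending on how the string happens to be stored. It assumes the script does NOT enable the unicode_strings feature (so no "use v5.12" or later, no "use locale", and no /u or /a flag on the regex); with that feature enabled the difference goes away.

```perl
use strict;
use warnings;
# Assumption: no "use feature 'unicode_strings'" and no "use v5.12"+ here.

# U+00E9 (e-acute), initially stored as a single native byte.
my $e_acute = "\xe9";

# Stored as a byte, the character is not treated as a word character.
my $before = $e_acute =~ /\w/ ? 1 : 0;

# Switch the internal storage to UTF-8; the character itself is unchanged.
utf8::upgrade($e_acute);
my $same = $e_acute eq "\xe9" ? 1 : 0;

# The same character now matches \w under Unicode rules.
my $after = $e_acute =~ /\w/ ? 1 : 0;

print "matches \\w before upgrade: $before\n";
print "still the same character:  $same\n";
print "matches \\w after upgrade:  $after\n";
```

Once such a string has been decoded from real input or upgraded, it behaves like any other non-ASCII character, which is the point of the update above.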