in reply to text encodings and perl

I'm sure that anyone getting started with Unicode in Perl will find your explanation useful -- nice post. But I think this part is a bit misleading:
Note, it is not so important which encoding is used by the "internal form". It can be any. Important is only that it is "internal", so it shouldn't be passed to external entities.

First, it actually is important that the "internal form" is (very much like) UTF-8. This means that ASCII characters still are single-byte ASCII characters, while every character really is a Unicode character (*), so that:

All that stuff tends to make multi-language string processing a lot easier.
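
To make the "ASCII stays single-byte" point concrete, here is a small sketch that peeks at the internal form. The variable names are just for illustration, and bytes::length is used only to look under the hood; it is not something I would use in normal string processing.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without turning on byte semantics

my $ascii = "abc";
my $wide  = "abc\x{263a}";   # contains U+263A, so it must be stored as UTF-8

# Force the UTF-8 internal form on the ASCII string for a fair comparison.
utf8::upgrade($ascii);

# length() counts characters; bytes::length() counts internal bytes.
printf "ascii: %d chars, %d internal bytes\n",
    length($ascii), bytes::length($ascii);
printf "wide:  %d chars, %d internal bytes\n",
    length($wide), bytes::length($wide);
```

The ASCII string takes one byte per character even in the UTF-8 form, while U+263A takes three bytes but still counts as one character.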

Second, as for passing "internal format" strings to "external entities", this isn't necessarily a problem. A "perl-internal" utf8 string can be passed for insertion into a database table via DBI without further ado (provided the driver and connection are set up to handle UTF-8), or printed directly to a file handle if the file was opened for output with the ":utf8" IO layer.
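
For the file handle case, a minimal sketch might look like this. The temp file and the sample string are my own assumptions, chosen only so the script is self-contained:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A character string: "caf", U+00E9, a space, and U+263A.
my $text = "caf\x{e9} \x{263a}";

# Scratch file for the demonstration; removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# The ":utf8" layer turns characters into UTF-8 bytes on output.
open(my $out, '>:utf8', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Read the file back as raw bytes to see what actually reached the disk.
open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "characters: %d, bytes on disk: %d\n", length($text), length($bytes);
```

The string is handed to print as-is, with no explicit encode step; the IO layer does the conversion at the boundary.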

(* Update: well, the characters in the range U+0080 - U+00FF have some "special behaviors", but they really can be treated just like any other non-ASCII character.)
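
One example of what I mean by "special behaviors", as a sketch: it assumes default regex semantics, that is, no locale, no unicode_strings feature and no /u modifier (the last two only exist in newer perls).

```perl
use strict;
use warnings;

# U+00E9 fits in one byte, so Perl does not store this string as
# UTF-8 internally unless something forces it to.
my $s = "\x{e9}";

my $before = ($s =~ /\w/) ? "yes" : "no";

# Switch to the UTF-8 internal form; the characters are unchanged.
utf8::upgrade($s);

my $after = ($s =~ /\w/) ? "yes" : "no";

print "matches \\w before upgrade: $before\n";
print "matches \\w after upgrade:  $after\n";
```

Under those assumptions the same character is a word character only once the string is in the UTF-8 internal form, which is why I say it can be treated like any other non-ASCII character once it is upgraded.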

Re^2: text encodings and perl
by Anonymous Monk on Nov 13, 2010 at 09:57 UTC
    it actually is important that the "internal form" is (very much like) utf8 unicode.
    I'm not sure I can follow your arguments. Which of those desirable properties wouldn't be possible if Perl had a different internal Unicode string representation? Other languages like Java or Python have chosen different internal representations, yet they are perfectly capable of doing regex matches or parsing string literals (analogous to "\x{abcd}" in Perl) into their internal form.

    It's just a matter of how things are implemented. Of course, different implementations have different pros and cons with respect to performance (speed/memory) or ease of implementation, but I don't see why UTF-8 would be required as the internal form to realize the properties you mentioned.