Unicode was designed in an era when the concept of a "byte stream" had become ubiquitous. I see little evidence in the design of Unicode that its designers had much appreciation for the prior, messier era, or especially for the fact that what they were designing was going to destroy the then-current, comfortable "everything is a byte stream" world.

There were clearly plans for the eventual utopia of "all character strings/streams are in Unicode". If there were also plans for the uncomfortable transition period, I haven't seen much evidence of them. We are currently moving into the heart of that transition, and things will continue to get worse for a little while longer before they start getting better.

I'd expect to see leadership on this front from one or more sources of Unix operating systems (Linux, BSD, Sun, etc.). If it is there, I really haven't seen it, nor have I seen evidence of a plan for this transition in Unix. I see incomplete pieces that try to cover the "before" case (everything is in Latin-1 or whatever) and the "after" case (everything is UTF-8), but little that deals with the mixed bag one usually finds oneself in at present. For example: I want to move forward with UTF-8 data in many of my files, but tool Z can't handle that so I need to use Latin-1 data for Z, and I want to keep using filenames and command lines written in my preferred Latin-2 for now.

But dealing with this transition requires defining ways for applications to declare what type of data they are prepared to deal with (covering many different interfaces: command-line arguments, environment variables, file names, text streams, etc.).

A relatively simple approach that I'd expect to see in Unix would be to enable a Unix to be built such that all text is stored in UTF-8 (file names, environment variable names and their values, text in configuration files, etc.). One could then declare that "application Z" only understands Latin-1, so that "application Z" gets passed an environment encoded in Latin-1, has its filename accesses translated between Latin-1 and UTF-8, and has its text streams converted as well.
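As a rough illustration of the environment part of that idea, here is a minimal sketch in Perl. The helper name env_for_legacy is hypothetical (it is not an existing API), and it assumes every name and value in the incoming environment is valid UTF-8. It does nothing about file names or text streams, which would need operating-system support.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not an existing API): given an environment whose
# names and values are UTF-8 byte strings, return a copy re-encoded for a
# program that only understands $legacy (for example 'iso-8859-1').
# Dies if a value is not valid UTF-8 or cannot be represented in $legacy.
sub env_for_legacy {
    my ($env, $legacy) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my %out;
    while (my ($name, $value) = each %$env) {
        my ($n, $v) = map { encode($legacy, decode('UTF-8', $_, $check), $check) }
                      $name, $value;
        $out{$n} = $v;
    }
    return \%out;
}
```

A wrapper script could then set %ENV from env_for_legacy(\%ENV, 'iso-8859-1') just before it execs the legacy program.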

Actually, Win32 is over a decade ahead of Unix on this front. WinNT did all system work in UTF-16 and let each program declare whether it wanted to use single-byte characters or "UNICODE" characters. Programs can even do a little bit of extra work and access both the single-byte-character APIs and the native UTF-16 APIs.
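To make the two flavors concrete, this small sketch shows the byte strings each kind of call would be handed for the same name. The file name is made up, and code page 1252 is only an assumption about what the single-byte code page happens to be on a given system.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name = "r\x{E9}sum\x{E9}.txt";    # hypothetical file name, 10 characters

# What a single-byte call would receive, assuming code page 1252.
my $ansi = encode('cp1252', $name);

# What a wide call would receive: UTF-16LE (a real call would also need
# a terminating NUL character appended).
my $wide = encode('UTF-16LE', $name);

printf "single-byte: %d bytes, wide: %d bytes\n", length($ansi), length($wide);
```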

That is why it was relatively easy for me to add Unicode support for file-system operations to Win32API::File (now if only I could finish testing and integrating those changes and get them uploaded to CPAN).

But Perl has followed along with Unix and is still mostly unprepared for the ugly middle ground we often find ourselves in today. Perl is also unprepared for the eventual "all characters are UTF-8" utopia. I think that part of the proper way to prepare for that future utopia is to define much better ways of declaring what encoding should be used on the different interfaces. Perl has finally gotten a good start on that when it comes to streams (if anything, there may be too many choices, but that is a good way to figure out what the best choices should be). And Perl has an acceptable start on dealing with the dual nature of its own character strings.
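For readers who haven't run into either piece, here is a small sketch of both: an :encoding layer declared on a stream, and one string held in Perl's two internal storage formats. It uses an in-memory filehandle only so that it is self-contained; the sample text is arbitrary.

```perl
use strict;
use warnings;

my $text = "na\x{EF}ve caf\x{E9}";    # 10 characters, all below U+0100

# Streams: the :encoding layer converts between characters and bytes.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out or die "close: $!";

open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $decoded = do { local $/; <$in> };
close $in;

# Strings: the same characters can be stored internally in two ways,
# and the two copies still compare equal.
my $upgraded = $text;
utf8::upgrade($upgraded);

printf "bytes written: %d, characters read back: %d\n",
    length($buffer), length($decoded);
printf "same string in either storage format: %s\n",
    $text eq $upgraded ? 'yes' : 'no';
```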

But Perl has yet to define great ways of reconciling Unicode with filenames, environment variables, command lines, usernames, hostnames, etc. And a very simple "all interfaces want UTF-8" option seems like a wise goal to work toward.

And I completely disagree that it is a good thing to force one to separately declare UTF-8ness on every interface. I think it is good to allow such, especially now, if convenient. But UTF-8 is not some complex structure like PDF, HTML, JSON, etc. UTF-8 is very much like the choice between Latin-1 and Latin-2 (a choice of locale). It would be best if Perl could just notice "Oh, look, I'm finally running in the 'all is UTF-8' utopia" and work accordingly. I doubt anybody will ever be in a situation where "all streams are HTML", Unix usernames are HTML, Perl strings know whether they are HTML or just the backward-compatible "plain text", etc.
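Something like that "just notice" step can be approximated by hand today. The sketch below asks the locale for its character set and decodes command-line arguments only when the answer is UTF-8. Both helper names are hypothetical, and treating the locale's codeset as the truth about every interface is itself an assumption. Perl's -C switch with its L flag makes a similar locale-based decision for its own Unicode features.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(decode);

# Hypothetical helper: true (1) when the user's locale names UTF-8 as its
# character set, false (0) otherwise.
sub locale_is_utf8 {
    setlocale(LC_CTYPE, '');              # adopt the locale from the environment
    my $codeset = langinfo(CODESET());
    return (defined $codeset && $codeset =~ /\AUTF-?8\z/i) ? 1 : 0;
}

# Hypothetical helper: decode byte strings that arrived from outside
# (arguments, environment values, file names) using the given encoding.
# Dies on malformed input instead of guessing.
sub decode_external {
    my ($encoding, @bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return map { decode($encoding, $_, $check) } @bytes;
}
```

A script would then start with something like: my @args = locale_is_utf8() ? decode_external('UTF-8', @ARGV) : @ARGV;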

- tye        


In reply to Re: Pragma to handle unicode characters (roads) by tye
in thread Pragma to handle unicode characters by wanradt
