in reply to Re^9: Converting Unicode
in thread Converting Unicode

On Unix, filenames are always just bytes, but most modern software assumes (depending on the LC_* environment variables) that those bytes can be decoded as UTF-8, and does so. On Windows, all paths are Unicode, but a program sees them through an 8-bit code page unless it uses the 16-bit wide-character APIs. Perl has always been fairly broken with international filenames on Windows because perl uses the 8-bit APIs. Only recently did Win10 introduce the UTF-8 Application Codepage, which lets Perl see UTF-8 via those 8-bit APIs.

To the best of my knowledge, Perl only ever sees filenames as bytes and the user must handle all decoding and encoding. It results in a lot of ugly code. I wrote a whole investigative meditation about it, and looked at Python's handling of the problem for comparison. I also suggested solving it as part of a virtual filesystem module for perl.
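
For comparison, here is a minimal sketch of the round-trip idea behind Python 3's filename handling on Unix, where undecodable bytes are smuggled through as lone surrogates (the "surrogateescape" error handler from PEP 383). The helper names `decode_name` and `encode_name` and the sample filenames are made up for illustration; as far as I know, `os.fsdecode` and `os.fsencode` do the equivalent with the filesystem encoding on Unix.

```python
def decode_name(raw: bytes) -> str:
    # Bytes that are not valid UTF-8 become lone surrogates
    # (U+DC80..U+DCFF) instead of raising an error.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name: str) -> bytes:
    # Lone surrogates are turned back into the original bytes.
    return name.encode("utf-8", errors="surrogateescape")


# Hypothetical filenames: the same text stored as UTF-8 and as Latin-1.
utf8_name = "caf\u00e9.txt".encode("utf-8")
latin1_name = "caf\u00e9.txt".encode("latin-1")  # not valid UTF-8

for raw in (utf8_name, latin1_name):
    name = decode_name(raw)
    # The original bytes always survive the round trip.
    assert encode_name(name) == raw
    print(ascii(name))
```

The point is that the program works with text strings while the exact on-disk bytes are preserved, even for names that are not valid UTF-8. In Perl, the equivalent decoding and encoding has to be written by hand around every filesystem call.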

Meanwhile, I'm a native English speaker, and the only times I run into these problems are when filenames in my music collection use foreign characters, or in a few cases where I was trying to make backups of client files that contained smart quotes. I can only imagine how frustrating this would be to someone working in an Asian language, who probably uses non-ASCII UTF-8 in every directory and filename. Python 3 has "solved" the problem about as much as it can be solved, and I wouldn't expect to get many new Perl users from Asian countries if this is one of the problems they run into regularly. In other words, I think fixing this ought to be a higher priority.

Re^11: Converting Unicode
by Polyglot (Chaplain) on Dec 06, 2023 at 02:27 UTC
    Thank you so much for chiming in here. Perl is not fully unicode compatible yet...but people who don't use unicode regularly, particularly Asian scripts, will likely be oblivious to this and unable to understand the situation. Your points are valid and need more attention.

    I attended a week of Python training last year. At the time I smugly felt Perl to be superior in many ways. Now I'm wondering if I should pursue it more seriously. Python does have its advantages, even if I feel bothered by its strict formatting rules.

    Blessings,

    ~Polyglot~

      Perl assumes everything (except newlines) is bytes unless you tell it otherwise. Python 2 did the same. Python 3 assumes (almost) everything is UTF-8 unless you tell it otherwise. Of the three, Python 3 is arguably the most broken. I can attest to this, having occasionally had to work with Python and ISO-8859-15 files.

      Tom Christiansen's answer on Stack Overflow seems to be the definitive answer to why perl doesn't do it this way. perldoc perluniintro, perlunitut, perlunifaq, and perlunicode should give you most of what you want to know about unicode in perl.

        Are you implying that if one is using nothing other than UTF-8 (i.e. no need for ISO-8859-15), Python 3 might actually handle it just fine?

        If the "brokenness" of Python is because it cannot handle non-UTF-8 data properly, that would not impact me at all, as everything I'm doing is in UTF-8.

        Blessings,

        ~Polyglot~