
> I do wish, though, that there was a simpler option like ... command line switch that would make Perl assume a unicode environment

What about the -C options? What's missing from your perspective?

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Re^8: Converting Unicode
by NERDVANA (Priest) on Dec 05, 2023 at 07:37 UTC

    So actually I had forgotten about it, and then when you reminded me I was going to say "oh right, but it doesn't apply to modules, only the main script", and then I gave it a test and actually it works..?

    perl -C -E 'use Path::Tiny; say length((path("unicode.txt")->lines)[0])'

    but that seems to disagree with the documentation:

    The io options mean that any subsequent open() (or similar I/O operations) in main program scope will have the :utf8 PerlIO layer implicitly applied to them, in other words, UTF-8 is expected from any input stream, and UTF-8 is produced to any output stream. This is just the default set via ${^OPEN}, with explicit layers in open() and with binmode() one can manipulate streams as usual. This has no effect on code run in modules.

    Though, still, the one thing missing is unicode handling of file names, such as the return values of readdir.

    Edit:

    So actually the documentation is correct: it only applies to the main program. Path::Tiny is just very smart about doing what you mean, because it looks up the default input layer in effect at the caller:

    my $binmode = $args->{binmode};
    $binmode = ( ( caller(0) )[10] || {} )->{'open<'} unless defined $binmode;
    my $fh = $self->filehandle( { locked => 1 }, "<", $binmode );
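    A minimal sketch of the same lookup outside Path::Tiny, for anyone who wants to try it. The helper name caller_input_layer is hypothetical, and the sketch assumes that a lexical "use open" exposes its input layer under the 'open<' key of the caller's hints hash, which is what the excerpt above relies on.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Path::Tiny): return the default input
# layer in effect at the caller's call site, or undef if there is none.
sub caller_input_layer {
    # Element 10 of caller() is the hints hash seen by the calling statement.
    my $hints = ( caller(0) )[10] || {};
    return $hints->{'open<'};
}

my $inside;
{
    # Lexically scoped default layer for input handles (assumption: this
    # shows up as 'open<' in the hints hash, as Path::Tiny expects).
    use open IN => ':encoding(UTF-8)';
    $inside = caller_input_layer();
}

printf "inside 'use open' scope: %s\n", $inside // '(none)';
```

    A module that does this lookup can honor the caller's preference without the caller passing a binmode explicitly, which would explain why the one-liner above appeared to work under -C.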

    So, no, -C isn't what I'm talking about. I mean a perl-wide change of defaults that makes all text (non-binmode("raw")) default to decoding UTF-8 in all modules everywhere. ...because that would fix Polyglot's Test::More problem without monkeying around with file handles private within other modules.

      > the one thing missing is unicode handling of file names, such as the return values of readdir

      I don't know much about this and how filesystems are handling encodings.

      I was kind of expecting that this is just another matter of IO layers... (?)

      Cheers Rolf

        On Unix, filenames are always just bytes, but most popular modern software assumes (depending on the LC_* environment variables) that those bytes can be decoded as UTF-8, and does so. On Windows, all paths are Unicode, but a program sees them through an 8-bit code page unless it calls the 16-bit wide-character APIs. Perl has always been fairly broken with international filenames on Windows because it uses the 8-bit APIs. It's only recently that Win10 introduced the UTF-8 Application Codepage, which lets Perl see UTF-8 via those 8-bit APIs.

        To the best of my knowledge, Perl only ever sees filenames as bytes and the user must handle all decoding and encoding. It results in a lot of ugly code. I wrote a whole investigative meditation about it, and looked at Python's handling of the problem for comparison. I also suggested solving it as part of a virtual filesystem module for perl.
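        As an illustration of the manual decoding described above, here is a minimal sketch. The helper name read_dir_utf8 is hypothetical, and the code assumes the filesystem stores names as UTF-8 bytes, which is typical on Linux but not guaranteed anywhere.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list a directory and decode each name from UTF-8
# bytes into a Perl character string. Assumes names are UTF-8 on disk;
# with the default CHECK value, malformed bytes become U+FFFD.
sub read_dir_utf8 {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "Cannot open directory '$dir': $!";
    my @names;
    while ( defined( my $bytes = readdir $dh ) ) {
        next if $bytes eq '.' || $bytes eq '..';
        push @names, decode( 'UTF-8', $bytes );
    }
    closedir $dh;
    return @names;
}

# Usage: print the decoded names of the current directory.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for sort( read_dir_utf8('.') );
```

        The reverse step is needed as well: a decoded name has to be encoded back to bytes before it is passed to open, unlink, and friends, which is where much of the ugly code comes from.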

        Meanwhile, I'm a native English speaker, and the only times I run into these problems are when filenames in my music collection use foreign characters, or the few cases where I was trying to make backups of client files whose names contain smart quotes. I can only imagine how frustrating this would be to someone with an Asian language who probably uses UTF-8 for every directory and filename. Python 3 has "solved" the problem about as much as it can be solved, and I wouldn't expect to get many new Perl users from Asian countries if this is one of the problems they run into regularly. In other words, I think fixing this ought to be a higher priority.

Re^8: Converting Unicode
by Polyglot (Chaplain) on Dec 04, 2023 at 20:38 UTC
    Well, these may not be in core, but for one thing database operations have always been tricky, and (to my knowledge) no "flag" at the top of one's code ever solved that, e.g. with DBI or DBD::mysql. Although the coder might think of database input/output as part of the overall I/O for the purposes of encoding, it isn't treated as such and must be dealt with separately. The handoff between Perl and the DB has to ensure that both are on the same page with the encoding, and for the programmer, keeping track of whether a particular item has already been encoded or decoded is always a burden. It is quite possible to overdo either one; Perl will happily allow this (to dastardly results).

    Then there are other external modules such as CGI. CGI was in core, but it was never UTF-8 by default. It also had to be given special instructions to enable and/or convert to UTF-8 for such things as HTML form input/output. There seem to be many hidden gotchas with coding for Unicode, which is why the coder must be alert and prepared for these throughout the process. "Wide characters" tend to show up when least expected, and can really make a confusing mess of things.
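    The "overdoing it" problem can be reproduced with nothing but the core Encode module. This is only a sketch of the failure mode with made-up variable names; it is not tied to DBI, DBD::mysql, or CGI.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{e9}";                    # one character: e with acute accent
my $once  = encode( 'UTF-8', $text );    # correct: the character becomes 2 bytes
my $twice = encode( 'UTF-8', $once );    # mistake: each byte is treated as a
                                         # character and encoded again

printf "characters: %d, encoded once: %d bytes, encoded twice: %d bytes\n",
    length($text), length($once), length($twice);

# Decoding the double-encoded data once does not give the original text back.
my $garbled = decode( 'UTF-8', $twice );
print "single encoding round-trips\n"       if decode( 'UTF-8', $once ) eq $text;
print "double encoding does not round-trip\n" if $garbled ne $text;
```

    Perl raises no error at any step here, which is why the mistake usually surfaces only later as garbled text in a database column or on a web page.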

    Blessings,

    ~Polyglot~

      Expecting the programming language to magically default HTML or relational databases to UTF-8 is quite a stretch.

      Like expecting that human programmers automatically default to the octal system to avoid future rounding errors with floats.

      It's just outside the realm of the programming language.

      Cheers Rolf

        No, I don't think it's a stretch. It will happen in due time. Soon everyone will be working with UTF8 by default, or some near equivalent of it (utf8mb4?). I think it might already be the standard if it weren't resisted by those slow to adopt it.

        Blessings,

        ~Polyglot~