
Re^6: Converting Unicode
by NERDVANA (Priest) on Dec 03, 2023 at 07:42 UTC

    I understand the argument you're making, but I disagree about the word "compatible". I think a more accurate way of saying it is "Perl does not assume a unicode environment", "the unicode support is opt-in", and "getting Perl to treat its environment as unicode requires a lot of tedious steps".

    For contrast, Python 3 does assume a unicode environment, giving people that convenient out-of-the-box support feel, but Python 2 did not, and it caused a great deal of breakage to change that assumption. Perl will probably never change the default, in order to maintain backward compatibility. There are many environments that really still aren't Unicode, and Perl still needs to run in those. There are in fact many more environments Perl can run in than Python, because of that.

    I do wish, though, that there was a simpler option like an environment variable or command line switch that would make Perl assume a unicode environment. That option would probably break a bunch of modules and scripts, and would still need to be opt-in, but people could gradually start supporting it in the same way that we can run perl with Taint checking and see what that breaks. Most importantly though, having it be a single switch rather than dozens of switches all over would make a massive difference for convenience.
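For readers unfamiliar with the "tedious steps" mentioned above, here is a minimal sketch of the usual opt-in boilerplate. It assumes the source file, the terminal, and the command-line arguments are all UTF-8; `decode_args` is a hypothetical helper name, not something Perl provides.

```perl
use strict;
use warnings;
use utf8;                             # string literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));   # STDIN/STDOUT/STDERR, plus the default
                                      # layers for open() in this lexical scope
use Encode qw(decode);

# Command-line arguments arrive as raw bytes, so they are decoded by hand.
# Assumption: the caller's shell passes UTF-8.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

my @args = decode_args(@ARGV);
print scalar(@args), " argument(s) decoded\n";
```

Note that `use open` is lexically scoped, so none of this reaches file handles opened inside other modules, which is the gap being discussed.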

      > I do wish, though, that there was a simpler option like ... command line switch that would make Perl assume a unicode environment

      What about the -C options? What's missing from your perspective?

      Cheers Rolf
      (addicted to the Perl Programming Language :)

        So actually I had forgotten about it, and then when you reminded me I was going to say "oh right, but it doesn't apply to modules, only the main script", and then I gave it a test and actually it works..?

        perl -C -E 'use Path::Tiny; say length((path("unicode.txt")->lines)[0])'

        but that seems to disagree with the documentation:

        > The io options mean that any subsequent open() (or similar I/O operations) in main program scope will have the :utf8 PerlIO layer implicitly applied to them, in other words, UTF-8 is expected from any input stream, and UTF-8 is produced to any output stream. This is just the default set via ${^OPEN}, with explicit layers in open() and with binmode() one can manipulate streams as usual. This has no effect on code run in modules.

        Though, still, the one thing missing is unicode handling of file names, such as the return values of readdir.
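To illustrate the file name point: readdir returns names as raw bytes, so they have to be decoded by hand. This is only a sketch, assuming the filesystem stores names as UTF-8 (common on Linux, not guaranteed elsewhere); `read_dir_utf8` is a hypothetical helper, not a core function.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list a directory and decode each entry name,
# assuming the names on disk are UTF-8 encoded bytes.
sub read_dir_utf8 {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @names = map  { decode('UTF-8', $_) }        # bytes -> characters
                grep { $_ ne '.' && $_ ne '..' }    # skip self and parent
                readdir($dh);
    closedir($dh);
    return @names;
}
```

With the default settings, decode substitutes U+FFFD for malformed input, so a name that is not valid UTF-8 will not round-trip back to the original bytes.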

        Edit:

        So actually the documentation is correct: it only applies to the main program. Path::Tiny is just very smart about doing what you mean, because it calls:

        my $binmode = $args->{binmode};
        $binmode = ( ( caller(0) )[10] || {} )->{'open<'} unless defined $binmode;
        my $fh = $self->filehandle( { locked => 1 }, "<", $binmode );

        So, no, -C isn't what I'm talking about. I mean a perl-wide change of defaults that makes all text (non-binmode("raw")) default to decoding UTF-8 in all modules everywhere. ...because that would fix Polyglot's Test::More problem without monkeying around with file handles private within other modules.

        Well, these may not be in core, but for one thing database operations have always been tricky, and (to my knowledge) no "flag" at the top of one's code ever solved that, e.g. with DBI or DBD::mysql. Despite the fact that input/output from a database might be thought by the coder to be part of the overall I/O for the purposes of encoding, it isn't treated as such, and must be dealt with separately. The handoff between Perl and the DB had to ensure that both were on the same page with the encoding. For the programmer, keeping track of whether or not a particular item had been encoded or decoded was always a burden, as it was quite possible to overdo either one. Perl would happily allow this (with dastardly results).

        Then there are other external modules such as CGI, etc. CGI was in core, but it was never UTF8 by default. It also had to be given special instructions to enable and/or convert to utf8 for such things as HTML form input/output.

        There seem to be many hidden gotchas with coding for unicode, which is why the coder must be alert and prepared for these all throughout the process. "Wide characters" tend to show up when least expected, and can really make a confusing mess of things.

        Blessings,

        ~Polyglot~
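The "overdoing" described above can be reproduced in a few lines. This is a small sketch using only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{e9}";                  # one character: e-acute (U+00E9)
my $once  = encode('UTF-8', $text);    # correct: 2 bytes (C3 A9)
my $twice = encode('UTF-8', $once);    # mistake: the 2 bytes are treated as
                                       # 2 characters and encoded again
my $back  = decode('UTF-8', $twice);   # a single decode does not undo it

printf "chars=%d once=%d twice=%d after-one-decode=%d\n",
    length($text), length($once), length($twice), length($back);
# prints: chars=1 once=2 twice=4 after-one-decode=2
```

Perl raises no error at any step, which is why double-encoded data tends to surface only later as mojibake in a browser or a database column.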

Re^6: Converting Unicode
by ikegami (Patriarch) on Dec 03, 2023 at 22:36 UTC

    When something is compatible with a standard, it means it follows the standard.

    When something supports a standard, it means it follows the standard.

    They do indeed mean the same thing. Perhaps you should say what you mean instead of repeatedly insisting these two things don't mean the same thing?


    > It does not, however, qualify as being fully "compatible"--as the code must be specially adapted to use UTF8.

    Nonsense.

    My TV is fully compatible with multiple input protocols. But I still have to tell it which one to use.

    I have a device that's fully compatible with both the North American and European power grids, but a switch needs to be placed in the correct position before it's powered.

    To be fully compatible with Unicode does not require handles to provide decoded text by default, and it doesn't require handles to encode text by default. It doesn't require decoding or encoding at all, much less by default.


    > Does Perl support Unicode? In a sense yes. It allows Unicode to be used

    Supporting Unicode means a lot more than that.
