pcouderc has asked for the wisdom of the Perl Monks concerning the following question:

Unicode and utf8 have been universal on our PCs for years. We shall never go back to Latin1, 2, 3...
And I am tired of dealing with utf8 in Perl.

Is there a way to put something somewhere and forget all words such as Unicode, utf8 and so on? LaTeX has found a way!
What are the best practices for Perl?

Thanks, o utf8 monks...

Replies are listed 'Best First'.
Re: utf8 forever OR what are the best practices ...?
by kroach (Pilgrim) on Dec 19, 2018 at 10:06 UTC
    I believe the best practice is still to handle it on a case-by-case basis, at least when dealing with external data, since you never know what you'll get. UTF-8 might be widespread, but other encodings are still in use. As for automatic support, I guess this is as far as you can really go:
    use utf8; use open ':encoding(UTF-8)';
    This will allow you to write Unicode strings in the source and will automatically handle encoding/decoding on read/write for new filehandles. Mind that if you read data from databases, external APIs or any other source that is not a filehandle, you will still need to set the specific options or decode it yourself; it will all depend on the library used. Unfortunately, there is no one magic setting which will set everything to UTF-8.
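    A minimal sketch of what those two pragmas do and do not cover, using only core modules. The temporary file and the $raw_bytes value are assumptions made up for the illustration; $raw_bytes stands in for whatever a database driver or API client would hand back.

    ```perl
    use strict;
    use warnings;
    use utf8;                       # string literals in this source file are UTF-8
    use open ':encoding(UTF-8)';    # default layer for open() calls in this lexical scope
    use Encode qw(decode);
    use File::Temp qw(tempfile);

    my $text = "café ☃";            # 6 characters

    # tempfile() opens its handle inside File::Temp, outside the scope of
    # "use open", so only the file name is used here.
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # Both open() calls below pick up the UTF-8 layer automatically.
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    my $size_on_disk = -s $path;    # size in bytes, not characters

    open my $in, '<', $path or die "Cannot read $path: $!";
    my $round_trip = <$in>;
    close $in;

    # Data that does not arrive through such a filehandle is still bytes
    # and has to be decoded by hand.
    my $raw_bytes = "caf\xC3\xA9";  # hypothetical bytes from a database or API
    my $decoded   = decode('UTF-8', $raw_bytes);

    printf "%d chars read, %d bytes on disk, %d raw bytes, %d decoded chars\n",
        length($round_trip), $size_on_disk, length($raw_bytes), length($decoded);
    # prints: 6 chars read, 9 bytes on disk, 5 raw bytes, 4 decoded chars
    ```

    Note that STDIN, STDOUT and STDERR are not affected by this form of the pragma; that would need the ':std' argument as well.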
      1- I think that "use utf8::all;" does the job... with limitations.
      2- I would be so sorry if you were right.
      I want a world with no more encodings. Or at least a world where, by default, I do not have to care about encoding. I hate encodings... This is a concept from the previous millennium, like writing numbers in octal...
      And all my files and databases are utf8. And yes, I want magic...
Re: utf8 forever OR what are the best practices ...?
by haukex (Archbishop) on Dec 19, 2018 at 10:00 UTC

    I haven't tried it myself, but perhaps utf8::all is worth a look.

      Yes, I think it is the best, to my knowledge.
      But I had to add {binmode => ':utf8'} to read_file...

        Well, let me take this opportunity to make a more philosophical point: in my experience, it's better to always specify a character encoding when converting from bytes to characters and back (I usually only make an exception in short scripts and/or when I know everything is ASCII). Even though it may seem a bit tedious and verbose, consider the alternative: if there is a default everywhere, then users will not get used to having to choose an encoding. Much confusion has been caused by programmers sometimes not even being aware of where en- and decoding processes are taking place. For example, AFAIK early versions of the Java standard libraries made this mistake: in many places where Strings were converted to and from arrays of bytes (especially on I/O), a default encoding was used, which as far as I can remember was just the platform default, for example Latin1 (causing issues when the files were actually encoded in e.g. CP1252 or UTF-8). If you look at it this way, then maybe you can see how specifying an encoding explicitly is like coding defensively.
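        To make that concrete, here is a rough sketch with the core Encode module. The helpers bytes_to_text and text_to_bytes are hypothetical names invented for this example, not part of any library; the point is only that the encoding is a required argument and that a wrong guess fails loudly instead of producing garbage.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode encode FB_CROAK);

        # Hypothetical helper: refuses to decode without a named encoding.
        sub bytes_to_text {
            my ($bytes, $encoding) = @_;
            die "No encoding given\n" unless defined $encoding;
            # FB_CROAK makes malformed input die instead of being silently
            # replaced. With a CHECK value, decode() may modify its argument,
            # so it is handed the local copy in $bytes.
            return decode($encoding, $bytes, FB_CROAK);
        }

        # Hypothetical helper: the reverse direction, same rule.
        sub text_to_bytes {
            my ($text, $encoding) = @_;
            die "No encoding given\n" unless defined $encoding;
            return encode($encoding, $text, FB_CROAK);
        }

        my $bytes = "\xE9t\xE9";    # "été" as it would arrive from a Latin-1 source

        my $text = bytes_to_text($bytes, 'ISO-8859-1');
        my $utf8 = text_to_bytes($text, 'UTF-8');

        # The same bytes are not valid UTF-8, so the wrong guess is caught here.
        my $ok = eval { bytes_to_text($bytes, 'UTF-8'); 1 };

        printf "%d characters, %d UTF-8 bytes, wrong guess %s\n",
            length($text), length($utf8), $ok ? 'accepted' : 'rejected';
        # prints: 3 characters, 5 UTF-8 bytes, wrong guess rejected
        ```

        With a silent default, the third call would have returned something, and the problem would only show up later as mojibake somewhere far from the cause.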