in reply to UTF-8 issues with Perl in general and with Spreadsheet::WriteExcel

Please note that you have an asymmetry in your code: You want to encode output, but don't decode input. This is why things go all wrong, and you see Mojibake in your output.

So, whenever you read something from STDIN, also do

    binmode STDIN, ':encoding(UTF-8)';
    # then you can do:
    while (<STDIN>) {
        # work with $_ here
    }

This decodes the input. Then add use utf8; to tell perl that your script itself is written in UTF-8 (note that ASCII is a valid subset of UTF-8).

Test that your terminal actually understands UTF-8, as described in this article ( http://perlgeek.de/en/article/encodings-and-unicode ), which might also be of general interest to you.

Spreadsheet::WriteExcel works correctly if you supply it with decoded text strings.

Update: clarified what I meant by the while loop.

Perl 6 - links to (nearly) everything that is Perl 6.

Re^2: UTF-8 issues with Perl in general and with Spreadsheet::WriteExcel
by elef (Friar) on Jul 16, 2010 at 10:19 UTC
    Thank you, that sounds convincing.

    I tried simply adding "<:utf8" to the open command in the Spreadsheet::WriteExcel script and it seems to have fixed the problem.

    I understand how the same concept applies to input from STDIN in the first script, but I don't understand the actual code. Why is the while loop necessary and what do I put inside the loop? And what's the scope of "binmode STDIN..."? All that follows or just the next instance when STDIN is used? (i.e. do I just include it once at the start of the script or before each input from STDIN? - your post seems to suggest I need to add this line every time I expect input with fancy characters.)

      do I just include it once at the start of the script or before each input from STDIN?

      Including it once is sufficient.  It adds another PerlIO layer to the file handle (STDIN here), which remains in effect for the lifetime of the file handle (or until you change it again with another binmode).
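To make the "once is enough" point concrete, here is a minimal sketch. It assumes the input really is UTF-8, and it uses an in-memory file handle (the hypothetical $in) as a stand-in for STDIN so that it runs without typing anything; with real input you would call binmode on STDIN instead.

```perl
use strict;
use warnings;

# Stand-in for STDIN: an in-memory handle over raw UTF-8 bytes
# (e-acute is \xC3\xA9 and u-umlaut is \xC3\xBC in UTF-8).
my $bytes = "caf\xC3\xA9\nM\xC3\xBCnchen\n";
open my $in, '<', \$bytes or die "Cannot open in-memory handle: $!";

# Set the layer once, right after the handle becomes available.
binmode $in, ':encoding(UTF-8)';

# Every later read is decoded; no further binmode calls are needed.
chomp(my $first  = <$in>);
chomp(my $second = <$in>);

# length() counts characters in decoded strings, not bytes.
printf "%d %d\n", length $first, length $second;
```

Both reads come back as character strings, so the lengths are 4 and 7 rather than the byte counts 5 and 8.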

Re^2: UTF-8 issues with Perl in general and with Spreadsheet::WriteExcel
by elef (Friar) on Jul 16, 2010 at 15:20 UTC
    Well, I only use STDIN to get user input, which is always just one line. I store it in a variable and use it for whatever purpose later, so I don't see how a while loop would be useful. Anyway, the more I know about this stuff, the less I understand it. I tried just adding binmode STDIN, ':encoding(UTF-8)'; to the script above, and now I get a different problem, namely error messages of this sort: utf8 "\xFB" does not map to Unicode at [script] line 8. The output file contains the character codes instead of the characters: \xFB\x{32CB8E1}\x82\xA0

    Maybe I should be using encode() and decode() but I just don't know how they relate to "use utf8", and "binmode :encoding(UTF-8)". This is a huge mess and I feel like I'm having to fight a hundred dragons just to get some damned characters to display correctly. Why everything isn't in UTF-8 in the first place is beyond me, it's 2010 for God's sake!

    Anyway, I ran the test from your link ( http://perlgeek.de/en/article/encodings-and-unicode ) as well. The results are not good: all 4 lines are mojibake. The dragons are clearly winning.
      So I don't see how a while loop would be useful

      It was an example, with the purpose of demonstrating that you need to set the IO layer only once, and not before every reading operation. Of course you are welcome to deviate from the example.

      utf8 "\xFB" does not map to Unicode at script line 8.

      That means that your input is not in UTF-8. Find out which character encoding it is, and use the name in the :encoding($encoding_name) IO layer.
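A small sketch of why that byte is rejected, and how the same byte decodes fine once a matching encoding is named. The two encoding names in the loop are only candidates picked for illustration, not a diagnosis of the actual input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A single byte 0xFB, as in the error message above.
my $raw = "\xFB";

# Strict UTF-8 decoding fails: 0xFB is not a valid UTF-8 sequence.
my $copy = $raw;   # decode() with a CHECK argument may modify its input
my $as_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
print "UTF-8: ", (defined $as_utf8 ? "ok" : "invalid"), "\n";

# The same byte is a legitimate character in single-byte encodings.
# These names are guesses; the real encoding has to be determined first.
for my $name ('ISO-8859-1', 'cp850') {
    my $char = decode($name, $raw);
    printf "%-10s U+%04X\n", $name, ord $char;
}
```

Once the right name is known, it goes into the layer in the same way, for example binmode STDIN, ':encoding(cp850)'; if the input turns out to be cp850.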

      Maybe I should be using encode() and decode() but I just don't know how they relate to "use utf8", and "binmode :encoding(UTF-8)".

      use utf8; has the same effect as adding a decode_utf8 before every string literal in your program. The :encoding(UTF-8) IO layer has the same effect as wrapping input operations in decode calls and output operations in encode calls.
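The equivalence can be checked with a short sketch. It assumes the script file itself is saved as UTF-8 (because of the non-ASCII literal) and again uses in-memory handles, with hypothetical names, in place of real files.

```perl
use strict;
use warnings;
use utf8;                       # non-ASCII literals below are decoded text
use Encode qw(decode encode);

my $bytes = "caf\xC3\xA9\n";    # raw UTF-8 bytes, as they arrive on a handle

# Variant 1: let the IO layer decode while reading.
open my $layered, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $via_layer = <$layered>;

# Variant 2: read raw bytes and decode by hand.
open my $raw, '<:raw', \$bytes or die "open failed: $!";
my $via_decode = decode('UTF-8', scalar <$raw>);

# Both give the same character string, equal to the "use utf8" literal.
print "same\n" if $via_layer eq $via_decode and $via_layer eq "café\n";

# On output the layer corresponds to encode(): characters back to bytes.
print "round trip\n" if encode('UTF-8', $via_layer) eq $bytes;
```

In practice the layer is usually the more convenient choice, since it is set once per handle instead of once per read or write.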

      The results are not good: all 4 lines are mojibake.

      Then your next step should be either to find out which character encoding your terminal works with, or set it up to use UTF-8.
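On POSIX-like systems, one way to see what the environment claims to use is to ask the locale. This is only a sketch: it assumes the core module I18N::Langinfo is available on your platform, and the terminal emulator can still be configured differently from what the locale reports.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the locale from the environment (LANG, LC_ALL, LC_CTYPE).
setlocale(LC_CTYPE, '');

# Ask which character encoding that locale uses.
my $codeset = langinfo(CODESET);
print "Locale codeset: $codeset\n";
```

If this reports something other than UTF-8, either use that name in the :encoding(...) layer for STDIN and STDOUT, or reconfigure the terminal and locale to UTF-8.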
