in reply to Re: Mess with UTF-8, utf8 and raw encoding on live working platform
in thread Mess with UTF-8, utf8 and raw encoding on live working platform

Thanks for the answer, but first I need to understand when I need to write :encoding(UTF-8) and when I don't.
For example, I wrote a script that takes info from all users and puts it in an index.
open FH, "<:encoding(UTF-8)", "$dir/private_data";
(process the data)
open FH, ">:encoding(UTF-8)", $path_index;
After that, another web script takes the data from this index file:
open FH, "<:encoding(UTF-8)", $path_index;
It then prints all the data into an HTML page. All the German characters with dots (umlauts) and other special characters are printed as a question mark inside a polygon.
But if I open the file the regular way
open FH, "<", $path_index;
all the characters show up as they should... So I don't really understand why this happens.

Re^3: Mess with UTF-8, utf8 and raw encoding on live working platform
by moritz (Cardinal) on Jun 02, 2011 at 14:51 UTC
    There are two types of string variables in Perl. One type contains text, the other contains bytes.

    Files contain bytes. If you want to write a text string to a file, the string needs to be converted from text to bytes. That's what the >:encoding(UTF-8) does.

    OTOH if you want to read data from a file, without the :encoding(UTF-8) the string will contain bytes, and with it the string contains text.
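
    A minimal sketch of the text/bytes distinction, using the core Encode module. The sample string is only an illustration, and the "double encoding" step at the end shows one way the two types can get mixed up:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A text string: 6 characters, the second one is U+00FC (u with umlaut).
my $text = "M\x{FC}ller";

# Converting text to bytes: in UTF-8 the umlaut takes 2 bytes.
my $bytes = encode('UTF-8', $text);

printf "text length:  %d\n", length $text;
printf "bytes length: %d\n", length $bytes;

# Converting the bytes back gives the original text.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $text;

# Mixing the two up: encoding a string that is already bytes treats
# each byte as a character and encodes it again ("double encoding").
my $double = encode('UTF-8', $bytes);
printf "double-encoded length: %d\n", length $double;
```

    The text string has 6 characters, the byte string has 7 bytes, and the double-encoded string has 9, which is the kind of damage that shows up as broken characters in output.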

    Sad thing is, you can't reliably see from looking at a string if it's text or bytes. And if you mix the two up, you will see some broken output.

    So if you use text strings internally in your program, you need the :encoding(UTF-8) both for reading and writing files, and you need to decode all other byte strings that come into your program (for example from %ENV or @ARGV).
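
    As a rough sketch of that rule: the script below writes a text string through an encoding layer, reads it back both with and without the layer, and decodes @ARGV. The temporary file and the assumption that command-line arguments arrive as UTF-8 bytes are illustrative, not taken from the original scripts.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Assumption: command-line arguments arrive as UTF-8 bytes.
my @args = map { decode('UTF-8', $_) } @ARGV;

# "Gruesse" with u-umlaut and sharp s: a text string of 5 characters.
my $text = "Gr\x{FC}\x{DF}e";

# Illustrative temporary file standing in for the index file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer converts text to UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# Reading with the layer: the string contains text again.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $decoded = <$in>;
close $in;

# Reading without the layer: the string contains the raw bytes.
open my $raw, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = <$raw>;
close $raw;

printf "decoded length: %d\n", length $decoded;
printf "raw length:     %d\n", length $bytes;
```

    The decoded string has 5 characters and the raw one has 7 bytes, even though both came from the same file.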

    OTOH some modules already decode strings for you (for example XML and JSON parsers), so you need to know which modules do that.
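
    For instance, a small sketch with the core JSON::PP module, whose decode_json function expects UTF-8 encoded bytes and returns already-decoded text. The JSON document here is made up for illustration:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# A JSON document as UTF-8 bytes; \xC3\xBC is the encoded u-umlaut.
my $json_bytes = qq({"name":"M\xC3\xBCller"});

# decode_json does the UTF-8 decoding itself, so the values in $data
# are text strings.
my $data = decode_json($json_bytes);

printf "name length: %d\n", length $data->{name};
```

    The name comes out as 6 characters, so it is already text. Decoding it a second time would be a mistake.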

      I found in a Russian article that I should add this line
      use encoding 'UTF-8'
      to each file ... Now everything works.
      Thanks for everything
Re^3: Mess with UTF-8, utf8 and raw encoding on live working platform
by mje (Curate) on Jun 02, 2011 at 14:35 UTC

    I don't think anyone can tell you precisely when you need to add an encoding layer to files opened for reading, because it depends on whether the data in the input file is encoded (as UTF-8, in your case) or not. If private_data is UTF-8 encoded, then you need to add the UTF-8 encoding layer to decode it on input. If path_index is intended to be a UTF-8 encoded file, you need to encode the data when writing it.

    When you say "prints all the data into HTML page", are you running some sort of CGI or other web server code? If so, you need to declare the charset in the HTML, and probably in the Content-Type header too.
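
    A minimal sketch of what that could look like in a plain CGI script, assuming the page is meant to be served as UTF-8. The print_page helper and the sample string are hypothetical. It also points at one plausible explanation for the symptom in the question: text read through <:encoding(UTF-8) and then printed to a STDOUT without an encoding layer goes out as Latin-1 bytes, which a browser expecting UTF-8 shows as replacement characters, while reading and printing raw bytes passes the UTF-8 through untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: prints a small HTML page containing $text to $fh.
# The charset is declared both in the HTTP header and in the HTML.
# (HTML escaping of $text is left out to keep the sketch short.)
sub print_page {
    my ($fh, $text) = @_;
    print {$fh} "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    print {$fh} qq{<!DOCTYPE html>\n<html><head><meta charset="UTF-8">}
              . qq{<title>Index</title></head>\n};
    print {$fh} "<body><p>$text</p></body></html>\n";
}

# The output handle gets an encoding layer, so text strings are
# converted to UTF-8 bytes on the way out.
binmode STDOUT, ':encoding(UTF-8)';

# Stands in for data read from the index file with <:encoding(UTF-8).
print_page(\*STDOUT, "Gr\x{FC}\x{DF}e");
```

    The point of the design is that input and output are treated the same way: decode when reading, encode when printing.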