fibokowalsky has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I want to read data from a file and process it, but there is a problem with the character encoding. http://ufal.mff.cuni.cz/~hajic/courses/npfl067/TEXTCZ1.txt Simply put, I am not able to read the file, convert its content to UTF-8, and print it correctly to STDOUT. I tried opening it as UTF-16 (detected by guess_encoding from Encode::Guess) and as UCS-2 (detected by enca on Unix), but neither worked:
open my $FILE, "<:encoding(UCS-2)", 'TEXTCZ1.txt'
    or die "Cannot open the TEXTCZ1.txt! \n";
my $content = do { local $/; <$FILE> };
I can get the text as UTF-8 in a different way, but it is not nice:
my $text = get("http://ufal.mff.cuni.cz/~hajic/courses/npfl067/TEXTCZ1.txt");
open my $FILE, '>', 'file.txt' or die "$!\n";
print $FILE $text;
open $FILE, "<:encoding(iso-8859-2)", 'file.txt' or die "$!\n";
open my $NEWFILE, ">:encoding(utf-8)", 'fileutf8.txt' or die "$!\n";
print $NEWFILE $_ while <$FILE>;
open $FILE, "<:encoding(utf-8)", 'fileutf8.txt' or die "$!\n";
$content = do { local $/; <$FILE> };
Do you have any suggestions for reading the file directly as UTF-8? I also do not know why the text from get() is different from the text in the file.

Re: Reading File with Czech text inside
by ig (Vicar) on Feb 27, 2011 at 04:14 UTC

    Does this give you what you are looking for?

    use strict;
    use warnings;
    use LWP::Simple;
    use Encode;

    my $text = decode(
        'iso-8859-2',
        get("http://ufal.mff.cuni.cz/~hajic/courses/npfl067/TEXTCZ1.txt")
    );
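To illustrate what decode does in the snippet above, here is a minimal sketch that needs no network access. The sample bytes are an assumption chosen for illustration (the Czech word "český" written as ISO-8859-2 bytes); they are not taken from TEXTCZ1.txt.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical sample: the bytes of "český" as stored in an
# ISO-8859-2 file (0xE8 is č, 0xFD is ý in that encoding).
my $bytes = "\xE8esk\xFD";

# decode() turns encoded bytes into a string of Perl characters.
my $chars = decode('iso-8859-2', $bytes);

# encode() turns characters back into bytes, here as UTF-8.
my $utf8 = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 bytes: %d\n", length($chars), length($utf8);
```

The two accented letters each take one byte in ISO-8859-2 but two bytes in UTF-8, which is why the same text looks different depending on which encoding the bytes are assumed to be in.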
      Thank you for the reply. Using get, I can read the text normally as UTF-8, but I want to read the already-downloaded file with open. The script should not require a connection.
Re: Reading File with Czech text inside
by CountZero (Bishop) on Feb 27, 2011 at 17:50 UTC
    I tried this (I could not get access to your URL, so I used another Czech text file):
    use Modern::Perl;
    use IO::All;

    my $text < io('http://web.etf.cuni.cz/ETF-104-version1.txt');
    say $text;
    And the text came through nicely (showing just one line with a lot of Czech characters):
    Vzhledem k výši provozních nákladů při barevném režimu tisku či barevném kopírování vyhlaš zaměstnance a spolupracovníky fakulty tato pravidla a poplatky za užívání:

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Thank you for the reply. Using get, I can read the text normally as UTF-8, but I want to read the already-downloaded file with open. The script should not require a connection.
        No problem. I downloaded the text, copied and pasted it into Notepad and saved it as UTF-8 text.

        The following program reads it without any problem:

        use Modern::Perl;
        use IO::All;

        my $text < io('./czech-test.txt');
        say $text;
        And yes, IO::All is that simple, it takes care of all the low-level details and doesn't care if the source or destination of your operation is a file, an FTP server, HTTP(s), a socket, a pipe, a scalar variable, a database, ...

        It just Does the Right Thing.

        CountZero


Re: Reading File with Czech text inside
by ikegami (Patriarch) on Feb 27, 2011 at 21:04 UTC

    The second snippet without a temp file:

    use LWP::Simple qw( get );   # provides get()
    use Encode qw( decode );

    my $url = "http://ufal.mff.cuni.cz/~hajic/courses/npfl067/TEXTCZ1.txt";
    my $content = decode('iso-8859-2', get($url));

    Encode

    If you also want to write to the files,

    {
        open my $FH, ">:encoding(iso-8859-2)", 'file_iso-8859-2.txt' or die "$!\n";
        print $FH $content;
    }
    {
        open my $FH, ">:encoding(utf-8)", 'file_utf-8.txt' or die "$!\n";
        print $FH $content;
    }
      Thank you for the reply. Using get, I can read the text normally as UTF-8, but I want to read the already-downloaded file with open my $FILE .... "filename";. The script should not require a connection.

        If the file is encoded using iso-8859-2 (like file_iso-8859-2.txt above),

        open my $FH, "<:encoding(iso-8859-2)", 'file_iso-8859-2.txt' or die "$!\n";

        If the file is encoded using UTF-8 (like file_utf-8.txt above),

        open my $FH, "<:encoding(UTF-8)", 'file_utf-8.txt' or die "$!\n";
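Putting the pieces of this reply together, reading a local file and printing it to STDOUT as UTF-8 might look like the sketch below. It assumes the downloaded file really is ISO-8859-2, as the decode call earlier in the thread suggests. The helper name slurp_decoded and the temporary sample file are hypothetical, added only so the example runs on its own; in the real script the path would be the downloaded TEXTCZ1.txt.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Create a small ISO-8859-2 sample file so the sketch is self-contained.
# (Hypothetical stand-in for the already-downloaded TEXTCZ1.txt.)
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;                          # write raw bytes, no translation
print {$tmp_fh} "\xE8esk\xFD text\n";     # "český text" in ISO-8859-2
close $tmp_fh or die "Cannot close $path: $!\n";

# Slurp a whole file, decoding it from the given encoding into characters.
sub slurp_decoded {
    my ($file, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $file
        or die "Cannot open $file: $!\n";
    local $/;                             # slurp mode: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

my $content = slurp_decoded($path, 'iso-8859-2');

# Encode characters as UTF-8 on output; without an output layer,
# printing characters above 0xFF triggers a "Wide character" warning.
binmode STDOUT, ':encoding(UTF-8)';
print $content;
```

The input layer handles decoding and the output layer handles encoding, so no intermediate UTF-8 copy of the file is needed.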
Re: Reading File with Czech text inside
by elef (Friar) on Feb 27, 2011 at 07:41 UTC
    Try open my $FILE, "<:raw:perlio:encoding(UTF-16LE):crlf", 'TEXTCZ1.txt' or die "Cannot open TEXTCZ1.txt! \n";
    See http://perlmonks.org/?node_id=868428
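Before committing to a UTF-16 layer, it may be worth checking whether the file starts with a byte-order mark, since UTF-16 files usually do and ISO-8859-2 files do not. The sketch below is only a heuristic: the helper names bom_encoding and file_bom_encoding are hypothetical, and a missing BOM does not prove which encoding a file uses.

```perl
use strict;
use warnings;

# Guess an encoding from a leading byte-order mark only.
# Returns nothing (undef in scalar context) when no BOM is present.
# Note: a UTF-32LE BOM (FF FE 00 00) would also match the UTF-16LE
# test here; this sketch does not try to tell the two apart.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return;
}

# Read the first few raw bytes of a file and apply the check above.
sub file_bom_encoding {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "Cannot open $file: $!\n";
    read $fh, my $head, 4;
    close $fh;
    return bom_encoding($head // '');
}

# Usage with in-memory samples (assumed bytes, for illustration only).
for my $sample ("\xFF\xFEa\x00", "\xE8esk\xFD") {
    my $guess = bom_encoding($sample);
    print defined $guess ? "BOM suggests $guess\n" : "no BOM found\n";
}
```

If no BOM is found and decoding as iso-8859-2 yields readable Czech, the single-byte encoding is the more likely candidate.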