isync has asked for the wisdom of the Perl Monks concerning the following question:

My file was written with gedit as UTF-8 under Linux. I uploaded it in binary mode, and when I look at it in the browser, all umlauts etc. are OK (although I need to set the browser's encoding to UTF-8 manually).

I use a small script to read this .txt file and present it as HTML on a standard web page, which I serve as UTF-8 (which is why the browser recognizes the encoding properly this time). But the umlauts etc. on the page then end up garbled! ö is Ã¶, ä is Ã¤, ü is Ã¼... It looks like a strange double-encoding bug, but I do not alter the data! Manually setting the browser to other encodings is of no use.
open(FILE, "<$file");
binmode(FILE);
my @Lines = <FILE>;
close(FILE);
$page = "<html>". join("",@Lines) ."</html>";
print "Content-Type: text/html; encoding=utf-8\n\n";
print $page;
(simplified snippet) ---updated: filehandle is always "FILE"

Re: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html...
by Joost (Canon) on Aug 27, 2007 at 18:44 UTC
    You need to open the $file as utf-8 (or set the utf-8 layer using binmode), and set the STDOUT mode to utf-8 too:
    binmode(STDOUT, ":utf8");      # encode on output
    open(FILE, "<:utf8", $file);   # decode on input
    my @Lines = <FILE>;
    close(FILE);
    $page = "<html>" . join("", @Lines) . "</html>";
    # the header parameter is "charset", not "encoding"
    print "Content-Type: text/html; charset=utf-8\n\n";
    print $page;
    See also perlunicode

    update: also perlio - and note that you should never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument) on a (unicode) text file unless you're sure you know what you're doing.
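The advice above (decode on input, encode on output) can be sketched as a small round trip. This is only an illustration under stated assumptions: the temporary file and the in-memory handle standing in for STDOUT are hypothetical stand-ins, and `:encoding(UTF-8)` is used here as the stricter relative of the `:utf8` layer mentioned in the reply.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file: "föö\n" stored as UTF-8 octets.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);                              # raw: write the octets exactly as given
print {$tmp} "f\xC3\xB6\xC3\xB6\n";
close($tmp) or die "Cannot close $path: $!";

# Decode on the way in: the layer turns UTF-8 octets into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
my $text = do { local $/; <$in> };
close($in);

# Encode on the way out: an in-memory handle stands in for STDOUT here.
my $out_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$out_bytes) or die "Cannot open buffer: $!";
print {$out} $text;
close($out);

printf "characters read: %d, octets written: %d\n",
    length($text), length($out_bytes);
```

With this input the script should report 4 characters read and 6 octets written: inside the program each ö is one character, and it becomes two octets again only when it passes through the output layer.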

      Woa! Got it!! Was a stupid error:
      open(FILE,"<:utf8", "$file"); binmode(FILE);

      With your hint "never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument)" in mind, I could spot this double declaration in my code. The problem always sits in front of the screen...
      Doh! I need to carve this :utf8 thing for filehandles into wood! I forget it all the time... But it did not quite solve the problem, so I tried ikegami's suggestion (see below).
Re: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html...
by ikegami (Patriarch) on Aug 27, 2007 at 19:00 UTC
    Are you sure the simplified snippet actually reproduces the problem? That the file handle is referenced as both File and as FILE brings your assumption into question.

    What does the following show when you put "föö" in file.txt?

    use Data::Dumper qw( Dumper );
    use HTML::Entities qw( encode_entities );

    open(FILE, '<', 'file.txt');
    binmode(FILE);
    my $content;
    { local $/; $content = <FILE>; }
    close(FILE);

    print "Content-Type: text/html; encoding=utf-8\n\n";
    local $Data::Dumper::Useqq = 1;
    print '<pre>';
    print encode_entities(Dumper($content));
    print '</pre>';
    print $content;
      open(FILE, '<', 'file.txt');
      # binmode(FILE,":utf8");
      my $content;
      { local $/; $content = <FILE>; }
      close(FILE);
      # binmode(STDOUT,":utf8");
      use Data::Dumper;
      use HTML::Entities;
      print "Content-Type: text/html; encoding=utf-8\n\n";
      local $Data::Dumper::Useqq = 1;
      print '<pre>';
      print encode_entities(Dumper($content));
      print '</pre>';
      print $content;
      gives me:
      $VAR1 = "f\303\266\303\266\n";
      föö
      (when viewed in encoding=utf-8)

      open(FILE, '<', 'file.txt');
      binmode(FILE,":utf8");
      my $content;
      { local $/; $content = <FILE>; }
      close(FILE);
      binmode(STDOUT,":utf8");
      use Data::Dumper;
      use HTML::Entities;
      print "Content-Type: text/html; encoding=utf-8\n\n";
      local $Data::Dumper::Useqq = 1;
      print '<pre>';
      print encode_entities(Dumper($content));
      print '</pre>';
      print $content;
      gave me:
      $VAR1 = "f\x{f6}\x{f6}\n";
      föö
      (but I needed to set encoding=utf-8 manually on this one, was western before..)

      The bad news is: setting the input filehandle and STDOUT to :utf8 in my more complex script changes nothing. Does that indicate that the string gets "compromised" with non-UTF-8 data somewhere in between? Any ideas?

      UPDATE: Solved! Woa! Got it!! Was a stupid error:
      open(FILE,"<:utf8", "$file"); binmode(FILE);

      With Joost's hint "never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument)" in mind, I could spot this double declaration in my code. The problem always sits in front of the screen...
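A minimal sketch of the mistake described above, for anyone who wants to see it in isolation: a bare `binmode` after opening with `:utf8` switches the handle back to octets. The temporary file and the helper name `slurp_length` are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file: "föö\n" stored as UTF-8 octets.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "f\xC3\xB6\xC3\xB6\n";
close($tmp) or die "Cannot close $path: $!";

# Read the whole file and return the length of what Perl sees.
# When $reset is true, a bare binmode() follows the open, as in the buggy code.
sub slurp_length {
    my ($file, $reset) = @_;
    open(my $fh, '<:utf8', $file) or die "Cannot open $file: $!";
    binmode($fh) if $reset;        # equivalent to binmode($fh, ':raw')
    my $content = do { local $/; <$fh> };
    close($fh);
    return length($content);
}

my $with_layer  = slurp_length($path, 0);
my $after_reset = slurp_length($path, 1);

print "with :utf8 layer:     $with_layer\n";
print "after bare binmode(): $after_reset\n";
```

The first length should be 4 (characters) and the second 6 (octets), which shows that the second read returned undecoded UTF-8 bytes even though the `open` asked for `:utf8`.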

      Nevertheless, should it be $VAR1 = "f\303\266\303\266\n"; or $VAR1 = "f\x{f6}\x{f6}\n"; for proper utf8 output? (the latter, right?)

        f\303\266\303\266\n is UTF-8 encoded.
        If it's a string of chars (the UTF-8 flag is set), you'll get UTF-8 when you print to a UTF-8 filehandle.
        If it's a string of octets (the UTF-8 flag is clear), you'll get UTF-8 when you print to a raw filehandle.

        f\x{f6}\x{f6}\n is iso-latin-1 encoded.
        When you print to a UTF-8 filehandle, Perl will assume it's iso-latin-1 and convert it to UTF-8.
        When you print to a raw filehandle, you'll get those exact octets.
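The cases listed above can be checked with a short sketch that prints each kind of string to each kind of handle and shows the resulting octets in hex. In-memory filehandles stand in for real output here, and the helper name `printed_octets` is hypothetical. The sketch also includes the combination that is not listed above, octets printed to a UTF-8 handle, which yields the double encoding described in the original question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Print $string to an in-memory handle opened with $layer and
# return the octets that arrived, as space-separated hex.
sub printed_octets {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "Cannot open buffer: $!";
    print {$fh} $string;
    close($fh);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $octets = "f\xC3\xB6";                # UTF-8 octets, as read from a raw handle
my $chars  = decode('UTF-8', $octets);   # character string, same as "f\x{f6}"

print "chars  -> :utf8  ", printed_octets($chars,  ':utf8'), "\n";  # valid UTF-8
print "octets -> :raw   ", printed_octets($octets, ':raw'),  "\n";  # valid UTF-8
print "octets -> :utf8  ", printed_octets($octets, ':utf8'), "\n";  # double-encoded
print "chars  -> :raw   ", printed_octets($chars,  ':raw'),  "\n";  # Latin-1 octet
```

The first two lines should both show `66 c3 b6`. The third should show `66 c3 83 c2 b6`, which a browser in UTF-8 mode renders as "fÃ¶", and the fourth should show `66 f6`, a lone Latin-1 octet that is not valid UTF-8.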