in reply to Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html...

Are you sure the simplified snippet actually reproduces the problem? The file handle is referenced as both File and FILE, which calls that assumption into question.

What does the following show when you put "föö" in file.txt?

use Data::Dumper   qw( Dumper );
use HTML::Entities qw( encode_entities );

open(FILE, '<', 'file.txt');
binmode(FILE);
my $content;
{
    local $/;
    $content = <FILE>;
}
close(FILE);

print "Content-Type: text/html; encoding=utf-8\n\n";

local $Data::Dumper::Useqq = 1;
print '<pre>';
print encode_entities(Dumper($content));
print '</pre>';
print $content;

Re^2: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html...
by isync (Hermit) on Aug 27, 2007 at 23:15 UTC
    open(FILE, '<', 'file.txt');
    # binmode(FILE,":utf8");
    my $content;
    {
        local $/;
        $content = <FILE>;
    }
    close(FILE);
    # binmode(STDOUT,":utf8");

    use Data::Dumper;
    use HTML::Entities;

    print "Content-Type: text/html; encoding=utf-8\n\n";

    local $Data::Dumper::Useqq = 1;
    print '<pre>';
    print encode_entities(Dumper($content));
    print '</pre>';
    print $content;

    gives me:

    $VAR1 = "f\303\266\303\266\n";
    föö

    (when viewed in encoding=utf-8)

    open(FILE, '<', 'file.txt');
    binmode(FILE,":utf8");
    my $content;
    {
        local $/;
        $content = <FILE>;
    }
    close(FILE);
    binmode(STDOUT,":utf8");

    use Data::Dumper;
    use HTML::Entities;

    print "Content-Type: text/html; encoding=utf-8\n\n";

    local $Data::Dumper::Useqq = 1;
    print '<pre>';
    print encode_entities(Dumper($content));
    print '</pre>';
    print $content;

    gives me:

    $VAR1 = "f\x{f6}\x{f6}\n";
    föö

    (but I had to set the browser encoding to UTF-8 manually for this one; it had defaulted to Western)
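
    A side note on the manual encoding switch: the header in the snippets above says encoding=utf-8, whereas the parameter name defined for Content-Type is charset. That may be one reason the browser fell back to a Western default, though this thread does not confirm it. A minimal sketch, assuming the page is meant to go out as UTF-8; html_header is a hypothetical helper, not something from the original script:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not part of the original script): builds a CGI
    # header that declares the character set via the "charset" parameter.
    sub html_header {
        my ($charset) = @_;
        return "Content-Type: text/html; charset=$charset\n\n";
    }

    # Encode everything printed to STDOUT as UTF-8, as in the snippet above.
    binmode(STDOUT, ':utf8');

    print html_header('utf-8');
    print "<p>f\x{f6}\x{f6}</p>\n";    # character string, encoded on output
    ```

    With the charset declared in the header, the browser should not need to guess.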

    The bad news is: setting the input filehandle and STDOUT to :utf8 in my more complex script makes no difference. Is that an indicator that the string gets "compromised" with non-utf8 data somewhere in between? Any ideas?

    UPDATE: Solved! Woa! Got it!! Was a stupid error:
    open(FILE,"<:utf8", "$file");
    binmode(FILE);

    With Joost's hint "never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument)" in mind, I could spot this double declaration in my code. The problem is always in front of the screen...
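
    To make the effect of that stray binmode() visible, here is a minimal sketch. It assumes a scratch file holding the six UTF-8 octets of "föö\n"; the slurp helper and its $reset_layers flag are hypothetical names used only for illustration.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw( tempfile );

    # Write the six UTF-8 octets of "föö\n" to a scratch file.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    binmode($tmp);
    print {$tmp} "f\303\266\303\266\n";
    close($tmp);

    # Hypothetical helper: slurp $path through a :utf8 layer and, if asked,
    # call binmode() without a layer argument afterwards, as in the bug above.
    sub slurp {
        my ($file, $reset_layers) = @_;
        open(my $fh, '<:utf8', $file) or die "Cannot open $file: $!";
        binmode($fh) if $reset_layers;    # no second argument: back to raw octets
        local $/;
        my $content = <$fh>;
        close($fh);
        return $content;
    }

    my $decoded = slurp($path, 0);    # 4 characters: f, ö, ö, newline
    my $raw     = slurp($path, 1);    # 6 octets: the :utf8 layer was undone

    printf "with :utf8 only:       length %d\n", length($decoded);
    printf "followed by binmode(): length %d\n", length($raw);
    ```

    The second call reads undecoded octets even though the open asked for :utf8, which is the situation described above.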

    Nevertheless, should it be $VAR1 = "f\303\266\303\266\n"; or $VAR1 = "f\x{f6}\x{f6}\n"; for proper utf8 output? (the latter, right?)

      f\303\266\303\266\n is UTF-8 encoded.
      If it's a string of chars (the UTF-8 flag is set), you'll get UTF-8 when you print to a UTF-8 filehandle.
      If it's a string of octets (the UTF-8 flag is clear), you'll get UTF-8 when you print to a raw filehandle.

      f\x{f6}\x{f6}\n is iso-latin-1 encoded.
      When you print to a UTF-8 filehandle, Perl will assume it's iso-latin-1 and convert it to UTF-8.
      When you print to a raw filehandle, you'll get those exact octets.
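
      The cases above can be tried out with in-memory filehandles. This is only a sketch: octets_written is a hypothetical helper, and the last case (octets printed through a :utf8 layer) is an addition that is not discussed above.

      ```perl
      use strict;
      use warnings;
      use Encode qw( decode );

      # Hypothetical helper: print $string to an in-memory handle opened with
      # the given layer and return the octets that ended up in the buffer.
      sub octets_written {
          my ($string, $layer) = @_;
          my $buffer = '';
          open(my $out, ">$layer", \$buffer) or die "Cannot open in-memory handle: $!";
          print {$out} $string;
          close($out);
          return $buffer;
      }

      my $octets = "f\303\266\303\266\n";       # UTF-8 encoded octets
      my $chars  = decode('UTF-8', $octets);    # the characters f, ö, ö, newline

      my $raw_from_octets  = octets_written($octets, ':raw');   # octets pass through
      my $utf8_from_chars  = octets_written($chars,  ':utf8');  # encoded as UTF-8
      my $raw_from_chars   = octets_written($chars,  ':raw');   # one octet per char
      my $utf8_from_octets = octets_written($octets, ':utf8');  # encoded a second time

      printf "%-16s %2d octets\n", $_->[0], length($_->[1]) for (
          [ 'octets -> :raw',  $raw_from_octets  ],
          [ 'chars  -> :utf8', $utf8_from_chars  ],
          [ 'chars  -> :raw',  $raw_from_chars   ],
          [ 'octets -> :utf8', $utf8_from_octets ],
      );
      ```

      The first two cases produce the same six UTF-8 octets, the third produces four iso-latin-1 octets, and the last one is the usual recipe for garbled output.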

        That made everything a lot clearer and the $Data::Dumper::Useqq switch is EXTREMELY helpful! Thanks!
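
        For reference, a minimal sketch of how that switch tells the two kinds of string apart. It assumes the same "föö\n" data as above and uses Encode's decode to get the character string; the expected forms in the comments are the ones reported earlier in this thread.

        ```perl
        use strict;
        use warnings;
        use Data::Dumper qw( Dumper );
        use Encode qw( decode );

        # Useqq makes Dumper emit double-quoted strings with escapes, so
        # non-ASCII data shows up unambiguously.
        $Data::Dumper::Useqq = 1;

        my $octets = "f\303\266\303\266\n";       # UTF-8 encoded octets
        my $chars  = decode('UTF-8', $octets);    # decoded character string

        my $dump_octets = Dumper($octets);    # form seen above: "f\303\266\303\266\n"
        my $dump_chars  = Dumper($chars);     # form seen above: "f\x{f6}\x{f6}\n"

        print $dump_octets, $dump_chars;
        ```

        Octal escapes for each octet point to undecoded data, while \x{...} escapes point to decoded characters.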