in thread Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html...

    open(FILE, '<', 'file.txt');
    # binmode(FILE,":utf8");
    my $content;
    {
        local $/;
        $content = <FILE>;
    }
    close(FILE);
    # binmode(STDOUT,":utf8");
    use Data::Dumper;
    use HTML::Entities;
    print "Content-Type: text/html; encoding=utf-8\n\n";
    local $Data::Dumper::Useqq = 1;
    print '<pre>';
    print encode_entities(Dumper($content));
    print '</pre>';
    print $content;
gives me:
    $VAR1 = "f\303\266\303\266\n";
    föö
(when viewed in encoding=utf-8)

    open(FILE, '<', 'file.txt');
    binmode(FILE,":utf8");
    my $content;
    {
        local $/;
        $content = <FILE>;
    }
    close(FILE);
    binmode(STDOUT,":utf8");
    use Data::Dumper;
    use HTML::Entities;
    print "Content-Type: text/html; encoding=utf-8\n\n";
    local $Data::Dumper::Useqq = 1;
    print '<pre>';
    print encode_entities(Dumper($content));
    print '</pre>';
    print $content;
gave me:
    $VAR1 = "f\x{f6}\x{f6}\n";
    föö
(but I had to set the browser encoding to utf-8 manually for this one; it defaulted to Western before.)
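
One possible reason the browser fell back to a Western encoding: in a Content-Type header the parameter that names the encoding is charset, not encoding. A minimal sketch, assuming a plain CGI setup; html_header() is a hypothetical helper name used only for illustration:

```perl
use strict;
use warnings;

# Build the CGI header. "charset" is the Content-Type parameter
# that tells the browser which encoding the body uses.
sub html_header {
    return "Content-Type: text/html; charset=utf-8\n\n";
}

# Encode everything printed to STDOUT as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

print html_header();
print "<p>f\x{f6}\x{f6}</p>\n";   # character string, encoded on output
```

With the header and the output layer agreeing on UTF-8, the browser should not need a manual override.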

The bad news is: setting :utf8 on the input filehandle and on STDOUT in my more complex script makes no difference. Does that indicate that the string gets "compromised" with non-utf8 data somewhere in between? Any ideas?

UPDATE: Solved! Whoa! Got it!! It was a stupid error:
    open(FILE,"<:utf8", "$file");
    binmode(FILE);

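With Joost's hint "never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument)" in mind, I could spot this double declaration in my code: the bare binmode(FILE) switched the handle back to raw, undoing the :utf8 layer I had just set in open. The problem is always in front of the screen...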
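
A minimal sketch of that pitfall, assuming a scratch file that holds the same "föö" test data. slurp_length() is a hypothetical helper, not part of my real script. It opens the file with :utf8 and optionally calls binmode() without a layer afterwards:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file holding "föö\n" as UTF-8 octets (removed at exit).
my ($tmp, $file) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "f\xC3\xB6\xC3\xB6\n";
close($tmp);

# Slurp the file through :utf8 and return the string length.
# With $reset true, call binmode() without a layer right after open().
sub slurp_length {
    my ($reset) = @_;
    open(my $in, '<:utf8', $file) or die "Cannot read $file: $!";
    binmode($in) if $reset;     # no second argument: back to raw octets
    local $/;
    my $content = <$in>;
    close($in);
    return length $content;
}

printf "layer kept:      %d\n", slurp_length(0);   # 4 characters
printf "after binmode(): %d\n", slurp_length(1);   # 6 octets
```

The second call reads the same file as undecoded octets, which is what garbled the output in my case.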

Nevertheless, should it be $VAR1 = "f\303\266\303\266\n"; or $VAR1 = "f\x{f6}\x{f6}\n"; for proper utf8 output? (the latter, right?)

Re^3: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html...
by ikegami (Patriarch) on Aug 27, 2007 at 23:35 UTC

    f\303\266\303\266\n is UTF-8 encoded.
    If it's a string of chars (the UTF-8 flag is set), you'll get UTF-8 when you print to a UTF-8 filehandle.
    If it's a string of octets (the UTF-8 flag is clear), you'll get UTF-8 when you print to a raw filehandle.

    f\x{f6}\x{f6}\n is iso-latin-1 encoded.
    When you print to a UTF-8 filehandle, Perl will assume it's iso-latin-1 and convert it to UTF-8.
    When you print to a raw filehandle, you'll get those exact octets.
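
    The cases above can be checked with in-memory filehandles. This is only a sketch: octets_written() is a hypothetical helper that prints a string through a given layer and returns the octets that end up in the buffer, as hex.

```perl
use strict;
use warnings;

# Print $string through a handle opened with $layer and
# return the octets that were written, as a hex string.
sub octets_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $octets = "f\303\266\303\266\n";   # already UTF-8 encoded
my $chars  = "f\x{f6}\x{f6}\n";       # characters U+00F6

print octets_written(':raw',  $octets), "\n";   # 66 c3 b6 c3 b6 0a
print octets_written(':utf8', $chars),  "\n";   # 66 c3 b6 c3 b6 0a
print octets_written(':raw',  $chars),  "\n";   # 66 f6 f6 0a
print octets_written(':utf8', $octets), "\n";   # 66 c3 83 c2 b6 c3 83 c2 b6 0a
```

    The last line is the double-encoded case: octets that are already UTF-8 get encoded a second time when sent through a :utf8 handle.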

      That made everything a lot clearer and the $Data::Dumper::Useqq switch is EXTREMELY helpful! Thanks!
        You can also add use utf8; but note that it only tells Perl that the script's own source is UTF-8. It does not set any layer on filehandles, so for input and output you still need binmode (or use open qw(:std :utf8);).
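
        As far as I know, use utf8; only changes how Perl reads string literals in the script itself. A small sketch, assuming the script file is saved as UTF-8; the variable names are illustrative only:

```perl
use strict;
use warnings;

# Without "use utf8" the literal is a string of octets.
my $octet_count = length "föö";

my $char_count;
{
    use utf8;                        # lexically scoped: applies to this block only
    $char_count = length "föö";     # literal is decoded into characters
}

print "without use utf8: $octet_count\n";   # 5
print "with use utf8:    $char_count\n";    # 3
```

        Filehandles are untouched either way, so the :utf8 layers on FILE and STDOUT are still needed.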