Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html...

isync has asked for the wisdom of the Perl Monks concerning the following question:

Re: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by Joost (Canon) on Aug 27, 2007 at 18:44 UTC
You need to open the $file as UTF-8 (or set the UTF-8 layer using binmode), and set the STDOUT mode to UTF-8 too:

```perl
binmode(STDOUT, ":utf8");
open(FILE, "<:utf8", $file) or die "Cannot open $file: $!";
my @Lines = <FILE>;
close(FILE);
$page = "<html>" . join("", @Lines) . "</html>";
# The HTTP header parameter is "charset", not "encoding"
print "Content-Type: text/html; charset=utf-8\n\n";
print $page;
```

See also perlunicode.

Update: see also perlio, and note that you should never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument) on a (Unicode) text file unless you're sure you know what you're doing.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by isync (Hermit) on Aug 27, 2007 at 23:24 UTC
Woa! Got it!! Was a stupid error:

```perl
open(FILE,"<:utf8", "$file");
binmode(FILE);    # no second argument: resets the handle to :raw, undoing :utf8
```

With your hint "never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument)" in mind, I could spot this double declaration in my code. The problem is always in front of the screen...
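To see why that stray binmode() call matters, here is a minimal sketch, assuming a scratch file created with File::Temp that holds the UTF-8 octets for "föö\n". The helper name slurp_length and the test data are made up for illustration and are not from the original script. It reads the same file twice through a "<:utf8" handle, once untouched and once with binmode() called without a layer, and compares the lengths.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Write the UTF-8 octets for "föö\n" to a scratch file (illustrative test data).
my ($out, $path) = tempfile( UNLINK => 1 );
binmode($out);
print {$out} "f\xC3\xB6\xC3\xB6\n";
close($out) or die "close: $!";

# Hypothetical helper: slurp the file through a :utf8 handle and return the
# length of what was read. If $reset_layers is true, binmode() is called
# without a second argument right after open, as in the buggy script.
sub slurp_length {
    my ($reset_layers) = @_;
    open(my $in, '<:utf8', $path) or die "open $path: $!";
    binmode($in) if $reset_layers;    # equivalent to :raw, drops the decoding
    local $/;
    my $content = <$in>;
    close($in);
    return length($content);
}

my $decoded_len = slurp_length(0);    # characters: f, ö, ö, newline
my $raw_len     = slurp_length(1);    # octets: each ö is two bytes

print "with :utf8 layer: $decoded_len\n";
print "after binmode():  $raw_len\n";
```

If the two numbers differ, the handle is no longer decoding, and printing that data to a :utf8 STDOUT would encode it a second time.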
Re^2: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by isync (Hermit) on Aug 27, 2007 at 23:06 UTC
Doh! I need to engrave this :utf8 thing on filehandles in wood! I forget it all the time... But it did not quite solve the problem. So I tried ikegami's suggestion (see below).
Re: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by ikegami (Patriarch) on Aug 27, 2007 at 19:00 UTC
Are you sure the simplified snippet actually reproduces the problem? That the file handle is referenced as both `File` and `FILE` calls your assumption into question. What does the following show when you put "föö" in file.txt?

```perl
use Data::Dumper qw( Dumper );
use HTML::Entities qw( encode_entities );

open(FILE, '<', 'file.txt') or die "Cannot open file.txt: $!";
binmode(FILE);    # deliberately raw, so the dump shows the octets on disk

my $content;
{ local $/; $content = <FILE>; }    # slurp the whole file
close(FILE);

print "Content-Type: text/html; charset=utf-8\n\n";
local $Data::Dumper::Useqq = 1;     # escape non-ASCII in the dump
print '<pre>';
print encode_entities(Dumper($content));
print '</pre>';
print $content;
```
Re^2: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by isync (Hermit) on Aug 27, 2007 at 23:15 UTC
```perl
open(FILE, '<', 'file.txt');
# binmode(FILE,":utf8");
my $content;
{ local $/; $content = <FILE>; }
close(FILE);
# binmode(STDOUT,":utf8");
use Data::Dumper;
use HTML::Entities;
print "Content-Type: text/html; encoding=utf-8\n\n";
local $Data::Dumper::Useqq = 1;
print '<pre>';
print encode_entities(Dumper($content));
print '</pre>';
print $content;
```

gives me:

```
$VAR1 = "f\303\266\303\266\n";
föö
```

(when viewed in encoding=utf-8)

```perl
open(FILE, '<', 'file.txt');
binmode(FILE,":utf8");
my $content;
{ local $/; $content = <FILE>; }
close(FILE);
binmode(STDOUT,":utf8");
use Data::Dumper;
use HTML::Entities;
print "Content-Type: text/html; encoding=utf-8\n\n";
local $Data::Dumper::Useqq = 1;
print '<pre>';
print encode_entities(Dumper($content));
print '</pre>';
print $content;
```

gave me:

```
$VAR1 = "f\x{f6}\x{f6}\n";
föö
```

(but I needed to set encoding=utf-8 manually on this one, it was western before..)

The bad news is: setting the input filehandle and STDOUT to :utf8 in my more complex script gives no positive change. Is that an indicator that the string gets "compromised" with non-UTF-8 data somewhere in between? Any ideas?

UPDATE: Solved! Woa! Got it!! Was a stupid error:

```perl
open(FILE,"<:utf8", "$file");
binmode(FILE);
```

With Joost's hint "never set the :raw layer (i.e. use binmode(FILEHANDLE) - without a second argument)" in mind, I could spot this double declaration in my code. The problem is always in front of the screen...

Nevertheless, should it be `$VAR1 = "f\303\266\303\266\n";` or `$VAR1 = "f\x{f6}\x{f6}\n";` for proper UTF-8 output? (The latter, right?)
Re^3: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by ikegami (Patriarch) on Aug 27, 2007 at 23:35 UTC
`f\303\266\303\266\n` is UTF-8 encoded. If it's a string of chars (the UTF-8 flag is set), you'll get UTF-8 when you print to a UTF-8 filehandle. If it's a string of octets (the UTF-8 flag is clear), you'll get UTF-8 when you print to a raw filehandle.

`f\x{f6}\x{f6}\n` is iso-latin-1 encoded. When you print to a UTF-8 filehandle, Perl will assume it's iso-latin-1 and convert it to UTF-8. When you print to a raw filehandle, you'll get those exact octets.
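The combinations described above can be checked without a browser. The following is a rough sketch, not part of the original replies: it prints each string to an in-memory filehandle opened with either the :raw or the :utf8 layer and shows the octets that were actually written as hex. The helper name octets_written is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string through $layer into an in-memory
# buffer and return the written octets as a hex string.
sub octets_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "open: $!";
    print {$fh} $string;
    close($fh) or die "close: $!";
    return unpack('H*', $buffer);
}

my $octets = "f\303\266\303\266";    # "föö" as UTF-8 encoded octets
my $chars  = "f\x{f6}\x{f6}";        # "föö" as characters (U+00F6 twice)

my $octets_raw  = octets_written(':raw',  $octets);
my $chars_raw   = octets_written(':raw',  $chars);
my $chars_utf8  = octets_written(':utf8', $chars);
my $octets_utf8 = octets_written(':utf8', $octets);

print "octets -> :raw   $octets_raw\n";     # valid UTF-8
print "chars  -> :raw   $chars_raw\n";      # Latin-1 octets
print "chars  -> :utf8  $chars_utf8\n";     # valid UTF-8
print "octets -> :utf8  $octets_utf8\n";    # encoded twice
```

The last case is the usual source of garbled output in a page served as UTF-8: already-encoded octets are treated as Latin-1 characters and encoded a second time.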
Re^4: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by isync (Hermit) on Aug 28, 2007 at 09:54 UTC
Re^5: Reading in utf-8 txt file gives garbled data when printed as part of utf-8 html... by Anonymous Monk on Apr 21, 2009 at 23:49 UTC