Need help with binary mode and UTF-8 characters

philrennert1 has asked for the wisdom of the Perl Monks concerning the following question:

For the first time, I'm dealing with input from various non-English languages. To start with, I want to read in a line and then print it out without turning the characters to garbage. I've tried


open(OUT,">C:\\data");
binmode(OUT, ":utf8");
while(<IN>)
  {
    chomp;chomp;
    unless(/\S/){next}
    $_=Encode::decode('UTF-8',$_);
    print OUT $_;
  }
close(IN);

but it comes out fried. (I look at the output in Notepad++ to see what's happened.) How hard should it be to simply print out what comes in? There must be a simple way...


Replies are listed 'Best First'.
Re: Need help with binary mode and UTF-8 characters by ikegami (Patriarch) on Dec 10, 2009 at 02:45 UTC
Peeking at the text before you decode it doesn't make much sense. Calling chomp twice makes no sense. Calling chomp at all doesn't make sense if you don't add back a newline. That said, none of those problems should give you garbage. Maybe the file wasn't in UTF-8? Maybe your viewer isn't treating the file as UTF-8? Maybe the problem is in how you open `IN`? Cleaned up code:

open(my $fh_in,  '<:encoding(UTF-8)', 'C:\\data.in')  or die $!;
open(my $fh_out, '>:encoding(UTF-8)', 'C:\\data.out') or die $!;
while (<$fh_in>)
  {
    chomp;
    # ...
    print $fh_out "$_\n";
  }
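One way output can end up "fried" is double encoding: a string that still holds raw UTF-8 bytes is printed through a UTF-8 output layer, so each byte gets encoded again. The sketch below only illustrates that mechanism; it is not a diagnosis of the code in the question. The sample bytes and the helper `written_bytes` are made up for the example, and it writes to an in-memory file instead of `C:\\data.out`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the two UTF-8 bytes for "é" (U+00E9).
my $bytes = "\xC3\xA9";

# Hypothetical helper: print a string through a UTF-8 output layer
# into an in-memory file and return the bytes that were written.
sub written_bytes {
    my ($string) = @_;
    my $out = '';
    open(my $fh, '>:encoding(UTF-8)', \$out) or die $!;
    print {$fh} $string;
    close($fh) or die $!;
    return $out;
}

my $double = written_bytes($bytes);                   # not decoded first
my $clean  = written_bytes(decode('UTF-8', $bytes));  # decoded first

printf "undecoded: %d bytes, decoded: %d bytes\n",
    length($double), length($clean);
# prints "undecoded: 4 bytes, decoded: 2 bytes"
```

If the input is decoded exactly once (by a layer or by `decode`, not both) and encoded exactly once on output, the bytes written match the bytes read.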
Re^2: Need help with binary mode and UTF-8 characters by philrennert1 (Novice) on Dec 10, 2009 at 14:08 UTC
Thanks. Yes, chomping twice didn't make sense. The point was to string a lot of lines together into one, so chomping once and then adding \n at the end did. Yes, I needed to add `binmode(IN, ':encoding(UTF-8)');` and `binmode(OUT, ':encoding(UTF-8)');`
Re: Need help with binary mode and UTF-8 characters by desemondo (Hermit) on Dec 10, 2009 at 01:57 UTC
Have a gander at PerlUniTut. I suspect your problem is that you need to tell Perl that your input source is Unicode, e.g.

use Encode qw(encode decode);
while ( my $readline = <$fh> ){
    my $foo = decode('UTF-8', $readline);
}

etc. I could be way off though...

Update: Just realised... Unless you actually mean to change the encoding of your data from UTF-8 into something else, you'll need to re-encode it into UTF-8 after you've finished processing it in Perl.

use Encode qw(encode decode);
while ( my $readline = <$fh> ){
    my $foo = decode('UTF-8', $readline);
    # do stuff here
    $foo = encode('UTF-8', $foo);
    print {$output_fh} $foo;
}
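To make the decode/encode round trip concrete, here is a minimal sketch. The sample string is an assumption chosen for illustration, written with byte escapes so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: "café" as UTF-8 bytes ("é" is the pair C3 A9).
my $bytes = "caf\xC3\xA9";

my $chars = decode('UTF-8', $bytes);   # bytes -> characters
my $again = encode('UTF-8', $chars);   # characters -> bytes

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints "bytes: 5, characters: 4"
print "round trip ok\n" if $again eq $bytes;
```

Between `decode` and `encode`, functions such as `length` and regex matches work on characters, which is the reason to decode before processing the text.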
Re^2: Need help with binary mode and UTF-8 characters by philrennert1 (Novice) on Dec 10, 2009 at 14:04 UTC
Okay, thanks for the pointer: that tutorial did it. Yes, I wasn't telling Perl about the Unicode. Adding `binmode(IN, ':encoding(UTF-8)');` and `binmode(OUT, ':encoding(UTF-8)');` immediately after opening these files did it.
Re^3: Need help with binary mode and UTF-8 characters by ikegami (Patriarch) on Dec 10, 2009 at 14:56 UTC
Actually, you were. The "decode" did the same thing as the first binmode.
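A small sketch of that equivalence, assuming well-formed UTF-8 input (the two approaches can differ in how they handle malformed bytes). The sample line is made up, and in-memory filehandles stand in for the real input file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample line: "naïve\n" as UTF-8 bytes.
my $bytes = "na\xC3\xAFve\n";

# Route 1: read raw bytes, then decode by hand.
open(my $raw, '<:raw', \$bytes) or die $!;
my $manual = decode('UTF-8', scalar <$raw>);
close($raw);

# Route 2: let the I/O layer decode while reading.
open(my $layered, '<:encoding(UTF-8)', \$bytes) or die $!;
my $auto = <$layered>;
close($layered);

print "same string\n" if $manual eq $auto;
```

Doing both on the same handle (a decoding layer plus an explicit `decode`) would decode twice, so pick one of the two routes per filehandle.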