philrennert1 has asked for the wisdom of the Perl Monks concerning the following question:

For the first time, I'm dealing with input from various non-English languages. To start with, I want to read in a line and then print it out without turning the characters to garbage. I've tried
open(OUT,">C:\\data");
binmode(OUT, ":utf8");
while(<IN>) {
    chomp;chomp;
    unless(/\S/){next}
    $_=Encode::decode('UTF-8',$_);
    print OUT $_;
}
close(IN);
but it comes out fried. (I look at the output in Notepad++ to see what's happened.) How hard should it be to simply print out what comes in? There must be a simple way...

Replies are listed 'Best First'.
Re: Need help with binary mode and UTF-8 characters
by ikegami (Patriarch) on Dec 10, 2009 at 02:45 UTC

    Peeking at the text before you decode it doesn't make much sense.

    Calling chomp twice makes no sense.

    Calling chomp at all doesn't make sense if you don't add back a newline.

    That said, none of those problems should give you garbage. Maybe the file wasn't in UTF-8? Maybe your viewer isn't treating the file as UTF-8? Maybe the problem is in how you open IN?
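One way to test the first of those guesses (that the input may not be UTF-8 at all) is to ask Encode to decode the raw bytes strictly and see whether it complains. This is only a minimal sketch: `is_valid_utf8` is a hypothetical helper name, and the two sample strings stand in for a line read from the real file with no decoding layer applied.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # With FB_CROAK, decode() dies on malformed input and may modify
    # its argument, so work on a copy.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

my $good = encode('UTF-8', "caf\x{e9}");   # e-acute as two UTF-8 bytes
my $bad  = "caf\xE9";                      # e-acute as one Latin-1 byte

print is_valid_utf8($good) ? "valid\n" : "invalid\n";
print is_valid_utf8($bad)  ? "valid\n" : "invalid\n";
```

A line that fails this check was most likely written in some other encoding (Latin-1, a Windows code page, UTF-16), in which case no amount of UTF-8 decoding will make it print correctly.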

    Cleaned up code:

    open(my $fh_in,  '<:encoding(UTF-8)', 'C:\\data.in')  or die $!;
    open(my $fh_out, '>:encoding(UTF-8)', 'C:\\data.out') or die $!;
    while (<$fh_in>) {
        chomp;
        # ...
        print $fh_out "$_\n";
    }
      Thanks. Yes, chomping twice didn't make sense. The point was to string a lot of lines together into one, so chomping once and then adding \n at the end did. Yes, I needed to add
      binmode(IN, ':encoding(UTF-8)'); and binmode(OUT, ':encoding(UTF-8)');
Re: Need help with binary mode and UTF-8 characters
by desemondo (Hermit) on Dec 10, 2009 at 01:57 UTC
    Have a gander at PerlUniTut

    I suspect your problem is that you need to tell Perl that your input source is Unicode.

    eg.
    use Encode qw(encode decode);
    while ( my $readline = <$fh> ){
        my $foo = decode('UTF-8', $readline);
    }
    etc. I could be way off though...

    Update:
    Just realised... Unless you actually mean to change the encoding of your data from UTF-8 into something else, you'll need to re-encode it into UTF-8 after you've finished processing it in Perl.

    use Encode qw(encode decode);
    while ( my $readline = <$fh> ){
        my $foo = decode('UTF-8', $readline);
        #do stuff here
        $foo = encode('UTF-8', $foo);
        print {$output_fh} $foo;
    }
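A caveat worth checking with the manual encode step: it assumes the output handle has no encoding layer of its own. If the handle was opened with (or later given) `:encoding(UTF-8)`, the text gets encoded twice, which is a common source of "fried" output. The sketch below illustrates this with in-memory filehandles; the variable names are made up for the example and are not from the original code.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{e9}";   # a single character, e-acute

# Correct: the layer does the encoding, so print the character string as is.
my $once = '';
open(my $fh1, '>:encoding(UTF-8)', \$once) or die $!;
print {$fh1} $text;
close($fh1);

# Double encoding: encode() by hand AND let the layer encode again.
my $twice = '';
open(my $fh2, '>:encoding(UTF-8)', \$twice) or die $!;
print {$fh2} encode('UTF-8', $text);
close($fh2);

printf "encoded once:  %d bytes\n", length($once);
printf "encoded twice: %d bytes\n", length($twice);
```

In short, pick one mechanism per handle: either encode by hand and write to a raw handle, or put a layer on the handle and print decoded text.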
      Okay, thanks for the pointer: that tutorial did it. Yes, I wasn't telling Perl about the Unicode. Adding
      binmode(IN, ':encoding(UTF-8)'); and binmode(OUT, ':encoding(UTF-8)');
      immediately after opening these files did it.
        Actually, you were. The "decode" did the same thing as the first binmode.
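To make that equivalence concrete, here is a small sketch that reads the same bytes both ways, once with a raw handle plus an explicit `decode` and once through an `:encoding(UTF-8)` layer, and compares the results. It uses an in-memory filehandle and a made-up sample line in place of the real input file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = encode('UTF-8', "na\x{ef}ve\n");   # sample UTF-8 encoded input

# Approach 1: read raw bytes, then decode by hand.
open(my $raw, '<:raw', \$bytes) or die $!;
my $manual = decode('UTF-8', scalar <$raw>);
close($raw);

# Approach 2: let the I/O layer decode while reading.
open(my $layered, '<:encoding(UTF-8)', \$bytes) or die $!;
my $via_layer = <$layered>;
close($layered);

print $manual eq $via_layer ? "same characters\n" : "different\n";
```

Doing both on the same handle (a decoding layer and a manual `decode`) is what to avoid, since the data would then be decoded twice.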