telmonks has asked for the wisdom of the Perl Monks concerning the following question:

I have written a program that makes a concordance out of texts in the Wolof language. This language has acute and grave accents, diaereses, and n with tilde, all of which work fine. But the Wolof alphabet also includes an n with a right backwards tail: ŋ. This is Unicode U+014B, UTF-8 C5 8B.

Every file with this character in it produces messages like "Wide character in print at conc.pl line 99, <FILE> line 198993." and the output formatting is messed up for other lines. I am still looking at what this might be, but it only seems to occur in the files with that character...

I am doing all the "right" things, I think.
My perl version is (revision 5 version 10 subversion 0)
My env has LANG=en_US.UTF-8.
My code has use utf8
My input is opened open (FILE, "<:encoding(utf8)", $filename)
My output is STDOUT

I can get rid of the errors by using binmode STDOUT, ":encoding(utf8)" but the output is still messed up.

Any clues gratefully received...

Replies are listed 'Best First'.
Re: A 'special' character in utf8?
by moritz (Cardinal) on Jun 26, 2010 at 06:06 UTC
    I can get rid of the errors by using binmode STDOUT, ":encoding(utf8)" but the output is still messed up.

    The binmode call is the right thing to do.
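A minimal sketch of why the layer matters, and why only the files containing ŋ trigger the warning. Without an encoding layer, Perl writes characters up to U+00FF as single Latin-1 bytes with no warning, while a string containing anything above U+00FF produces "Wide character in print" and is written as UTF-8, so the output can end up as a mix of the two encodings. The helper name `written_bytes` and the in-memory handle standing in for STDOUT are assumptions made for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): print $text to an
# in-memory handle opened with $layer and return the bytes written, as hex.
sub written_bytes {
    my ($layer, $text) = @_;
    my $buf = '';
    open my $out, ">$layer", \$buf or die "open failed: $!";
    {
        no warnings 'utf8';    # hide "Wide character in print" for the raw case
        print {$out} $text;
    }
    close $out or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord } split //, $buf;
}

my %sample = (
    'e-acute (U+00E9)' => "\x{E9}",
    'eng (U+014B)'     => "\x{14B}",
);

# Compare the bytes produced without a layer and with :encoding(UTF-8).
for my $name (sort keys %sample) {
    printf "%-18s no layer: %-6s with layer: %s\n",
        $name,
        written_bytes('', $sample{$name}),
        written_bytes(':encoding(UTF-8)', $sample{$name});
}
```

With the layer in place, both characters should come out as valid UTF-8, which is what a UTF-8 console expects.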

    Maybe your console doesn't work properly with UTF-8 output? Try running this: $ perl -Mcharnames=:full -CS -wle 'print "\N{EURO SIGN}"'

    If this shows a Euro sign, everything is fine with your console, and it's indeed a problem in your script. If not, you know what needs fixing.

Re: A 'special' character in utf8?
by ikegami (Patriarch) on Jun 26, 2010 at 03:29 UTC
    use open ':std', ':encoding(UTF-8)';

    That will add :encoding(UTF-8) to STDIN, STDOUT and STDERR.
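As a rough sketch of how that pragma could fit into a concordance-style script: the file is opened without an explicit layer, and the word counts are printed to STDOUT without a separate binmode call. The sample tokens are placeholders rather than real Wolof text, and the temporary file and the `%count` hash are assumptions made so the example is self-contained.

```perl
use strict;
use warnings;
use utf8;                               # this source file contains a literal ŋ
use open ':std', ':encoding(UTF-8)';    # default layer for open() plus STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

# Write a small sample file. The tokens are placeholders, not real Wolof text.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';       # File::Temp opens this handle itself
print {$tmp} "ŋa bé ŋa\n";
close $tmp or die "close failed: $!";

# No explicit layer here: `use open` supplies :encoding(UTF-8).
open my $in, '<', $path or die "Cannot open $path: $!";
my %count;
while (my $line = <$in>) {
    $count{$_}++ for split ' ', $line;
}
close $in;

# STDOUT already encodes to UTF-8, so no binmode call is needed here.
printf "%-4s %d\n", $_, $count{$_} for sort keys %count;
```

Note that the pragma only sets the default for open() calls in its own lexical scope; handles opened elsewhere (as with File::Temp here) still need their own layer.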

    Update: I just noticed the last line:

    I can get rid of the errors by using binmode STDOUT, ":encoding(utf8)" but the output is still messed up.

    That's what you should be doing. What I suggested above does the same thing. Using :encoding(utf-8) on both your input and your output is indeed the way to go. (utf8 is similar, and will work too.) If your output is messed up, we'll need more info, such as a dump (od -t x1) of the offending data in the input file and a dump of the corresponding messed up output.
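For reading such a dump, here is a small sketch of what to look for. It prints the bytes of ŋ encoded once, and the bytes that result if already-encoded data is encoded a second time. Double encoding is only one possible cause of garbled output and is assumed here purely for illustration; the helper name `hex_dump` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string the way `od -t x1` shows it.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

my $once  = encode('UTF-8', "\x{14B}");   # ŋ encoded correctly
my $twice = encode('UTF-8', $once);       # bytes mistaken for characters, encoded again

print "encoded once:  ", hex_dump($once),  "\n";   # c5 8b
print "encoded twice: ", hex_dump($twice), "\n";   # c3 85 c2 8b
```

If the output dump shows c3 85 c2 8b where the input has c5 8b, the data is probably being encoded twice somewhere; if it shows c5 8b, the bytes are fine and the display side is the more likely suspect.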