ackmanx has asked for the wisdom of the Perl Monks concerning the following question:

print "é ç î ù";

When I run that line in Perl Builder IDE it displays correctly. However, when I run that line in the command prompt of Windows XP I get "Θ τ ε ∙". How can I fix it? My script is useless with that bug.

Thanks

Replies are listed 'Best First'.
Re: Accented characters
by CountOrlok (Friar) on Feb 10, 2006 at 05:28 UTC
    This is a "feature" of the Windows Command Prompt Window. It uses the "DOS: United States" character set and not Unicode. é and ç are hex E9 and E7 respectively in Unicode and Windows:Western.

    If you open up the Windows "Character Map" program and go to the Lucida Console font and choose the "DOS: United States" character set, you will find that the Greek characters Θ and τ map to hex E9 and E7.
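
    The same mapping can be checked from Perl without Character Map. This is only a sketch: it assumes the script emits the Latin-1 bytes E9 E7 EE F9, and it prints code points rather than the characters themselves so the result does not depend on the terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes Perl writes for the Latin-1 string "é ç î ù" (assumption: source saved as Latin-1).
my $bytes = "\xE9 \xE7 \xEE \xF9";

# Interpret the same bytes under the two character sets discussed above.
my $as_western = decode('cp1252', $bytes);    # "Windows: Western"
my $as_dos     = decode('cp437',  $bytes);    # "DOS: United States"

# Print code points only, skipping the separating spaces.
for my $pair ( [ 'cp1252', $as_western ], [ 'cp437', $as_dos ] ) {
    my ($name, $text) = @$pair;
    my @points = map { sprintf 'U+%04X', ord } grep { $_ ne ' ' } split //, $text;
    printf "%-6s %s\n", $name, join ' ', @points;
}
```

    Under cp437 the four bytes come out as U+0398 U+03C4 U+03B5 U+2219, which are the Θ τ ε ∙ from the original question.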

    Solution? I'd say use Cygwin. Or maybe there is a way to tweak the character set for a Command Prompt in the registry, but I do not know of one.

    -imran

Re: Accented characters
by thundergnat (Deacon) on Feb 10, 2006 at 12:06 UTC

    Perl uses ISO Latin-1 (or UTF-8) encoding internally, while Windows cmd.exe still seems to use the DOS code pages. So you need to re-encode your string to the appropriate code page before you print it.

    Assuming the US DOS code page 437:

    use Encode;
    my $string = 'é ç î ù';
    # Convert the bytes in place from Latin-1 to DOS code page 437.
    Encode::from_to($string, 'iso-8859-1', 'cp437');
    print $string;

    If you are using an international version of Windows, you may need to use a different code page.
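
    If the code page might differ, one option is to make it a parameter instead of hard-coding it. A rough sketch: encode_for_console is a made-up helper name, the cp437 default is an assumption, and cp850 is just one example of another DOS code page. On Windows the chcp command reports the active one.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string for a given console code page.
# The cp437 default is an assumption; pass the code page your console really uses.
sub encode_for_console {
    my ($text, $codepage) = @_;
    $codepage = 'cp437' unless defined $codepage;
    return encode($codepage, $text);
}

my $string = "\x{E9} \x{E7} \x{EE} \x{F9}";    # é ç î ù, written as escapes

# Show the resulting bytes in hex for two code pages.
for my $cp (qw(cp437 cp850)) {
    my $bytes = encode_for_console($string, $cp);
    printf "%s: %s\n", $cp, join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```

    In real use you would print the return value of encode_for_console() straight to the console rather than dumping it as hex.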

      Thanks for your help guys, I'll try out those solutions when I get home from school to see if they work.
      Peculiar. I had thought Perl always used UTF-8 internally, even if literal strings are parsed as Latin-1, which they are unless you say use utf8;. By that logic your example shouldn't work, but I tested it and it does (ActivePerl 5.8.6, Windows XP). My own proposal was going to be:
      use Encode;
      my $string = 'é ç î ù';
      $string = Encode::encode('cp437', $string);
      print $string;
      which, curiously enough, also works.
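
      A variation on the same idea, offered as a sketch rather than a tested fix for XP: let a PerlIO encoding layer do the conversion on every print. For a real console that would be binmode STDOUT, ':encoding(cp437)'. Below, an in-memory filehandle stands in for the console (my assumption, purely so the bytes can be inspected).

```perl
use strict;
use warnings;

my $string = "\x{E9} \x{E7} \x{EE} \x{F9}";    # é ç î ù as characters

# For a real console the equivalent would be:
#     binmode STDOUT, ':encoding(cp437)';
# An in-memory handle is used here so the encoded bytes can be examined.
my $captured = '';
open my $fh, '>:encoding(cp437)', \$captured or die "open failed: $!";
print {$fh} $string;
close $fh or die "close failed: $!";

# Dump what the layer actually wrote, byte by byte.
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $captured;
```

      The advantage of the layer is that the rest of the script can keep printing ordinary character strings, with no encode() call at each print.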

        I am certainly no expert in Perl internals. In my understanding, perl treats strings as if they are Latin-1 unless it can't (it contains a character that isn't in Latin-1) or you have set a locale or pragma which forces it otherwise.

        It can be pretty hard to catch perl at this. One way is to use the bizarre (in my opinion) fact that the perl regex engine will recognize a non-breaking space with a \s assertion if the string is encoded as utf-8 but not if it is encoded as Latin-1. Run the following little script to demonstrate. (You don't want to know how much hair I pulled out before I figured this one out...)

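        for my $string ( "<\x{A0}\x{C1}>", "<\x{A0}\x{C1}\x{0100}>" ) {
            print "Has ", $string =~ /\s/ ? 'a' : 'no', " space.\n";
        }

        The strings are written with escapes (\x{A0} is the non-breaking space, \x{C1} is Á) so that they survive copying. The first string prints "Has no space." and the second, which the \x{0100} forces into utf-8, prints "Has a space."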

        All that being said, your solution is probably better than mine since it makes no assumptions about the encoding of the input string.
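
        Another way to catch perl at it, as a sketch: look at the internal flag directly with utf8::is_utf8 and switch the storage with utf8::upgrade, both of which are built in. This assumes default regex semantics, that is, no "use feature 'unicode_strings'" and no /u modifier in effect; has_space is just a helper name I made up.

```perl
use strict;
use warnings;

# Helper (hypothetical name): 1 if \s matches anywhere in the string, else 0.
sub has_space {
    my ($s) = @_;
    return $s =~ /\s/ ? 1 : 0;
}

my $string = "<\x{A0}\x{C1}>";    # NBSP and Á, both within Latin-1

my $flag_before  = utf8::is_utf8($string) ? 1 : 0;
my $match_before = has_space($string);

# Same characters, but now stored internally as UTF-8.
utf8::upgrade($string);

my $flag_after  = utf8::is_utf8($string) ? 1 : 0;
my $match_after = has_space($string);

print "before: utf8 flag=$flag_before, \\s matches=$match_before\n";
print "after:  utf8 flag=$flag_after, \\s matches=$match_after\n";
```

        The string holds the same characters both times; only the internal storage changes, and the \s result changes with it.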

Re: Accented characters
by NetWallah (Canon) on Feb 10, 2006 at 03:27 UTC
    Most likely, you have the command prompt FONT set to "Lucida Console". Change that to "Raster Fonts", and you are all set.

    How to change the font:
    Click on the top-left corner (or hit Alt-SPACE), then select PROPERTIES, and choose the FONT tab. Then select the font.

    Update1: Oops - I can't get that to print correctly anymore. It DID, the first time.

    Update2: This has to do with WIDE characters: chr(130), which is é in the DOS code page, is beyond the normal ASCII character set. I do not have an explanation for the system behaviour yet.

    Update3: Redirect STDOUT to a file, and view the contents in Notepad or other editor. Contents are shown correctly.

         "For every complex problem, there is a simple answer ... and it is wrong." --H.L. Mencken