Accented characters

ackmanx has asked for the wisdom of the Perl Monks concerning the following question:

Re: Accented characters by CountOrlok (Friar) on Feb 10, 2006 at 05:28 UTC
This is a "feature" of the Windows Command Prompt window. It uses the "DOS: United States" character set (code page 437), not Unicode. é and ç are hex E9 and E7 respectively in Unicode and in "Windows: Western". If you open the Windows "Character Map" program, select the Lucida Console font, and choose the "DOS: United States" character set, you will find that hex E9 and E7 map to the Greek characters Θ and τ instead. Solution? I'd say use cygwin. There may also be a way to change the character set for a Command Prompt in the registry, but I don't know of one. -imran
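The mismatch described above can be checked from Perl itself. This is a minimal sketch, assuming the core Encode module and its `cp1252` ("Windows: Western") and `cp437` ("DOS: United States") encodings; the helper name `code_point_in` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the Unicode code point that a single byte
# stands for under the given encoding.
sub code_point_in {
    my ($encoding, $byte) = @_;
    return ord(decode($encoding, $byte));
}

# The two bytes from the post: E9 and E7.
my @bytes = ("\xE9", "\xE7");

for my $byte (@bytes) {
    printf "byte 0x%02X: cp1252 gives U+%04X, cp437 gives U+%04X\n",
        ord($byte),
        code_point_in('cp1252', $byte),
        code_point_in('cp437',  $byte);
}
```

Under cp1252 the bytes come back as U+00E9 (é) and U+00E7 (ç); under cp437 they come back as U+0398 (Θ) and U+03C4 (τ), which matches what Character Map shows.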
Re: Accented characters by thundergnat (Deacon) on Feb 10, 2006 at 12:06 UTC
Perl uses ISO Latin-1 (or UTF-8) encoding internally, while Windows cmd.exe seems to still be using the DOS code pages. So you need to re-encode your string to the appropriate code page before you print it. Assuming the US DOS code page 437: `use Encode; my $string = 'é ç î ù'; Encode::from_to($string, 'iso-8859-1', 'cp437'); print $string;` If you are using an international version of Windows, you may need to use a different code page.
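To see what the re-encoding actually does, it can help to dump the bytes before and after `from_to`. The sketch below is illustrative only: `hex_bytes` is a hypothetical helper, and the accented characters are written as `\x` escapes so that the encoding of the script file itself does not matter.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: space-separated hex dump of a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Latin-1 bytes for "é ç î ù".
my $string = "\xE9 \xE7 \xEE \xF9";
print "Latin-1: ", hex_bytes($string), "\n";

# from_to() converts the bytes in place from one encoding to another.
from_to($string, 'iso-8859-1', 'cp437');
print "cp437:   ", hex_bytes($string), "\n";
```

The Latin-1 bytes E9, E7, EE, F9 become 82, 87, 8C, 97, which are the positions of é, ç, î, and ù in code page 437.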
Re^2: Accented characters by ackmanx (Initiate) on Feb 10, 2006 at 17:00 UTC
Thanks for your help, guys. I'll try out those solutions when I get home from school to see if they work.
Re^2: Accented characters by Errto (Vicar) on Feb 10, 2006 at 19:39 UTC
Peculiar. I had thought Perl always used UTF-8 internally, even if literal strings are parsed as Latin-1, which they are unless you say `use utf8;`. By that logic your example shouldn't work, but I tested it and it does (ActivePerl 5.8.6, Windows XP). My own proposal was going to be: `use Encode; my $string = 'é ç î ù'; $string = Encode::encode('cp437', $string); print $string;` which, curiously enough, also works.
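One way to look at why both approaches work is to run them side by side on the same input and compare the resulting bytes. This is only a sketch, assuming the script's accented characters arrive as Latin-1 bytes (written here as `\x` escapes); the variable names are made up for the comparison.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Latin-1 bytes for "é ç î ù"; each byte value is also the
# character's Unicode code point, since all are below 0x100.
my $latin1 = "\xE9 \xE7 \xEE \xF9";

# Approach 1: byte-to-byte conversion, done in place on a copy.
my $via_from_to = $latin1;
from_to($via_from_to, 'iso-8859-1', 'cp437');

# Approach 2: treat the string as characters and encode them.
my $via_encode = encode('cp437', $latin1);

print $via_from_to eq $via_encode ? "same bytes\n" : "different bytes\n";
```

For characters below U+0100 the Latin-1 byte value and the Unicode code point are the same number, so `encode` sees the same characters that `from_to` gets by decoding the bytes as ISO-8859-1.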
Re^3: Accented characters by thundergnat (Deacon) on Feb 10, 2006 at 21:06 UTC
I am certainly no expert in Perl internals. In my understanding, perl treats strings as if they are Latin-1 unless it can't (the string contains a character that isn't in Latin-1) or you have set a locale or pragma which forces it otherwise. It can be pretty hard to catch perl at this. One way is to use the bizarre (in my opinion) fact that the perl regex engine will recognize a non-breaking space with a \s assertion if the string is encoded as UTF-8 but not if it is encoded as Latin-1. Run the following little script to demonstrate. (You don't want to know how much hair I pulled out before I figured this one out...) `for my $string ( "<\x{A0}\x{C1}>", "<\x{A0}\x{C1}\x{100}>" ) { print "Has ", $string =~ /\s/ ? 'a' : 'no', " space.\n"; }` In those strings, \x{A0} is the non-breaking space and \x{C1} is Á; they are written as escapes in case the literal characters don't show up correctly. The second string adds \x{100}, which is outside Latin-1. All that being said, your solution is probably better than mine since it makes no assumptions about the encoding of the input string.
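A more direct way to peek at which internal representation perl has picked is the `utf8::is_utf8` function, used here with `utf8::upgrade`. This is a sketch for inspection only, assuming perl 5.8.1 or later, where both functions are available without loading any module; the variable names are illustrative.

```perl
use strict;
use warnings;

# Same two strings as in the regex demonstration.
my $latin1 = "<\x{A0}\x{C1}>";         # every code point is below 0x100
my $wide   = "<\x{A0}\x{C1}\x{100}>";  # U+0100 does not fit in one byte

printf "latin1 stored as UTF-8: %s\n", utf8::is_utf8($latin1) ? 'yes' : 'no';
printf "wide stored as UTF-8:   %s\n", utf8::is_utf8($wide)   ? 'yes' : 'no';

# utf8::upgrade() switches the internal representation of a copy
# without changing the characters the string contains.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

printf "upgraded stored as UTF-8: %s\n", utf8::is_utf8($upgraded) ? 'yes' : 'no';
printf "upgraded eq latin1:       %s\n", $upgraded eq $latin1     ? 'yes' : 'no';
```

The upgraded copy compares equal to the original even though it is stored differently, which is why the difference is so hard to spot from ordinary string operations.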
Re: Accented characters by NetWallah (Canon) on Feb 10, 2006 at 03:27 UTC
~~Most likely, you have the command prompt FONT set to "Lucida Console". Change that to "Raster Fonts", and you are all set.~~ How to change the font: click on the top-left corner (or hit Alt-SPACE), then select PROPERTIES and choose the FONT tab. Then select the font.
Update1: Oops - I can't get that to print correctly anymore. It DID, the first time.
Update2: This has to do with WIDE characters: chr(130) = é is beyond the normal ASCII character set. I do not have an explanation for the system behaviour yet.
Update3: Redirect STDOUT to a file, and view the contents in Notepad or another editor. The contents are shown correctly.
"For every complex problem, there is a simple answer ... and it is wrong." --H.L. Mencken