Re: Accented characters

Perl uses ISO Latin-1 (or UTF-8) encoding internally, while Windows cmd.exe still seems to use the DOS code pages. So you need to re-encode your string to the appropriate code page before you print it.

Assuming the US DOS code page 437:

use Encode;
my $string = 'é ç î ù';
Encode::from_to($string, 'iso-8859-1', 'cp437');
print $string;

If you are using an international version of Windows, you may need to use a different code page.
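As a rough sketch of what "a different code page" could look like, the snippet below wraps the conversion in a helper so the target code page becomes a parameter. The helper name `to_console` is made up for illustration, and cp850 (common on Western European installs) is only an assumed example; running `chcp` in cmd.exe shows the code page that is actually active. The string is written with hex escapes so the example does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): convert a Latin-1 byte string
# to the bytes expected by the given console code page.
sub to_console {
    my ($latin1_bytes, $codepage) = @_;
    my $chars = Encode::decode('iso-8859-1', $latin1_bytes);  # bytes -> characters
    return Encode::encode($codepage, $chars);                 # characters -> console bytes
}

my $string = "\xE9 \xE7 \xEE \xF9";        # 'é ç î ù' as Latin-1 bytes
my $out    = to_console($string, 'cp850'); # cp850 is an assumption; check with chcp
print $out, "\n";
```

On a console that really is using cp850, this should display the same accented characters as the cp437 version above, since those four letters sit at the same positions in both code pages. Characters such as Á exist in cp850 but not in cp437, which is where the choice starts to matter.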


Re^2: Accented characters by ackmanx (Initiate) on Feb 10, 2006 at 17:00 UTC
Thanks for your help guys, I'll try out those solutions when I get home from school to see if they work.
Re^2: Accented characters by Errto (Vicar) on Feb 10, 2006 at 19:39 UTC
Peculiar. I had thought Perl always used UTF-8 internally, even if literal strings are parsed as Latin-1, which they are unless you say `use utf8;`. By that logic your example shouldn't work, but I tested it and it does (ActivePerl 5.8.6, Windows XP). My own proposal was going to be:

use Encode;
my $string = 'é ç î ù';
$string = Encode::encode('cp437', $string);
print $string;

which, curiously enough, also works.
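A possible explanation for why both approaches work, offered as a sketch rather than a statement about the internals: for a string that only holds code points up to 0xFF and has not been upgraded, the Latin-1 bytes and the characters are the same values, so `from_to` (which reads bytes) and `encode` (which reads characters) see the same input. The variable names here are illustrative, and the string uses a hex escape to avoid depending on the script file's encoding.

```perl
use strict;
use warnings;
use Encode ();

my $latin1 = "\xE9";    # 'é' as a single Latin-1 byte

# from_to() interprets the scalar as bytes in the source encoding
# and rewrites it in place in the target encoding.
my $in_place = $latin1;
Encode::from_to($in_place, 'iso-8859-1', 'cp437');

# encode() interprets the scalar as characters and returns a new byte string.
my $encoded = Encode::encode('cp437', $latin1);

print "from_to: ", unpack('H*', $in_place), "\n";
print "encode:  ", unpack('H*', $encoded),  "\n";
```

Both lines should show the same cp437 byte for é. The two calls would be expected to diverge once the input holds UTF-8 encoded bytes or characters outside Latin-1, which is where `encode` on a properly decoded string is the safer choice.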
Re^3: Accented characters by thundergnat (Deacon) on Feb 10, 2006 at 21:06 UTC
I am certainly no expert in Perl internals. As I understand it, perl treats strings as Latin-1 unless it can't (the string contains a character that isn't in Latin-1) or you have set a locale or pragma that forces it to do otherwise. It can be pretty hard to catch perl at this. One way is to use the bizarre (in my opinion) fact that the perl regex engine will match a non-breaking space with \s if the string is encoded as UTF-8 but not if it is encoded as Latin-1. Run the following little script to demonstrate. (You don't want to know how much hair I pulled out before I figured this one out...)

for my $string ( "<\x{A0}\x{C1}>", "<\x{A0}\x{C1}\x{0100}>" ) {
    print "Has ", $string =~ /\s/ ? 'a' : 'no', " space.\n";
}

The strings are written with escapes so that the non-breaking space (\x{A0}) and Á (\x{C1}) survive copying; the second one adds \x{0100}, which is outside Latin-1. All that being said, your solution is probably better than mine since it makes no assumptions about the encoding of the input string.
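A more direct way to peek at the internal representation, as a minimal sketch: the built-in `utf8::is_utf8` function reports whether a scalar is currently stored with the UTF-8 flag on, and `utf8::upgrade` switches a string to that representation without changing its characters. This only inspects the same two strings as the script above; it does not show how the regex engine treats them.

```perl
use strict;
use warnings;

# The same two strings as in the \s test: the second contains U+0100,
# which does not fit in Latin-1.
my @strings = ( "<\x{A0}\x{C1}>", "<\x{A0}\x{C1}\x{0100}>" );

# 1 if perl stores the string as UTF-8 internally, 0 otherwise.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } @strings;

for my $i ( 0 .. $#strings ) {
    printf "String %d is stored as %s.\n",
        $i + 1, $flags[$i] ? 'UTF-8' : 'bytes (Latin-1)';
}

# Upgrading changes the storage, not the characters.
my $upgraded = $strings[0];
utf8::upgrade($upgraded);
print "Upgraded copy is ", ( $upgraded eq $strings[0] ? 'still equal' : 'different' ), ".\n";
```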