elef has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm asking for your help in handling accented letters in the console in Windows.

I'm writing a programme that needs to use user input, both filenames (via drag and drop) and text. Everything is fine if the filenames and text are ASCII, but if they contain accented letters like á, ő, or è then everything goes to hell.

Here's a code snippet that saves text input in a file named text.txt to see if accented letters are corrupted and tests if it can find a file with funny characters in its name or path.

This works fine for me on Ubuntu with any input and it works on Windows with ASCII input, but it fails with accented letters on Windows XP. Files with characters like í in their name are not found and I just get stuff like \x82\xA0\xFB\xA3\xA1 in the output txt, along with "does not map to Unicode" errors. If I remove the binmode STDIN line, the only difference is that I get mojibake in the output file, but still nothing works.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;

    binmode STDIN, ':encoding(UTF-8)';

    print "Type some funny characters, they will be saved in text.txt\n";
    chomp (my $text = <STDIN>);
    print "\n\nYou typed: $text\n";

    # open (OUT, ">:encoding(UTF-8)", "text.txt") or die "Can't open file: $!";
    open (OUT, ">", "text.txt") or die "Can't open file: $!";
    print OUT "$text\n";
    close OUT;

    my ($inputfile_full, $folder, $inputfile, $inputfile_noext, $ext);

    do {
        print "\nDrag and drop the input file here and press enter.\n";
        chomp ($inputfile_full = <STDIN>);

        # strip any leading and trailing spaces and single or double quotes
        $inputfile_full =~ /^ *[\"\']?(.*)[\/\\]([^\"\']*)[\"\']? *$/;
        # $1 = everything up to last / or \, $2 = everything from there on to the end except ",' and spaces at the end
        $folder = $1;
        $inputfile = $2;
        $inputfile =~ /(.*)\.(.*)/;
        $inputfile_noext = $1;
        $ext = $2;

        # strip quotes
        $inputfile_full =~ s/^ *[\"\']?([^\"\']*)[\"\']? *$/$1/;

        print "\nThe file doesn't seem to exist\nFilename: $inputfile\nPath: $folder\n\nTry again!\n\n" unless (-e "$folder/$inputfile");
    } until (-e "$folder/$inputfile");

    print "\nFile (${inputfile}) found.\nPath: $folder\nPress enter to continue\n";
    <STDIN>;

Re: Character encoding in console in Windows
by Anonymous Monk on Sep 11, 2010 at 18:49 UTC
    Using binmode on STDIN doesn't change how the console (cmd.exe) behaves. Unless you've configured cmd.exe to supply UTF-8, you won't get UTF-8 ... Also, Perl on Win32 needs something like Win32::Unicode for Unicode filenames.
      Thank you for your answer.

      I was hoping this problem was not inherent in CMD.exe...
      Are you saying that there is no way for a user to drag and drop a file with a non-ASCII name into the console window and have the perl script open that file? Or would installing Win32::Unicode fix this automagically?

      "Fixing" cmd.exe is not an option as this script is not intended for my own use. It's an open source script that will mostly be used by computer illiterate people.

      It's not really clear to me how the module you linked works... would I need to change my filehandle open and print commands or would it do its thing in the background without the need to change the script itself?

        I was hoping this problem was not inherent in CMD.exe... Are you saying that there is no way for a user to drag and drop a file with a non-ASCII name into the console window and have the perl script open that file?

        To my knowledge, you've never been able to drag & drop a file into a console window and have anything happen.

        It's not a Perl "problem". Not a cmd.exe "problem". Just something that has never been designed in. Where did you get the idea it was possible?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Character encoding in console in Windows
by ikegami (Patriarch) on Sep 13, 2010 at 16:01 UTC

    binmode STDIN, ':encoding(UTF-8)';

    First, you're assuming the terminal uses UTF-8. That's unlikely on Windows and not necessarily true on unix. chcp will tell you the current code page on a Windows system (usually cp1252).

    Second, Perl's file operators expect file names to be strings of bytes. If you decode them, you'll need to re-encode them.

    Finally, if you have to deal with files whose names contain characters that don't exist in your code page, you'll encounter a third problem. See Re: stat() and utf8 filenames on Win32 fails for me, why? for a bit on that.
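
    A minimal sketch of the second point, not a tested fix for the script in the question. It assumes the console hands over cp437 bytes and the file operators want cp1252 bytes; both code pages are assumptions and differ between systems and locales. The input is decoded to characters for text processing, then re-encoded to bytes before it is used as a file name.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Assumed encodings: adjust both to what the target system really uses.
my $console_enc = 'cp437';     # encoding of the bytes read from STDIN (assumption)
my $fs_enc      = 'cp1252';    # encoding the file operators expect (assumption)

# Raw bytes as they might arrive from the console: "\xA1" is í in cp437.
my $raw = "\xA1.txt";

# Decode to characters so that length(), regexes, uc() etc. see real letters.
my $name = decode($console_enc, $raw);

# Re-encode to bytes before passing the name to -e, open() and friends.
my $fs_bytes = encode($fs_enc, $name);

# %vX prints the ordinal of each byte in hex, joined by dots.
printf "characters: %d, file-system bytes: %vX\n", length($name), $fs_bytes;
```

    Keeping the decoded and the encoded name in separate variables makes it harder to pass the wrong one to a file operator by accident.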

      First, you're assuming the terminal uses UTF-8.

      Well, binmode STDIN was a first shot in the dark to see if it fixes or changes anything, not a carefully analysed solution. I know it doesn't work.

      Anyway, it looks like there is no one-liner to solve this, so I'm calling it a day. (For example, your writeup says "Decode the file name from whatever encoding your source uses"... well, I have a horrible feeling that the encoding from CMD.exe will differ based on the localization of the OS, so there is no solution that will work for every Windows computer.) Even if there is a way to do this, it sounds like it would take more research than it's worth.
      It's quite odd though that there is no simple, tried and tested universal solution for what I'd call pretty basic functionality. It just goes to show what an inexcusable, horrid mess encoding is in general.

      Even if we were to forget about opening files based on input from the console and just hardcode the path into the perl script, it looks like one would need to use Win32API::File and at least createFile and OsFHandleOpen, or God knows what else. Half of your post on this went right over my head, to be honest.

Re: Character encoding in console in Windows
by nikosv (Deacon) on Sep 13, 2010 at 11:08 UTC

    It seems that the argv[] of the console works with non-Unicode character sets; this means that it uses the OEM or ANSI set that the console is currently set to.

    Check this superb article on Unicode and the console: Unicode CMD Code Page Checkup. For your case, look at point 6.

Re: Character encoding in console in Windows
by ikegami (Patriarch) on Sep 14, 2010 at 19:13 UTC

    You probably don't need a general solution, in which case I was overcomplicating things. Take two.

    If you're entering the file name by dragging and dropping, Windows should be encoding them using the console's code page, which means it's already properly encoded for use by open(). Don't do any decoding, and it should work.

    If you follow the above advice and you also want to be able to enter the file name using STDIN or via @ARGV, you'll have to enter it as it shows up in the results of dir. It will be impossible to enter some exotic file names this way (without using the file's short name as shown in dir /x).

    If this is unacceptable, let me know which aspects of it aren't.

    If it doesn't work, run the following program, and give me the output and the name of the file as you see it in explorer ("My Computer" or the likes).


        use strict;
        use warnings;

        use Data::Dumper qw( Dumper );

        print("Enter file path> ");
        chomp( my $qfn = <STDIN> );

        {
            local $Data::Dumper::Useqq  = 1;
            local $Data::Dumper::Terse  = 1;
            local $Data::Dumper::Indent = 0;
            print('$qfn=', Dumper($qfn), "\n");
        }

        open(my $fh, '<', $qfn)
            or die("open $qfn: $!\n");

      Thanks for the more dumb-proof explanation, although I'm afraid it's still not dumb-proof enough for me.

      If you're entering the file name by dragging and dropping, Windows should be encoding them using the console's code page, which means it's already properly encoded for use by open(). Don't do any decoding, and it should work.
      Drag & drop would indeed be sufficient for me, although I'm not sure if there's any difference between typing a path in the console and having it pasted in with drag & drop. Either way, I certainly don't need @ARGV filename input at all. You say drag and drop should work with plain old open(), but this is exactly what I originally tried and it failed, hence this thread. To make things clear, here's a simplified script showing what I do. It doesn't even parse the file name for path/filename/extension, it just strips whitespace and quotes and tests if the perl script can find the file.

          #!/usr/bin/perl
          use strict;
          use warnings;

          print "Drop input file in console\n";
          chomp (my $file = <STDIN>);
          $file =~ s/ *["']?([^"']*)["']? *$/$1/;    # strip whitespace and quotes

          if (-e "$file") {print "\nFile found, everything is fine.\nFile:>$file<"}
          else {print "\nOoops, file not found.\nFile: >$file<"}
          <STDIN>;

      If I drop in a file called i.txt, it finds it, so far so good. If I drop in í.txt from the same folder, the script can't find it. open (FILE, "<:encoding(UTF-8)", "$file"); fails on í.txt just the same as (-e "$file"). Are you saying this should work out of the box? Or should I use one of the modules? Win32::Unicode::File? Or Win32API::File?
      This command in Win32::Unicode::File seems to promise to do what I want: my $fh = Win32::Unicode::File->new($mode, $file_name); # create an instance and open the file
      but it's certainly not just a plain open() like you say. I haven't installed the module so I haven't tried it.

      Here's the output from your script, if I start it in the folder where í.txt is saved:

          Enter file path> í.txt
          $qfn="\241.txt"
          open í.txt: No such file or directory

            >perl -MDevel::Peek -e"chomp($_=<STDIN>); Dump($_); open($fh, '<', $_) or die; print <$fh>"
            C:\Users\ikegami\í.txt
            SV = PV(0xeaf80) at 0x2f59c8
              REFCNT = 1
              FLAGS = (POK,pPOK)
              PV = 0x2f3788 "C:\\Users\\ikegami\\\241.txt"\0
              CUR = 22
              LEN = 80
            Died at -e line 1, <STDIN> line 1.

        Is that encoded for my ANSI code page (1252) or my OEM code page (437)?

            >perl -MDevel::Peek -MEncode -e"Dump(encode('cp1252', chr(0xED)))
            SV = PV(0x2fa570) at 0x27b210
              REFCNT = 1
              FLAGS = (TEMP,POK,pPOK)
              PV = 0x32c24f8 "\355"\0
              CUR = 1
              LEN = 8

            >perl -MDevel::Peek -MEncode -e"Dump(encode('cp437', chr(0xED)))
            SV = PV(0x30a570) at 0x28b210
              REFCNT = 1
              FLAGS = (TEMP,POK,pPOK)
              PV = 0x32c24f8 "\241"\0
              CUR = 1
              LEN = 8

        So open() expects the name to be encoded using the ANSI code page, but it's coming from STDIN in the OEM code page.

            >perl -MDevel::Peek -MEncode=from_to -e"chomp($_=<STDIN>); from_to($_, 'cp437', 'cp1252'); Dump($_); open($fh, '<', $_) or die; print <$fh>"
            C:\Users\ikegami\í.txt
            SV = PV(0x159adc0) at 0x315b58
              REFCNT = 1
              FLAGS = (POK,pPOK)
              PV = 0x344d258 "C:\\Users\\ikegami\\\355.txt"\0
              CUR = 22
              LEN = 80
            ok
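
        The same conversion could be wrapped in a small helper for use in a script, sketched here. The name console_to_ansi is made up for illustration, and the default code pages (cp437 for the console, cp1252 for ANSI) are assumptions taken from the machine in this test; other systems will use different ones.

```perl
use strict;
use warnings;
use Encode qw( from_to );

# Hypothetical helper: turn a path typed or dropped into the console
# (OEM code page) into the bytes that open() and -e expect (ANSI code page).
# The default code pages are assumptions and differ between systems.
sub console_to_ansi {
    my ($path, $oem, $ansi) = @_;
    $oem  ||= 'cp437';
    $ansi ||= 'cp1252';

    $path =~ s/^\s*["']?//;    # strip leading whitespace and an opening quote
    $path =~ s/["']?\s*$//;    # strip a closing quote and trailing whitespace/newline

    from_to($path, $oem, $ansi);    # converts the bytes in place
    return $path;
}

# "\xA1" is í in cp437; after conversion it becomes "\xED", í in cp1252.
my $converted = console_to_ansi(qq{"C:\\Users\\ikegami\\\xA1.txt"\n});
my $status    = -e $converted ? 'found' : 'not found';
print "$status\n";
```

        In a real script the argument would come from <STDIN> rather than a literal, and the two code pages would have to be determined for the machine the script runs on.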

        In broad strokes, the OEM code page is the encoding used by console apps, ANSI for others. Now how do you get those code pages? Good question.
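
        One possible answer, offered as an untested sketch: the Win32 module that ships with Windows builds of Perl documents Win32::GetConsoleCP() (console input code page) and Win32::GetACP() (ANSI code page). This assumes a version of the module recent enough to provide them. The helper name code_pages and the fallback values used on other platforms are made up for illustration.

```perl
use strict;
use warnings;

# Sketch: ask Windows for the code pages instead of hard-coding them.
# On non-Windows systems the returned values are assumed defaults, not detected.
sub code_pages {
    if ($^O eq 'MSWin32') {
        require Win32;
        # Both calls return a numeric code page identifier such as 437 or 1252.
        # GetConsoleCP() can return 0 when no console is attached; that case
        # is not handled here.
        return ( 'cp' . Win32::GetConsoleCP(), 'cp' . Win32::GetACP() );
    }
    return ( 'cp437', 'cp1252' );    # assumption for illustration only
}

my ($console_enc, $ansi_enc) = code_pages();
print "console input: $console_enc, ANSI: $ansi_enc\n";
```

        The resulting names could be passed to from_to() in place of the literal 'cp437' and 'cp1252', provided Encode knows the code page in question.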

        You always catch me away from my Windows machine! I'll do some testing tonight.