Re: Character encoding in console in Windows
by Anonymous Monk on Sep 11, 2010 at 18:49 UTC
Using binmode on STDIN doesn't change how the console (cmd.exe) behaves. Unless you've configured cmd.exe to supply UTF-8, you won't get UTF-8. Also, perl on Win32 needs something like Win32::Unicode to handle Unicode file names.
| [reply] |
Thank you for your answer.
I was hoping this problem was not inherent in CMD.exe...
Are you saying that there is no way for a user to drag and drop a file with a non-ASCII name into the console window and have the perl script open that file? Or would installing Win32::Unicode fix this automagically?
"Fixing" cmd.exe is not an option as this script is not intended for my own use. It's an open source script that will mostly be used by computer illiterate people.
It's not really clear to me how the module you linked works... would I need to change my filehandle open and print commands or would it do its thing in the background without the need to change the script itself?
| [reply] |
I was hoping this problem was not inherent in CMD.exe...
Are you saying that there is no way for a user to drag and drop a file with a non-ASCII name into the console window and have the perl script open that file?
To my knowledge, you've never been able to drag & drop a file into a console window and have anything happen.
It's not a perl "problem". Not a cmd.exe "problem". It's just something that has never been designed in. Where did you get the idea it was possible?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re: Character encoding in console in Windows
by ikegami (Patriarch) on Sep 13, 2010 at 16:01 UTC
binmode STDIN, ':encoding(UTF-8)';
First, you're assuming the terminal uses UTF-8. That's unlikely on Windows and not necessarily true on unix. chcp will tell you the console's current code page on a Windows system (typically an OEM code page such as cp437 or cp850, rather than the ANSI code page, which is usually cp1252).
Second, Perl's file operators expect file names to be strings of bytes. If you decode them, you'll need to re-encode them.
Finally, if you have to deal with files whose names contain characters that don't exist in your code page, you'll encounter a third problem. See Re: stat() and utf8 filenames on Win32 fails for me, why? for a bit on that.
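To illustrate the second point, here is a minimal sketch of the decode-then-re-encode round trip using the core Encode module. The two encoding names are assumptions picked for illustration (cp437 for the console, cp1252 for the file operators); they will differ from machine to machine. `hex_dump` is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Assumptions for illustration only: the console supplies cp437 bytes
# and the file operators want cp1252 bytes.
my $console_enc = 'cp437';
my $file_enc    = 'cp1252';

# Hypothetical helper: show each character/byte of a string in hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $raw_input = "\xA1.txt";                        # bytes as they arrive on STDIN
my $name      = decode($console_enc, $raw_input);  # string of characters
my $for_open  = encode($file_enc, $name);          # string of bytes for open(), -e, ...

print 'from console: ', hex_dump($raw_input), "\n";  # A1 2E 74 78 74
print 'for open():   ', hex_dump($for_open),  "\n";  # ED 2E 74 78 74
```

The decoded `$name` is what you would use for any text processing; only the re-encoded `$for_open` should be handed to the file operators.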
| [reply] [d/l] [select] |
First, you're assuming the terminal uses UTF-8.
Well, binmode STDIN was a first shot in the dark to see if it fixes or changes anything, not a carefully analysed solution. I know it doesn't work.
Anyway, it looks like there is no one-liner to solve this, so I'm calling it a day. (For example, your writeup says Decode the file name from whatever encoding your source uses... well, I have a horrible feeling that the encoding from CMD.exe will differ based on the localization of the OS, so there is no solution that will work for every Windows computer.) Even if there is a way to do this, it sounds like it would take more research than it's worth. It's quite odd, though, that there is no simple, tried and tested universal solution for what I'd call pretty basic functionality. It just goes to show what an inexcusable, horrid mess encoding is in general.
Even if we were to forget about opening files based on input from the console and just hardcode the path into the perl script, it looks like one would need to use Win32API::File and at least CreateFile and OsFHandleOpen, or God knows what else. Half of your post on this went right over my head, to be honest.
| [reply] |
Re: Character encoding in console in Windows
by nikosv (Deacon) on Sep 13, 2010 at 11:08 UTC
It seems that the argv[] of the console works with non-Unicode character sets; this means that it uses the OEM or ANSI character set that the console is currently set to.
Check this superb article on Unicode and the console:
Unicode CMD Code Page Checkup For your case look at point 6
| [reply] |
Re: Character encoding in console in Windows
by ikegami (Patriarch) on Sep 14, 2010 at 19:13 UTC
You probably don't need a general solution, in which case I was overcomplicating things. Take two.
If you're entering the file name by dragging and dropping, Windows should be encoding them using the console's code page, which means it's already properly encoded for use by open(). Don't do any decoding, and it should work.
If you follow the above advice and you also want to be able to enter the file name using STDIN or via @ARGV, you'll have to enter it as it shows up in the results of dir. It will be impossible to enter some exotic file names this way (without using their short file names as shown in dir /x).
If this is unacceptable, let me know which aspects of it aren't acceptable.
If it doesn't work, run the following program, and give me the output and the name of the file as you see it in explorer ("My Computer" or the likes).
use strict;
use warnings;
use Data::Dumper qw( Dumper );
print("Enter file path> ");
chomp( my $qfn = <STDIN> );
{
    local $Data::Dumper::Useqq  = 1;
    local $Data::Dumper::Terse  = 1;
    local $Data::Dumper::Indent = 0;
    print('$qfn=', Dumper($qfn), "\n");
}
open(my $fh, '<', $qfn)
    or die("open $qfn: $!\n");
| [reply] [d/l] [select] |
Thanks for the more dumb-proof explanation, although I'm afraid it's still not dumb-proof enough for me.
If you're entering the file name by dragging and dropping, Windows should be encoding them using the console's code page, which means it's already properly encoded for use by open(). Don't do any decoding, and it should work.
Drag&drop would indeed be sufficient for me, although I'm not sure if there's any difference between typing a path in the console and having it pasted in with drag & drop. Either way, I certainly don't need @ARGV filename input at all.
You say drag and drop should work with plain old open(), but this is exactly what I originally tried and it failed, hence this thread. To make things clear, here's a simplified script showing what I do. It doesn't even parse the file name for path/filename/extension, it just strips whitespace and quotes and tests if the perl script can find the file.
#!/usr/bin/perl
use strict;
use warnings;
print "Drop input file in console\n";
chomp (my $file = <STDIN>);
$file =~ s/ *["']?([^"']*)["']? *$/$1/; # strip whitespace and quotes
if (-e "$file") {
    print "\nFile found, everything is fine.\nFile:>$file<";
} else {
    print "\nOoops, file not found.\nFile: >$file<";
}
<STDIN>;
If I drop in a file called i.txt, it finds it, so far so good. If I drop in í.txt from the same folder, the script can't find it. open (FILE, "<:encoding(UTF-8)", "$file"); fails on í.txt just the same as (-e "$file").
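For reference, the clean-up step from the script above can be isolated into a small function. This is only a sketch: `clean_dropped_path` is a hypothetical name, and it assumes the console wraps a dropped path in double quotes when the path contains spaces. It strips only double quotes, because an apostrophe is a legal character in a Windows file name while a double quote is not. It does nothing about the encoding problem.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy a line read from STDIN after a drag and drop.
sub clean_dropped_path {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;   # trim leading/trailing whitespace, including the newline
    $line =~ s/^"(.*)"$/$1/;   # remove one pair of surrounding double quotes, if present
    return $line;
}

print clean_dropped_path(qq{ "C:\\My Docs\\i.txt" \n}), "\n";
```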
Are you saying this should work out of the box? Or should I use one of the modules? Win32::Unicode::File? Or Win32API::File?
This command in Win32::Unicode::File seems to promise to do what I want:
my $fh = Win32::Unicode::File->new($mode, $file_name); # create an instance and open the file
but it's certainly not just a plain open() like you say. I haven't installed the module so I haven't tried it.
Here's the output from your script, if I start it in the folder where í.txt is saved:
Enter file path> í.txt
$qfn="\241.txt"
open í.txt: No such file or directory
| [reply] [d/l] [select] |
>perl -MDevel::Peek -e"chomp($_=<STDIN>); Dump($_); open($fh, '<', $_) or die; print <$fh>"
C:\Users\ikegami\í.txt
SV = PV(0xeaf80) at 0x2f59c8
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2f3788 "C:\\Users\\ikegami\\\241.txt"\0
CUR = 22
LEN = 80
Died at -e line 1, <STDIN> line 1.
Is that encoded for my ANSI code page (1252) or my OEM code page (437)?
>perl -MDevel::Peek -MEncode -e"Dump(encode('cp1252', chr(0xED)))"
SV = PV(0x2fa570) at 0x27b210
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x32c24f8 "\355"\0
CUR = 1
LEN = 8
>perl -MDevel::Peek -MEncode -e"Dump(encode('cp437', chr(0xED)))"
SV = PV(0x30a570) at 0x28b210
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x32c24f8 "\241"\0
CUR = 1
LEN = 8
So open() expects the name to be encoded using the ANSI code page, but it's coming from STDIN in the OEM code page.
>perl -MDevel::Peek -MEncode=from_to -e"chomp($_=<STDIN>); from_to($_, 'cp437', 'cp1252'); Dump($_); open($fh, '<', $_) or die; print <$fh>"
C:\Users\ikegami\í.txt
SV = PV(0x159adc0) at 0x315b58
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x344d258 "C:\\Users\\ikegami\\\355.txt"\0
CUR = 22
LEN = 80
ok
In broad strokes, the OEM code page is the encoding used by console apps, ANSI for others. Now how do you get those code pages? Good question.
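One possible way to look the code pages up instead of hardcoding them is sketched below. It assumes the installed Win32 module is recent enough to provide Win32::GetConsoleCP(), Win32::GetOEMCP() and Win32::GetACP(); `code_pages` and `console_to_ansi` are hypothetical helper names, and the cp437/cp1252 fallback is purely illustrative so the sketch also runs off Windows. Prefixing the number with "cp" gives a valid Encode name for common code pages such as 437, 850 and 1252, but not for every code page (65001, for example).

```perl
use strict;
use warnings;
use Encode qw( from_to );

# Hypothetical helper: returns (console encoding, ANSI encoding) as Encode names.
sub code_pages {
    if ($^O eq 'MSWin32') {
        require Win32;
        # GetConsoleCP() returns 0 when no console is attached; fall back to the OEM page.
        my $console = Win32::GetConsoleCP() || Win32::GetOEMCP();
        return ('cp' . $console, 'cp' . Win32::GetACP());
    }
    return ('cp437', 'cp1252');   # illustrative values only, not detected
}

# Hypothetical helper: re-encode bytes read from the console for use by open().
sub console_to_ansi {
    my ($bytes) = @_;             # works on a copy; from_to() converts in place
    my ($from, $to) = code_pages();
    from_to($bytes, $from, $to) if $from ne $to;
    return $bytes;
}

my ($console_enc, $ansi_enc) = code_pages();
print "console: $console_enc, ANSI: $ansi_enc\n";
```

With these helpers, the one-liner above would become something like `chomp(my $qfn = <STDIN>); open(my $fh, '<', console_to_ansi($qfn)) or die $!;`. File names containing characters that the ANSI code page cannot represent are still out of reach this way.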
| [reply] [d/l] [select] |
You always catch me away from my Windows machine! I'll do some testing tonight.
| [reply] |