The character sets of filesystems are one of the (few) blind spots in Perl's support for character sets. Perl (and Unix, go figure) considers a filename to be a null-terminated sequence of bytes without further encoding, and Perl normally uses the byte-oriented APIs. Windows likely treats the filename as UTF-16 or as some other charset, so there will be an API mismatch when you input the filename to your Perl script and it then hands that name over to the byte-oriented APIs.
You can try to transcode the filename using Encode, or you can resort to using the 8.3 filenames (if they are enabled for the filesystem/partition in question). You can also try to use the "wide" API (OpenFileW instead of the Perl default OpenFileA), which accepts UTF-16 strings (I believe). The drawback of the wide APIs is that the rest of Perl, like filehandles, does not know what to do with the results you get back from them.
The most likely successful approach is to use pattern matching and readdir to find the files that interest you(r program). Hopefully the byte-oriented results of readdir() will still work when passed to open().
Update: cdarke points out that OpenFile* are for 16-bit compatibility only. CreateFile* are the way to do it under the newer versions. tye's and demerphq's Win32API::File provides the functions to use those calls, and the functions even return Perl filehandles!
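A minimal sketch of that CreateFileW route, assuming Win32API::File is installed and the filename is already a decoded Perl character string. The helper names wide_path and open_unicode_for_read are made up for illustration; CreateFileW, OsFHandleOpen and the constants come from Win32API::File. The module is loaded at run time, so only the second function is Windows-specific.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Symbol qw(gensym);

# Turn a Perl character string into the NUL-terminated UTF-16LE byte
# string that the wide ("W") Win32 calls expect.
sub wide_path {
    my ($path) = @_;
    return encode( 'UTF-16LE', $path . "\0" );
}

# Open a file for reading by its Unicode name and return a Perl filehandle.
# Windows only: Win32API::File is required here, not at compile time.
sub open_unicode_for_read {
    my ($path) = @_;
    require Win32API::File;
    my $h = Win32API::File::CreateFileW(
        wide_path($path),
        Win32API::File::GENERIC_READ(),
        Win32API::File::FILE_SHARE_READ(),
        [],                                  # default security attributes
        Win32API::File::OPEN_EXISTING(),
        0,                                   # no special flags
        [],                                  # no template handle
    ) or die "CreateFileW failed: $^E\n";

    # Wrap the native handle in an ordinary Perl filehandle.
    my $fh = gensym;
    Win32API::File::OsFHandleOpen( $fh, $h, 'r' )
        or die "OsFHandleOpen failed: $^E\n";
    return $fh;
}
```

On Windows, something like my $fh = open_unicode_for_read($name); should then let you read with the usual <$fh>, provided $name really holds characters rather than already-encoded bytes.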
| [reply] |
I've no idea what it actually says, but the ActiveState PerlJP page might assist you with your issues.
| [reply] |
Given that you are using perl's readdir, and you are getting some sort of string as a result, you might want to try a little test script to show the actual byte values that are being used in the file names. Something like this would do -- and while we're at it, let's check to see if the string returned by readdir can actually be used to get information about the file and open it:
#!/usr/bin/perl
use strict;
use warnings;
(@ARGV == 1 and -d $ARGV[0]) or die "Usage: $0 pathname\n";
my $dir = shift;
opendir( my $dh, $dir ) or die "$dir: $!";
while ( defined( my $name = readdir $dh ) ) {
    next if $name =~ /^\.\.?$/;
    # hex dump of the raw bytes readdir returned for this name
    print join( " ", map { sprintf( " %02x", ord($_) ) } split //, $name );
    # readdir returns bare names, so prepend the directory before testing
    my $path = "$dir/$name";
    if ( -f $path ) {
        open( my $in, '<', $path ) or do { warn "$path: $!"; print "\n"; next };
        my $sum = 0;
        $sum += length() while (<$in>);
        close $in;
        printf( " : %d bytes (%d read)\n", -s $path, $sum );
    }
    elsif ( -d _ ) {
        print " : directory\n";
    }
    else {
        print " : not sure what this is\n";
    }
}
closedir $dh;
If you don't know how to use the information that comes out of that, post a reply with a few examples of the output for non-ASCII file names (look for lines containing hex numerics greater than " 7f ").
You might also want to take a bunch of these odd-ball file names (as fetched by readdir) and concatenate them (with spaces between them) into a single long string, and pass that to the "guess_encoding" function provided by Encode::Guess -- if the characters really are non-unicode Asian or some form of unicode, there's a good chance it'll give you a correct answer, which you can then use with Encode's "decode" function, to turn the strings into perl-internal utf8 (in case that's helpful for anything). | [reply] [d/l] |
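The Encode::Guess suggestion above can be sketched roughly as follows. The sample name is a hypothetical Shift_JIS byte string standing in for what readdir might return, and the suspect list is only an assumption; use whatever encodings are plausible for your files. guess_names_encoding is a made-up helper name.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# Guess the encoding shared by a list of raw file names (byte strings).
# Returns (encoding object, undef) on success or (undef, message) on failure.
sub guess_names_encoding {
    my ( $names, @suspects ) = @_;
    my $sample = join ' ', @$names;
    my $guess  = guess_encoding( $sample, @suspects );
    return ref $guess ? ( $guess, undef ) : ( undef, $guess );
}

# Hypothetical sample: Shift_JIS bytes of a Japanese file name.
my @names = ("\x93\xfa\x96\x7b\x8c\xea.txt");
my ( $enc, $err ) =
    guess_names_encoding( \@names, qw(euc-jp shiftjis 7bit-jis) );

if ($enc) {
    # Decode into Perl's internal character representation.
    my $decoded = $enc->decode( join ' ', @names );
    printf "guessed %s, %d characters\n", $enc->name, length $decoded;
}
else {
    print "could not guess: $err\n";
}
```

If several suspects fit the bytes equally well, guess_encoding returns an error string naming the candidates instead of an object, which is why the result is checked with ref before decoding.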
Ok, found this, which works for readdir functionality.
use Win32::OLE qw(in);
use Encode;
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);
#Input: -dir to read files from
#Output: -array ref with files
sub ReadDirWithWin32OLE {
my $dir = shift;
#backslashes only in dir
$dir=~s-\/-\\-g;
#remove last backslash
$dir=~s-\\\s*$--;
if (not -e $dir) {
warn "dir ($dir) does not exist";
return;
}
my $fso = Win32::OLE->new("Scripting.FileSystemObject");
#won't work if $dir contains unicode chars :(
my $folder = $fso->GetFolder($dir);
if (!$folder) {
warn "Problem creating Win32::OLE (folder) object";
return;
}
my @filesFound = ();
foreach my $file (in $folder->Files) {
my $shortFilename = $file->ShortName;
#my $shortFilename = $file->Name;
$shortFilename = $dir . "\\" . $shortFilename;
if (-e $shortFilename) {
print "\nFile Found", $shortFilename;
push @filesFound, $shortFilename;
}
else {
print "\nFILE NOT FOUND!! (this should not happen):", $shortFilename;
}
}
return \@filesFound;
}
Filename examples:
file1_刚形变.txt
file2_ מדהימה .txt
BUT...:
1) If the directory path ($dir) contains weird unicode chars it won't work?!
2) We're forced to use short filenames?! The Win32API::File trick mentioned by Corion, used here, didn't seem to work with "weird file2" (see above)?!
3) Still no way to open specific files with unicode chars if they are dragged and dropped into a perl/tk window (but that is maybe a whole different topic?).
@graff:
readdir gives us the ? char (ascii 63) instead of *any* weird unicode char. | [reply] [d/l] |
The perl 5.10 todo wish list states that functions like chdir, opendir, readdir, readlink, rename, rmdir etc. "could potentially accept Unicode filenames either as input or output".
Windows' default encoding is UTF-16LE, but the console 'dir' command will only return ANSI names. Thus unicode characters are replaced with "?", even if you invoke the console using the unicode switch (cmd.exe /u), change the codepage to 65001 (which is utf8 on windows) and use the Lucida Console TrueType font, which supports unicode.
A workaround is to use the COM facilities provided by windows (in this case Scripting.FileSystemObject), which provide a much higher level of abstraction, or to use the API as pointed out in this thread.
Prompted by your query, I tried to read a file with japanese characters in the filename which resides in the current folder and then move the file to another folder.
The filename is "は群馬県高崎市を拠点に、様々なメディ.txt". Just create a new file and copy/paste this as a filename. (I don't know what it means, I just googled for 'japanese' and this turned up, so don't flame me if it means something bad!!) You also have to have the appropriate fonts.
Since opendir, readdir, rename etc. do not support unicode, you have to resort to the Scripting.FileSystemObject methods and properties, which accept unicode.
This is the actual code :
use strict;
use warnings;
use Win32::OLE qw(in);
use Devel::Peek;
# CP_UTF8 is very important as it translates between Perl strings and
# Unicode strings used by the OLE interface
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());
my $obj = Win32::OLE->new('Scripting.FileSystemObject');
my $folder = $obj->GetFolder(".");
my $collection = $folder->{Files};
# create the target folder unless it is already there
-d "c:\\newfolder" or mkdir("c:\\newfolder") or die "mkdir: $!";
foreach my $value (in $collection) {
    my $filename = $value->{Name};
    next if $filename !~ /\.txt$/i;
    Dump($filename);    # check if the utf8 flag is on
    my $file = $obj->GetFile($filename);
    $file->Move("c:\\newfolder\\$filename");
    print( Win32::OLE->LastError() || "success\n\n" );
}
What puzzles me is that you say that you don't see the correct filename using Explorer when you should. This will only work if you have the asian languages support (regional settings) enabled, and then you should be able to see the japanese name in Explorer as above.
| [reply] [d/l] |
Thanks for your tip. I was searching for the same question online and found this page.
With the above code, I got $file->Move to work with a CJK filename. The problem I have is that if $path is a path containing utf8,
$folder = $obj->GetFolder($path);
does not seem to work, while if it is in Big5 it works...
Any suggestion?
Thanks!
| [reply] |
Can you be more specific? Maybe provide a code sample and point out where the actual problem is?
Try the following script, which just gets all subdirectories and prints their names out. Does it work for the directory in question?
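use strict;
use warnings;
use Win32::OLE qw(in);
Win32::OLE->Option(CP => Win32::OLE::CP_UTF8());
my $obj = Win32::OLE->new('Scripting.FileSystemObject');
my $folder = $obj->GetFolder(".");
my $collection = $folder->{SubFolders};
foreach my $value (in $collection) {
    my $foldername = $value->{Name};
    # try to reopen each subfolder by name to see whether GetFolder accepts it
    my $subfolder = $obj->GetFolder($foldername);
    # LastError gives the error number in numeric context, the message as a string
    my $err = Win32::OLE->LastError();
    print( ( 0 + $err ? "$foldername: $err" : $foldername ), "\n" );
}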
| [reply] [d/l] |