Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to process all files in a directory with readdir. The problem is with several filenames that contain characters which look like "square boxes" (when viewing the directory contents in explorer.exe). When I copy and paste one of those filenames into some other editor here, I see a bunch of Japanese (or Chinese?) characters instead of "square boxes" (which is how the filenames were supposed to look, I guess). When I look at these filenames in the console (cmd.exe), the files are listed with a bunch of ??? characters.

Anyway, perl doesn't want to open any of these files. So how can we access these files from perl? (This is with Activestate perl v5.8.8 build 817 on winxp).

Thanks.

  • Comment on Opening files with japanese/chinese chars in filename

Re: Opening files with japanese/chinese chars in filename
by Corion (Patriarch) on Jan 24, 2008 at 16:50 UTC

    The character sets of filesystems are one of the (few) blind spots in Perl's support for character sets. Perl (and Unix, go figure) considers a filename to be a null-terminated sequence of bytes without further encoding, and Perl normally uses the byte-oriented APIs. Windows likely treats the filename as UTF-16 or as some other charset, so there will be an API mismatch when you input the filename to your Perl script and it then turns over that name to the byte-oriented APIs.

    You can try to transcode the filename using Encode, or you can resort to using the 8.3 filenames (if they are enabled for the filesystem/partition in question). You can also try to use the "wide" API (OpenFileW instead of the Perl default OpenFileA), which accepts UTF-16 strings (I believe). The drawback of the wide APIs is that the rest of Perl, like filehandles, does not know what to do with the results you get back from them.

    The most likely successful approach is to use pattern matching and readdir to find the files that interest you(r program). Hopefully the byte-oriented results of readdir() will still work when passed to open().

    Update: cdarke points out that OpenFile* are for 16-bit compatibility only. CreateFile* are the way to do it under the newer versions. tye's and demerphq's Win32API::File provide the functions to use those calls and the functions even return Perl filehandles!
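A minimal sketch of the wide-API route mentioned above, assuming Win32API::File is installed and that its CreateFileW and OsFHandleOpen behave as documented. The helper names to_wide and open_unicode_file are made up for illustration, the file name is one of the examples posted in this thread, and the script has to be saved as UTF-8 because of the literal name. Treat it as a starting point rather than a tested recipe for every Windows or Perl build.

```perl
use strict;
use warnings;
use utf8;                      # this source file contains a literal non-ASCII name
use Encode qw(encode);
use Symbol qw(gensym);

# Hypothetical helper: convert a Perl character string into the
# NUL-terminated UTF-16LE byte string that the "W" calls expect.
sub to_wide {
    my ($name) = @_;
    return encode( 'UTF-16LE', $name . "\0" );
}

# Hypothetical helper: open a file for reading by its Unicode name and
# return an ordinary Perl filehandle, or nothing on failure. Windows only.
sub open_unicode_file {
    my ($name) = @_;
    require Win32API::File;    # loaded lazily; only available on Windows

    my $native = Win32API::File::CreateFileW(
        to_wide($name),
        Win32API::File::GENERIC_READ(),
        Win32API::File::FILE_SHARE_READ(),
        [],                                   # NULL security attributes
        Win32API::File::OPEN_EXISTING(),
        0,                                    # no extra flags
        0,                                    # no template handle
    );
    if ( !$native ) {
        warn "CreateFileW failed: $^E\n";
        return;
    }

    # Wrap the native handle in a Perl filehandle opened for reading.
    my $fh = gensym();
    if ( !Win32API::File::OsFHandleOpen( $fh, $native, 'r' ) ) {
        warn "OsFHandleOpen failed: $^E\n";
        Win32API::File::CloseHandle($native);
        return;
    }
    return $fh;
}

if ( $^O eq 'MSWin32' ) {
    # Assumes this file exists in the current directory.
    if ( my $fh = open_unicode_file("file1_刚形变.txt") ) {
        my $lines = 0;
        $lines++ while <$fh>;
        close $fh;
        print "read $lines lines\n";
    }
}
```

The name has to reach the script as a real Perl character string (a literal, or something decoded from UTF-8 or UTF-16 input); a name that has already been squeezed through the ANSI code page cannot be converted back.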

Re: Opening files with japanese/chinese chars in filename
by Erez (Priest) on Jan 24, 2008 at 17:06 UTC
Re: Opening files with japanese/chinese chars in filename
by graff (Chancellor) on Jan 25, 2008 at 07:09 UTC
    Given that you are using perl's readdir, and you are getting some sort of string as a result, you might want to try a little test script to show the actual byte values that are being used in the file names. Something like this would do -- and while we're at it, let's check to see if the string returned by readdir can actually be used to get information about the file and open it:
    #!/usr/bin/perl
    use strict;
    use warnings;

    ( @ARGV == 1 and -d $ARGV[0] ) or die "Usage: $0 pathname\n";
    my $dir = shift;
    opendir( my $dh, $dir ) or die "$dir: $!";
    while ( defined( my $name = readdir $dh ) ) {
        next if $name =~ /^\.\.?$/;

        # show every byte of the name as hex
        print join( " ", map { sprintf( " %02x", ord($_) ) } split //, $name );

        # readdir returns bare names, so prepend the directory before testing/opening
        my $path = "$dir/$name";
        if ( -f $path ) {
            open( my $in, '<', $path ) or do { warn "$path: $!"; next };
            my $sum = 0;
            $sum += length() while (<$in>);
            close $in;
            printf( " : %d bytes (%d read)\n", -s _, $sum );
        }
        elsif ( -d _ ) {
            print " : directory\n";
        }
        else {
            print " : not sure what this is\n";
        }
    }
    closedir $dh;
    If you don't know how to use the information that comes out of that, post a reply with a few examples of the output for non-ASCII file names (look for lines containing hex numerics greater than " 7f ").

    You might also want to take a bunch of these odd-ball file names (as fetched by readdir) and concatenate them (with spaces between them) into a single long string, and pass that to the "guess_encoding" function provided by Encode::Guess -- if the characters really are non-unicode Asian or some form of unicode, there's a good chance it'll give you a correct answer, which you can then use with Encode's "decode" function, to turn the strings into perl-internal utf8 (in case that's helpful for anything).
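A rough sketch of the Encode::Guess suggestion, under a few assumptions: the helper name guess_name_encoding is made up, the "file names" are stand-ins encoded as EUC-JP inside the script instead of coming from readdir, and 'euc-jp' is just one possible suspect list. If readdir has already replaced the characters with "?", there is nothing left to guess from, so this only helps when the original bytes survive.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains literal Japanese text
use Encode qw(encode);
use Encode::Guess;        # exports guess_encoding

# Hypothetical helper: guess the encoding of raw file-name bytes.
# @suspects are tried in addition to the module's default suspects.
# Returns an Encode encoding object, or undef if the guess failed
# or was ambiguous.
sub guess_name_encoding {
    my ( $bytes, @suspects ) = @_;
    my $guess = guess_encoding( $bytes, @suspects );
    return ref($guess) ? $guess : undef;    # a plain string is an error message
}

# Stand-in for names collected from readdir: two names encoded as EUC-JP.
my @raw_names = map { encode( 'euc-jp', $_ ) } ( "日本語.txt", "群馬県.txt" );

# Concatenate the names so the guesser has more bytes to work with.
my $sample = join " ", @raw_names;

if ( my $enc = guess_name_encoding( $sample, 'euc-jp' ) ) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "guessed: ", $enc->name, "\n";
    print "decoded: ", $enc->decode($_), "\n" for @raw_names;
}
else {
    print "could not guess the encoding\n";
}
```

Listing several similar suspects at once (for example both euc-jp and shiftjis) can make the guess ambiguous, in which case guess_encoding returns a message instead of an object and the helper returns undef.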

      OK, found this, which works for the readdir functionality.

      use Win32::OLE qw(in);
      use Encode;
      Win32::OLE->Option(CP => Win32::OLE::CP_UTF8);

      #Input:  -dir to read files from
      #Output: -array ref with files
      sub ReadDirWithWin32OLE {
          my $dir = shift;

          #backslashes only in dir
          $dir =~ s-\/-\\-g;
          #remove last backslash
          $dir =~ s-\\\s*$--;

          if (not -e $dir) {
              warn "dir ($dir) does not exist";
              return;
          }
          my $fso = Win32::OLE->new("Scripting.FileSystemObject");

          #won't work if $dir contains unicode chars :(
          my $folder = $fso->GetFolder($dir);
          if (!$folder) {
              warn "Problem creating Win32::OLE (folder) object";
              return;
          }
          my @filesFound = ();
          foreach my $file (in $folder->Files) {
              my $shortFilename = $file->ShortName;
              #my $shortFilename = $file->Name;
              $shortFilename = $dir . "\\" . $shortFilename;
              if (-e $shortFilename) {
                  print "\nFile Found", $shortFilename;
                  push @filesFound, $shortFilename;
              }
              else {
                  print "\nFILE NOT FOUND!! (this should not happen):", $shortFilename;
              }
          }
          return \@filesFound;
      }
      Filenames examples:
      file1_刚形变.txt
      file2_ מדהימה .txt

      BUT...:
      1) If the directory path ($dir) contains weird unicode chars it won't work?!
      2) We're forced to use short filenames?! The Win32API::File trick, as mentioned by Corion, used here, didn't seem to work with "weird file2" (see above)?!
      3) Still no way to open specific files with unicode chars if they are dragged and dropped into a perl/tk window (but that is maybe a whole different topic?).

      @graff:
      readdir gives us the ? char (ascii 63) instead of *any* weird unicode char.

Re: Opening files with japanese/chinese chars in filename
by nikosv (Deacon) on Feb 12, 2008 at 08:30 UTC
    The Perl 5.10 todo wish list states that functions like chdir, opendir, readdir, readlink, rename, rmdir, etc.
    "could potentially accept Unicode filenames either as input or output".
    Windows' default encoding is UTF-16LE, but the console 'dir' command will only return ANSI names. Thus Unicode characters are replaced with "?",
    even if you invoke the console using the Unicode switch (cmd.exe /u), change the codepage to 65001 (which is UTF-8 on Windows),
    and use the Lucida Console TrueType font, which supports Unicode.
    A workaround is to use the COM facilities provided by Windows (in this case Scripting.FileSystemObject), which provide a much higher level of abstraction,
    or to use the API as pointed out in this thread.
    Prompted by your query, I tried to read a file with Japanese characters in the filename, which resides in the current folder, and then move the file to another folder.
    The filename is "は群馬県高崎市を拠点に、様々なメディ.txt"
    Just create a new file and copy/paste this as a filename. (I don't know what it means, I just googled for 'japanese' and this turned up, so don't flame me if it means something bad!)
    You also have to have the appropriate fonts. Since opendir, readdir, rename, etc. do not support Unicode, you have to resort to the Scripting.FileSystemObject methods and properties, which accept Unicode.
    This is the actual code:
    use strict;
    use warnings;
    use Win32::OLE qw(in);
    use Devel::Peek;

    # CP_UTF8 is very important as it translates between Perl strings and
    # Unicode strings used by the OLE interface
    Win32::OLE->Option( CP => Win32::OLE::CP_UTF8() );
    my $obj        = Win32::OLE->new('Scripting.FileSystemObject');
    my $folder     = $obj->GetFolder(".");
    my $collection = $folder->{Files};
    mkdir("c:\\newfolder") || die "mkdir failed: $!";
    foreach my $value ( in $collection ) {
        my $filename = $value->{Name};
        next if $filename !~ /\.txt$/i;    # only handle .txt files
        Dump("$filename");                 # check if the utf8 flag is on
        my $file = $obj->GetFile("$filename");
        $file->Move("c:\\newfolder\\$filename");
        print( Win32::OLE->LastError() || "success\n\n" );
    }

    What puzzles me is that you say that you don't see the correct filename in Explorer, when you should.
    That will only work if you have Asian language support (regional settings) enabled; then you should be able to see the Japanese name in Explorer as above.
      Thanks for your tip; I was searching for the same question online and found this page.
      With the above code, I got $file->Move to work with CJK filenames. The problem I have is that if $path is a path containing UTF-8, $folder = $obj->GetFolder($path);
      does not seem to work, while if it is in Big5 it works...

      Any suggestion? Thanks!

        Can you be more specific? Maybe provide a code sample and point out where the actual problem is?

        Try the following script which just gets all subdirectories and prints their name out.
        Does it work for the directory in question?

        use Win32::OLE qw(in);
        Win32::OLE->Option( CP => Win32::OLE::CP_UTF8() );
        my $obj        = Win32::OLE->new('Scripting.FileSystemObject');
        my $folder     = $obj->GetFolder(".");
        my $collection = $folder->{SubFolders};
        foreach my $value ( in $collection ) {
            my $foldername = $value->{Name};
            # re-fetch each subfolder by name to see whether GetFolder accepts it
            my $subfolder = $obj->GetFolder("$foldername");
            print( Win32::FormatMessage( Win32::OLE->LastError() ) || "$foldername" );
        }