chaoslawful has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I use Tk's getOpenFile method to select a file. However, the path string it returns is UTF-8 encoded (without being marked as a UTF-8 string), and Perl's file functions (open, unlink, etc.) pass the path straight to the system call without considering the current character encoding. The system call therefore fails when the current encoding isn't UTF-8 and the pathname contains non-ASCII characters.
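To make the failure mode concrete, here is a minimal sketch of the transcoding step that would be needed before calling open or unlink. The helper name path_to_os_bytes is hypothetical, and the OS encoding is assumed to be known already and passed in by the caller:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert the unflagged UTF-8 bytes returned by
# getOpenFile into bytes in the OS encoding, ready for open()/unlink().
# $os_encoding (e.g. "cp936") is an assumption supplied by the caller.
sub path_to_os_bytes {
    my ($utf8_bytes, $os_encoding) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    return encode($os_encoding, $chars);        # characters -> OS bytes
}
```

The result could then be handed to open in place of the raw getOpenFile return value. This only helps if every character in the name exists in the OS encoding.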

To work around this, I need to detect the OS's current character encoding. One option is POSIX::setlocale, but it is awkward on Win32, because the encoding part of the locale string is the number of the current codepage rather than a standard encoding name.

Although this is easy to work around (just prepend "cp" to the number), I still wonder whether there is a generic (and elegant) way to detect the OS's current encoding across multiple platforms (*nix, Win32, MacOS, etc.). Or is there a general module on CPAN that can manipulate Unicode pathnames on many different OSes?
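The "prepend cp" workaround can be sketched roughly as follows. The helper name encoding_from_locale is hypothetical, and the parsing assumes the usual "language_territory.codeset" shape of locale strings, which is not guaranteed on every platform:

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: pull the encoding part out of a locale string.
sub encoding_from_locale {
    my ($locale) = @_;
    return undef unless defined $locale;
    # The codeset follows the dot; ignore any "@modifier" suffix.
    return undef unless $locale =~ /\.([^.@]+)(?:@.*)?$/;
    my $codeset = $1;
    # Win32 reports a bare codepage number, e.g. "936" -> "cp936".
    return $codeset =~ /^\d+$/ ? "cp$codeset" : $codeset;
}

# setlocale(LC_CTYPE, '') adopts the locale from the environment and
# returns its name (or undef on failure).
my $enc = encoding_from_locale(setlocale(LC_CTYPE, ''));
print defined $enc ? "$enc\n" : "unknown\n";
```

On systems that provide nl_langinfo, the core module I18N::Langinfo offers langinfo(CODESET) as a more direct alternative, so a portable version would probably have to pick between the two approaches per platform.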

Any suggestion is welcome. Thanks a lot! :)


Replies are listed 'Best First'.
Re: How to detect the OS's current encoding?
by fenLisesi (Priest) on Mar 01, 2007 at 14:24 UTC
    I am not sure whether it is applicable to your particular problem, but you seem to be in the general vicinity of Encode::Detect, which, as a side note, my colleague was unable to build on Win32 last week. He will give it another try. fenLisesi turns toward the Monks. If there are users of this module, would you care to share your experience with it, please? Cheers.
Re: How to detect the OS's current encoding?
by dk (Chaplain) on Mar 01, 2007 at 10:32 UTC
    Let's review. Suppose you have a file with a 1-byte name, "\xa1". IIUC, when getOpenFile returns this file, it returns a 2-byte string, something like "\xc2\xa1", without the utf8 flag. Is that correct? In that case getOpenFile seems to be flawed, and you should probably find out how to make it return the exact file name without converting it to UTF-8.
      Yep. I have actually tried getOpenFile, Tk::FBox and Tk::FileSelect, and only Tk::FileSelect returns the exact path without transcoding it to UTF-8. However, it shows a Tk-style selection dialog that can't display CJK characters correctly. Tk::FBox is worse: it also shows a Tk-style dialog, but appears unable to handle CJK characters in the path (an error message appears when you select one). Only getOpenFile opens the OS's native file selection dialog and displays non-Latin-1 characters.
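    The difference between the two strings can be checked with a short sketch using the core utf8:: functions. The byte values below are illustrative, assuming a Latin-1 file name and the UTF-8 encoding of the same character (U+00A1):

```perl
use strict;
use warnings;

my $native = "\xa1";        # one byte, as stored in a Latin-1 file name
my $octets = "\xc2\xa1";    # UTF-8 encoding of U+00A1, no UTF-8 flag

printf "flagged: %s, length: %d\n",
    (utf8::is_utf8($octets) ? 'yes' : 'no'), length($octets);

# utf8::decode converts in place and returns false on malformed input.
my $chars = $octets;
utf8::decode($chars) or die "not valid UTF-8";

printf "flagged: %s, length: %d, ord: 0x%X\n",
    (utf8::is_utf8($chars) ? 'yes' : 'no'), length($chars), ord($chars);

# After decoding, the string is the same single character as the native name.
print "same character\n" if $chars eq $native;
```

    Before decoding, the string is two unflagged bytes; afterwards it is one flagged character that compares equal to the original 1-byte name.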

      In my opinion, a generic way to get the OS's current character encoding would be very helpful not only for pathname transcoding but also for many I18N applications. :)

Re: How to detect the OS's current encoding?
by zentara (Cardinal) on Mar 01, 2007 at 13:23 UTC
    I'm not sure if I understand the question, but this may help. I asked something similar a while back, when I had encoding problems finding files in a directory. graff showed me this:

        # This decode-utf8 routine is used so that filenames with extended
        # ASCII characters (Unicode) will work properly.
        use Encode;
        opendir my $dh, $path or warn "Error: $!";
        my @files = grep !/^\.\.?$/, readdir $dh;
        closedir $dh;
        # @files = map { "$path/".$_ } sort @files;
        # $_ = decode( 'utf8', $_ ) for ( @files );
        @files = map { decode( 'utf8', "$path/".$_ ) } sort @files;

    I don't know if Tk's getOpenFile is buggy or not, but graff said that once you pass the filename through decode, Perl will tag it as unicode and do the right thing. Maybe you could make your own "custom-file-dialog" that preprocesses the dirlist with decode?

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      I'm afraid a custom file dialog would not get around the encoding problem, because we would still need to know the OS's current encoding to decode pathnames into UTF-8 so that Tk widgets display them correctly. After all, Tk 804.xx always uses UTF-8 strings internally.
        I'm still in a state of semi-confusion about all of this encoding stuff. My understanding of the problem I had was that the program that saved the file didn't use the right encoding for its name, so when my local system tried to read it, I would see a two-letter sequence in place of the Unicode character. graff's method of decoding would automagically convert the filenames so that my system recognized them.
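      As a rough sketch of what such a dialog would have to do with each directory entry: the helper name decode_filename is hypothetical, the OS encoding is assumed to be supplied by the caller, and trying strict UTF-8 first is only a heuristic that can misfire on legacy-encoded names that happen to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: turn file-name bytes into a Perl character string
# suitable for display. Tries strict UTF-8 first (a heuristic), then
# falls back to $os_encoding, which the caller must already know.
sub decode_filename {
    my ($bytes, $os_encoding) = @_;
    my $copy  = $bytes;   # decode() with a CHECK value may modify its input
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return $chars if defined $chars;
    return decode($os_encoding, $bytes);
}
```

      The fallback branch is exactly where the OS's current encoding is still required, which is why a custom dialog alone does not remove the need to detect it.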

        It isn't hard to make your own custom file selector dialog, and if you do, it would be easy to run decode on the files first. Maybe you could sub-class getOpenFile, to do a decode routine first?

        See Re: problems with extended ascii characters in filenames if you are interested.

        And there is the often-given tip, which I don't quite understand:

        $Tk::encodeFallback=1

        Then again, I probably don't even understand your problem, :-)

