Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a form that accepts file uploads. A user is free to upload files with any names they like. Most file names will be in English, but some will have names with East Asian characters (Chinese, Korean, etc.).

I need to save these files to a Windows server. And I must keep their original names.

Is it necessary for Perl to know the language of the string? Is it possible to "sniff out" the language?

Thanks for any ideas on how to implement this. ikegami has already provided much help with this code:

use strict;
use warnings;

use Encode         qw( encode );
use Symbol         qw( gensym );
use Win32API::File qw( CreateFileW OsFHandleOpen CREATE_ALWAYS GENERIC_WRITE );

my $qfn = chr(0x263a);   # Whatever

my $win32f = CreateFileW(
    encode('UCS-2le', $qfn),
    GENERIC_WRITE,    # For writing
    0,                # Not shared
    [],               # Security attributes
    CREATE_ALWAYS,    # Create and replace
    0,                # Special flags
    [],               # Permission template
) or die("CreateFile: $^E\n");

OsFHandleOpen( my $fh = gensym(), $win32f, 'w' )
    or die("OsFHandleOpen: $^E\n");

print $fh "Foo!\n";

Replies are listed 'Best First'.
Re: Converting East Asian strings
by CountZero (Bishop) on Apr 14, 2009 at 05:49 UTC
    Another solution would be to transliterate the filenames into good ol' plain ASCII with Text::Unidecode. You would lose the exact filename, but it would give you a valid one.

    If you keep a table somewhere in a database for translating between the original name and the transliterated file name, your web-site can still show the original name (and retrieve the file with its transliterated filename).
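    The suggestion above might look like this — a minimal sketch, where the sample filename and the in-memory hash standing in for the database table are illustrative only:

```perl
use strict;
use warnings;
use Text::Unidecode;    # transliterates Unicode text to plain ASCII

# Hypothetical uploaded filename containing CJK characters.
my $original = "\x{4f60}\x{597d}-report.txt";

# Transliterate to ASCII, then squash anything unsafe to underscores
# so the result is a valid Windows filename.
my $safe = unidecode($original);
$safe =~ s/[^A-Za-z0-9._-]+/_/g;

# Stand-in for the database table mapping safe names back to originals;
# the web site would look up $original here when displaying the file.
my %name_map = ( $safe => $original );
```

    The site stores the file under `$safe` and shows `$name_map{$safe}` to the user, which is the layer of indirection described above.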

    Any problem in computer science can be solved with another layer of indirection. (David Wheeler)

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Converting East Asian strings
by Lu. (Hermit) on Apr 14, 2009 at 13:39 UTC
    You can try to "sniff out" the encoding with Encode::Guess, if you know the range of encodings you will be confronted with.
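    For instance — a sketch, where the sample bytes are fabricated and the candidate list is an assumption; in practice you would list the encodings your users are likely to send:

```perl
use strict;
use warnings;
use Encode qw( encode );
use Encode::Guess;    # heuristic encoding detection

# Pretend these octets arrived from the browser; here we fabricate
# them by encoding a known string ("Japanese") as UTF-8.
my $octets = encode('UTF-8', "\x{65e5}\x{672c}\x{8a9e}");

# Guess among a short list of candidate encodings.
my $enc = guess_encoding($octets, qw( euc-jp shiftjis utf8 ));
die "Can't guess: $enc" unless ref $enc;    # a plain string means failure or ambiguity

my $filename = $enc->decode($octets);       # now a proper Perl character string
```

    Note that guess_encoding returns an error string (not an object) when several candidates match, so the `ref` check is essential.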
Re: Converting East Asian strings
by Burak (Chaplain) on Apr 14, 2009 at 18:02 UTC
    I wonder why we don't have something like this in core, or at least in IO::File. Has anyone suggested it and been rejected by the porters?

      As far as I'm aware, there are some plans for redoing Perl's filename handling, but the ugly problem is that Windows is the only OS that has something remotely resembling a statement about the encoding of filenames. Unixish operating systems use the "native" filename encoding, that is, simple octet streams, and there is no easy way for Perl to find out what encoding a filename uses. I don't know whether OS X at least states the encoding of a filename, or simply uses UTF-8. And let's not start talking about network shares...

        ...but there is no easy way for Perl to find out what encoding a filename uses.

        Not sure why Perl would have to find out the encoding of filenames. Last time I needed such a thing, I would have been perfectly happy to have a pragma - something like use filenames "UTF-8", or so - in order to manually specify the encoding being used with filenames on the system in question.

        I mean, there's no need to auto-determine anything here. Perl doesn't try to determine the encoding of a file's contents either, or the encoding of the script source... For that we have IO layers and use utf8; (for the source) or use encoding "...";. In other words, here too it's the responsibility of the programmer to specify all encodings being used. Why should filenames be handled differently?
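        The mechanisms alluded to here already exist for file contents — the encoding is stated explicitly via an IO layer on open(). A minimal sketch (the filename demo.txt is illustrative only):

```perl
use strict;
use warnings;

my $file = 'demo.txt';    # illustrative scratch file

# The contents encoding is declared explicitly with an IO layer;
# Perl never tries to guess it.
open my $out, '>:encoding(UTF-8)', $file or die "open: $!";
print $out chr(0x263a), "\n";    # written to disk as UTF-8 octets
close $out;

open my $in, '<:encoding(UTF-8)', $file or die "open: $!";
my $line = <$in>;                # decoded back to a Perl character string
close $in;

unlink $file;
```

        A hypothetical use filenames "UTF-8"; pragma, as proposed above, would extend exactly this explicit-declaration model from file contents to file names.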