Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a form that accepts file uploads. A user is free to upload files with any names they like. Most file names will be in English, but some will have names with East Asian characters (Chinese, Korean, etc.).

I need to save these files to a Windows server. And I must keep their original names.

Is it necessary for Perl to know the language of the string? Is it possible to "sniff out" the language?

Thanks for any ideas on how to implement this. ikegami has already provided much help with this code:

use strict;
use warnings;

use Encode         qw( encode );
use Symbol         qw( gensym );
use Win32API::File qw( CreateFileW OsFHandleOpen CREATE_ALWAYS GENERIC_WRITE );

my $qfn = chr(0x263a);   # Whatever

my $win32f = CreateFileW(
    encode('UCS-2le', $qfn),
    GENERIC_WRITE,    # For writing
    0,                # Not shared
    [],               # Security attributes
    CREATE_ALWAYS,    # Create and replace
    0,                # Special flags
    [],               # Permission template
) or die("CreateFile: $^E\n");

OsFHandleOpen( my $fh = gensym(), $win32f, 'w' )
    or die("OsFHandleOpen: $^E\n");

print $fh "Foo!\n";

Replies are listed 'Best First'.
Re: Converting East Asian strings
by CountZero (Bishop) on Apr 14, 2009 at 05:49 UTC
    Another solution would be to transliterate the filenames into good ol' plain ASCII with Text::Unidecode. You would lose the exact filename, but it would give you a valid one.

    If you keep a table somewhere in a database for translating between the original name and the transliterated file name, your web-site can still show the original name (and retrieve the file with its transliterated filename).
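    The suggestion above might look like this — a minimal sketch, where the sample filename and the in-memory hash standing in for the database table are illustrative only:

```perl
use strict;
use warnings;
use Text::Unidecode;    # transliterates Unicode text to plain ASCII

# Hypothetical uploaded filename containing CJK characters.
my $original = "\x{4f60}\x{597d}-report.txt";

# Transliterate to ASCII, then squash anything unsafe to underscores
# so the result is a valid Windows filename.
my $safe = unidecode($original);
$safe =~ s/[^A-Za-z0-9._-]+/_/g;

# Stand-in for the database table mapping safe names back to originals;
# the web site would look up $original here when displaying the file.
my %name_map = ( $safe => $original );
```

    The site stores the file under `$safe` and shows `$name_map{$safe}` to the user, which is the layer of indirection described above.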

    Any problem in computer science can be solved with another layer of indirection. (David Wheeler)

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Converting East Asian strings
by Lu. (Hermit) on Apr 14, 2009 at 13:39 UTC
    You can try to "sniff out" the encoding with Encode::Guess, if you know the range of encodings you will be confronted with.
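    For instance — a sketch, where the sample bytes are fabricated and the candidate list is an assumption; in practice you would list the encodings your users are likely to send:

```perl
use strict;
use warnings;
use Encode qw( encode );
use Encode::Guess;    # heuristic encoding detection

# Pretend these octets arrived from the browser; here we fabricate
# them by encoding a known string ("Japanese") as UTF-8.
my $octets = encode('UTF-8', "\x{65e5}\x{672c}\x{8a9e}");

# Guess among a short list of candidate encodings.
my $enc = guess_encoding($octets, qw( euc-jp shiftjis utf8 ));
die "Can't guess: $enc" unless ref $enc;    # a plain string means failure or ambiguity

my $filename = $enc->decode($octets);       # now a proper Perl character string
```

    Note that guess_encoding returns an error string (not an object) when several candidates match, so the `ref` check is essential.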
Re: Converting East Asian strings
by Burak (Chaplain) on Apr 14, 2009 at 18:02 UTC
    I wonder why we don't have something like this in core, or at least in IO::File. Has anyone suggested it and been rejected by the porters?

      As far as I'm aware, there are some plans for redoing Perl's filename handling, but the ugly problem is that Windows is the only OS that has something remotely resembling a statement about the encoding of filenames. Unixish operating systems use the "native" filename encoding, that is, simple octet streams, and there is no easy way for Perl to find out what encoding a filename uses. I don't know whether OS X at least states the encoding of a filename, or simply uses UTF-8. And let's not start talking about network shares...

        ...but there is no easy way for Perl to find out what encoding a filename uses.

        Not sure why Perl would have to find out the encoding of filenames. Last time I needed such a thing, I would have been perfectly happy to have a pragma - something like use filenames "UTF-8", or so - in order to manually specify the encoding being used with filenames on the system in question.

        I mean, there's no need to auto-determine anything here. Perl doesn't try to determine the encoding of a file's contents either, or the encoding of the script source... For that we have IO layers and use utf8; (for the source) or use encoding "...";. In other words, here too it's the responsibility of the programmer to specify all encodings being used. Why should filenames be handled differently?
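        The mechanisms alluded to here already exist for file contents — the encoding is stated explicitly via an IO layer on open(). A minimal sketch (the filename demo.txt is illustrative only):

```perl
use strict;
use warnings;

my $file = 'demo.txt';    # illustrative scratch file

# The contents encoding is declared explicitly with an IO layer;
# Perl never tries to guess it.
open my $out, '>:encoding(UTF-8)', $file or die "open: $!";
print $out chr(0x263a), "\n";    # written to disk as UTF-8 octets
close $out;

open my $in, '<:encoding(UTF-8)', $file or die "open: $!";
my $line = <$in>;                # decoded back to a Perl character string
close $in;

unlink $file;
```

        A hypothetical use filenames "UTF-8"; pragma, as proposed above, would extend exactly this explicit-declaration model from file contents to file names.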