Unicode File Names

John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

I noticed that ActiveState's Perl does not support "Wide" operating system calls in the same way as it used to.

I was hoping that it would "just work", transparently. But the perlunicode man page says that open et al. use byte strings, "...or UTF-8 strings if the encoding pragma has been used."

The purpose of the encoding pragma is to state how to transcode byte strings to Unicode when upgrading them, as in mixed concatenation. What does that have to do with wide system calls?
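To make "upgrading in mixed concatenation" concrete, here is a minimal sketch. It does not use the encoding pragma itself; it only contrasts Perl's default upgrade (bytes treated as ISO-8859-1) with an explicit `Encode::decode`. The choice of CP1251 as the "real" meaning of the byte is an assumption made purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE9";       # a single byte with no encoding attached
my $wide  = "\x{263A}";   # a character above 0xFF

# Mixed concatenation: Perl upgrades $bytes as if it were ISO-8859-1,
# so byte 0xE9 becomes the character U+00E9.
my $mixed = $bytes . $wide;
printf "implicit upgrade: U+%04X\n", ord substr($mixed, 0, 1);

# Stating what the byte means (assumed here to be CP1251) yields a
# different character for the same byte.
my $decoded = decode('cp1251', $bytes) . $wide;
printf "explicit decode:  U+%04X\n", ord substr($decoded, 0, 1);
```

Either way, this concerns only how byte strings become character strings inside Perl. Nothing in it selects which operating system call receives a file name.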

More to the point, how do I use the encoding pragma to make these functions change semantics? The documentation for the encoding pragma is clear, and it describes nothing like that.

--John


Replies are listed 'Best First'.
Re: Unicode File Names by BrowserUk (Patriarch) on Feb 08, 2005 at 19:55 UTC
Have you seen what replaced the old `perl -C` option?

    -C [number/list]
        The -C flag controls some of the Perl Unicode features.

        As of 5.8.1, the -C can be followed either by a number or a list
        of option letters. The letters, their numeric values, and effects
        are as follows; listing the letters is equal to summing the
        numbers.

            I     1   STDIN is assumed to be in UTF-8
            O     2   STDOUT will be in UTF-8
            E     4   STDERR will be in UTF-8
            S     7   I + O + E
            i     8   UTF-8 is the default PerlIO layer for input streams
            o    16   UTF-8 is the default PerlIO layer for output streams
            D    24   i + o
            A    32   the @ARGV elements are expected to be strings
                      encoded in UTF-8
            L    64   normally the "IOEioA" are unconditional, the L makes
                      them conditional on the locale environment variables
                      (the LC_ALL, LC_CTYPE, and LANG, in the order of
                      decreasing precedence) -- if the variables indicate
                      UTF-8, then the selected "IOEioA" are in effect

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
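The "listing the letters is equal to summing the numbers" rule can be checked with a short sketch. The helper names `c_number` and `c_letters` are hypothetical and not part of Perl; the values are copied from the table quoted above.

```perl
use strict;
use warnings;

# Letter values as given in the quoted -C documentation.
my %VALUE = (I => 1, O => 2, E => 4, S => 7, i => 8,
             o => 16, D => 24, A => 32, L => 64);

# Hypothetical helper: combine a letter list such as "SD" into its number.
sub c_number {
    my ($letters) = @_;
    my $n = 0;
    for my $letter (split //, $letters) {
        die "unknown -C letter '$letter'\n" unless exists $VALUE{$letter};
        $n |= $VALUE{$letter};   # OR rather than add, so repeats are harmless
    }
    return $n;
}

# Hypothetical helper: expand a number back into the basic letters.
sub c_letters {
    my ($n) = @_;
    return join '', grep { $n & $VALUE{$_} } qw(I O E i o A L);
}

printf "-CSD is the same as -C%d\n", c_number('SD');
printf "-C63 is the same as -C%s\n", c_letters(63);
```

Every flag in that table concerns I/O layers or @ARGV; none of them mentions the file names passed to `open`. At run time, the effective setting can be read from the `${^UNICODE}` variable.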
Re: Unicode File Names by Courage (Parson) on Feb 08, 2005 at 19:24 UTC
As I understand it, the reason lies in the way the OS uses those file names. Unfortunately it always re-encodes file names on the fly; for Russian, at least, it switches between CP866, CP1251 and Unicode. In my opinion this is mostly done for compatibility with the C libraries, which are console-mode based. For encodings other than Russian, including the Far East ones, things are more complicated.

Here is how I get around the problem using the OLE interface, with elder perl:

    use Win32::OLE qw(in CP_UTF8);
    use Win32::OLE::Const;
    Win32::OLE->Option(CP=>CP_UTF8);   # exchange strings with OLE as UTF-8
    use Unicode::String qw/utf8/;

    my $oshell = Win32::OLE->new('Shell.Application') or die "$@";
    my $f = $oshell->NameSpace(Win32::GetCwd());
    print "[$f]";
    my $fi = $f->Items;
    print $fi->Count;
    print "\n";
    for (0 .. $fi->Count-1) {
      my $item = $fi->Item($_);
      my $name = $item->Name;
      my $u=utf8($name);
      my $s = $u->hex;   # name as a list of U+XXXX code points
      # turn U+00xx back into the plain character when it is a safe one
      $s=~s/U\+00(\w\w)/my($r,$p)=((pack 'H*',$1),$&);if($r=~m(^[()\w .;\-+!]$)){$r}else{$p}/eg;
      # wrap every remaining code point in parentheses
      $s=~s/(U\+[\da-f][\da-f][\da-f][\da-f])/($1)/ig;
      # rename only if something outside U+00xx is left
      my $ren=0;
      $ren=1 if $s=~/U\+(?!00)/;
      $s=~s/[ +]//g;
      print "$ren\|$s\n";
      if($ren){$item->{Name}=$s}
    }

You may well find another solution in one of the Win32::xxxxxx modules.
Re^2: Unicode File Names by John M. Dlugosz (Monsignor) on Feb 08, 2005 at 21:48 UTC
Probably you're running into the OEM vs. ANSI code pages. Just using Unicode completely eliminates that problem!

There are two versions of the Windows API entry points: The -A form maps the 8-bit string based on current settings. The -W form takes a 16-bit string and doesn't mess with it.

Good idea using OLE! What is "elder perl" though?

--John
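A small sketch of why the 8-bit route is fragile and the 16-bit route is not. It only builds the byte strings with `Encode`; it does not call any Windows API. The sample name and the choice of CP1251 (ANSI) and CP866 (OEM) as the Russian code pages are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical file name: CYRILLIC CAPITAL LETTER ZHE followed by ".txt".
my $name = "\x{0416}.txt";

# The same name as 8-bit strings under two Russian code pages.
my $ansi = encode('cp1251', $name);   # ANSI code page
my $oem  = encode('cp866',  $name);   # OEM (console) code page

# The same name as a NUL-terminated 16-bit string, the shape a -W entry
# point expects; no code page is involved.
my $wide = encode('UTF-16LE', $name . "\0");

printf "ANSI bytes: %s\n", unpack 'H*', $ansi;
printf "OEM bytes:  %s\n", unpack 'H*', $oem;
printf "Wide bytes: %s\n", unpack 'H*', $wide;
```

The first byte differs between the two 8-bit forms, so a name written under one code page and read under the other comes out wrong. The 16-bit form has only one possible spelling.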
Re^3: Unicode File Names by Courage (Parson) on Feb 09, 2005 at 05:30 UTC
Exactly right: OEM vs. ANSI. But perl's `open` does not follow Unicode file naming conventions, if I am right.

"Elder perl" is 5.6.1, which is what I was using when I wrote that script. Hence `use Unicode::String;`
Re^4: Unicode File Names by John M. Dlugosz (Monsignor) on Feb 09, 2005 at 19:31 UTC
Re^5: Unicode File Names by John M. Dlugosz (Monsignor) on Feb 09, 2005 at 19:52 UTC
Re^5: Unicode File Names by Courage (Parson) on Feb 09, 2005 at 20:23 UTC