Re: utf8::downgrade() and file system operators
by hippo (Archbishop) on Feb 16, 2024 at 13:46 UTC
|
The BUGS section of utf8 to which you linked begins with the qualification
Some filesystems may not support ...
I take that to mean that the behaviour you have encountered may well depend on the filesystem type. It is probably worth verifying that before suggesting a fix which might itself break other things (e.g. on other filesystems). If you have an SSCCE to illustrate the problem, I'm sure other monks (me included) would be happy to run your test case on other filesystems to see how they behave.
|
perl -w .\TestFileOpsWithUTF8_Names.t
1..2
ok 1 - check -f with non-UTF8 file name
not ok 2 - check -f with UTF8 file name
# Failed test 'check -f with UTF8 file name'
# at .\TestFileOpsWithUTF8_Names.t line 18.
# Looks like you failed 1 test of 2.
[Code of file TestFileOpsWithUTF8_Names.t (1054 bytes) not shown]
|
You forgot to create the file with the other name.
open my $fh2, q{>}, $fnameUTF8 or die "could not create file $fnameUTF8: $!";
close $fh2 or die "could not close file $fnameUTF8: $!";
You might want to unlink it at the end, too.
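For anyone without the original test file, a self-contained version of such a test might look roughly like the sketch below. The file name, the temporary directory and the use of utf8::upgrade() to get a flagged copy of the name are assumptions for illustration, not the code from TestFileOpsWithUTF8_Names.t. Each name is created, checked with -f and then unlinked; what gets reported may differ between operating systems and filesystems.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical names: the original test file is not shown in this thread.
my $dir       = tempdir(CLEANUP => 1);
my $fname     = File::Spec->catfile($dir, "caf\x{E9}.txt");
my $fnameUTF8 = $fname;
utf8::upgrade($fnameUTF8);    # same characters, but UTF8 flag switched on

my %found;
for my $case ([plain => $fname], [upgraded => $fnameUTF8]) {
    my ($label, $name) = @$case;

    # Create the file under this exact name before testing for it.
    open my $fh, q{>}, $name or die "could not create file ($label): $!";
    close $fh or die "could not close file ($label): $!";

    $found{$label} = -f $name ? 1 : 0;
    print "$label: ", ($found{$label} ? "found" : "not found"), "\n";

    # Clean up, as suggested above.
    unlink $name;
}
```

Creating and testing each name in the same loop keeps the two cases symmetrical, so a failure points at the name handling rather than at a file that was never created.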
Re: utf8::downgrade() and file system operators
by NERDVANA (Priest) on Feb 17, 2024 at 04:37 UTC
|
Perl compiled for Windows unfortunately uses the 8-bit variants of the filesystem functions, and (except for a recent change in Win10 described below) is generally unable to use characters outside of your 8-bit codepage, whatever that happens to be. In short, while downgrade might work for the particular character you are running into here, it won't generally work.
Windows of course has a second filesystem API that uses UTF-16, and if perl used that it would have full Unicode support. But perl can't really switch to it either, because that would break backward compatibility and perl's APIs would start behaving differently between Linux and Windows. As dasgar mentions, you can access this API using Win32::LongPath. The downside is that now your script is using a funny API and is less portable.
The "recent change in Win10" is that there is now an option in the executable properties/metadata where you can set a custom codepage on the application itself (as opposed to just the codepage of its terminal) and one of those codepages is UTF-8! Giving perl.exe a codepage of UTF-8 causes all its normal filesystem functions to suddenly just start working, because the wide characters get decomposed to UTF-8 sequences when passed to a filesystem API and now Windows understands those sequences and so it all just works. This is a very recent change to Win10 and (AFAIK) strawberry perl does not yet compile this as the default codepage for perl.exe, so you have to set it yourself.
If you'd like to become a force of positive change, the right thing to ask for is for perl porters to change the Windows build settings to set the UTF-8 codepage on the executable by default. I don't know how to do this, and there's a good chance they also don't know how to do this, so if you did the research for them or submitted a patch, it would help everyone out.
It should be noted that this change would break scripts that were using upper-ascii in the local codepage! For example, if a Windows perl script wrote mkdir("\x{A9}") (Latin-1 copyright symbol) as a single byte, perl would not know that it needed to be encoded as UTF-8 before being passed to the mkdir() function. You would need to utf8::upgrade() or utf8::encode() it first. Or, use a Unicode text editor to write the character literally in the string and declare use utf8; at the top of the file. Then, along with the patch, ask the maintainers of Strawberry to start releasing two versions of perl.exe, one with the UTF-8 codepage set, and one without.
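To make the difference between the two calls concrete, here is a small sketch of what they do to the string itself. It only inspects the string; which of the two forms the Windows file functions need under a UTF-8 codepage is as described above and is not verified here. The hex_dump helper is a made-up name for illustration.

```perl
use strict;
use warnings;

# Show each character of a string as a hex code point.
sub hex_dump { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $name = "\x{A9}";          # one character (Latin-1 copyright sign)

my $upgraded = $name;
utf8::upgrade($upgraded);     # still one character; only the internal storage changes

my $encoded = $name;
utf8::encode($encoded);       # now two characters: the UTF-8 bytes of U+00A9

print "original: ", hex_dump($name),     "\n";   # A9
print "upgraded: ", hex_dump($upgraded), "\n";   # A9
print "encoded:  ", hex_dump($encoded),  "\n";   # C2 A9
```

So utf8::upgrade() leaves the string equal to the original at the Perl level, while utf8::encode() produces a different, longer byte string.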
(I'm not currently using Windows, or an active user of this new feature. I'm just relaying information I've gathered in other threads around here)
|
Yeah, that's what I was talking about. "Since 2019" wow, time flies. I would have guessed 2021.
The deeper problem is that Perl doesn't know what character encoding *any* scalar is using and can't make fully intelligent choices about when to encode as UTF-8 or when to downgrade. This corresponds to the more general Unix problem of never knowing whether a user wants their Unix byte-oriented paths and environment to have UTF-8 or some other encoding. You can *kind of* guess based on whether their $ENV{LC_ALL} =~ /utf-8/ but there doesn't seem to really be any official "all things in my system should be unicode" setting.
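As a rough illustration of that kind of guess, the sketch below checks the usual locale variables in POSIX precedence order (LC_ALL, then LC_CTYPE, then LANG) and takes the first one that is set. The function name locale_looks_utf8 is made up for this example, and the result is only a heuristic: nothing guarantees that file names on disk actually follow the locale.

```perl
use strict;
use warnings;

# Guess whether the environment asks for UTF-8, based on the first
# locale variable that is set and non-empty.
sub locale_looks_utf8 {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        # Matches both "UTF-8" and "utf8" spellings.
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;    # nothing set: assume not UTF-8
}

print locale_looks_utf8(\%ENV) ? "probably UTF-8\n" : "probably not UTF-8\n";
```

Passing the environment as a hash reference instead of reading %ENV directly makes the guess easy to try against different settings.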
Windows (NT onward at least) has always had an understanding of which codepage it was operating under, and official ways to exchange Unicode outside of that codepage. Perl doesn't have any way to generically tap into this knowledge without a matching understanding on Unix (or IBM AS/400 or VMS or all the other places where perl might run). So... you're just stuck always manually preparing the correct encoding of filenames on your own. It takes an unfortunate amount of education for people to get it right, though.
I also wrote up a Meditation about unicode filenames in general.
Re: utf8::downgrade() and file system operators
by dasgar (Priest) on Feb 17, 2024 at 00:54 UTC
|
I'm not a subject matter expert on this and I'm going from my memory, so I apologize for using incorrect terminology or other unintentional inaccuracies.
Windows has two different APIs for the filesystem. The default API that most programs use (including Perl) limits paths to about 260 characters (MAX_PATH) and does not support Unicode characters outside the current codepage. The other API supports both longer paths and Unicode characters.
Although it sounds like you found a workaround that appears to work with the paths that you tested, I personally would be concerned that there could be some situations (or corner cases) where it might not work. If you want to leverage the alternate filesystem API, you can take a look at Win32::LongPath, which "provides replacement functions for most of the native Perl file functions". In your post, you mention stat and the file test operators. The Win32::LongPath alternatives are statL and testL.
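A minimal sketch of how those two functions might be used is shown below. The file name is hypothetical, and the argument order of testL and the size key in the statL result are as I understand the module's documentation, so please check them before relying on this. The module is loaded at run time so the script still runs (and simply skips the check) on other platforms.

```perl
use strict;
use warnings;

# Hypothetical file name containing a character outside Latin-1.
my $path = "demo-\x{263A}.txt";
my $status;

if ($^O eq 'MSWin32' && eval { require Win32::LongPath; 1 }) {
    # Fully qualified calls, because the module is loaded at run time.
    if (Win32::LongPath::testL('f', $path)) {
        # Assumption: statL returns a hash reference on success.
        my $stat = Win32::LongPath::statL($path);
        $status = $stat ? "file, $stat->{size} bytes" : 'stat failed';
    }
    else {
        $status = 'no such file';
    }
}
else {
    $status = 'skipped';    # not Windows, or module not installed
}
print "$status\n";
```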
I'll refrain from commenting on what should or should not be recommended to Perl porters on this topic because I don't think I understand things enough to speak intelligently on the topic.
|
Thanks for bringing that second API to my attention!
I agree that utf8::downgrade() is not a general solution, because it depends on the characters being mappable. They are in my use cases for now.
I am interested in a more general solution without sacrificing portability, but if everything else fails, it is good to know that such a way out exists.
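The mappability caveat can be checked up front instead of discovered by a failing file test. The sketch below uses the second (FAIL_OK) argument of utf8::downgrade(), which makes it return false instead of dying when a character does not fit into a single byte. The helper name downgraded_name and the sample file names are made up for illustration. Note that this only tests for code points up to 0xFF; whether those bytes mean the same character in the active Windows codepage is a separate question.

```perl
use strict;
use warnings;

# Return a downgraded copy of $name, or undef when some character
# has a code point above 0xFF and therefore cannot be downgraded.
sub downgraded_name {
    my ($name) = @_;
    my $copy = $name;
    # Second argument true = FAIL_OK: return false instead of dying.
    return utf8::downgrade($copy, 1) ? $copy : undef;
}

for my $name ("caf\x{E9}.txt", "smiley-\x{263A}.txt") {
    my $bytes = downgraded_name($name);
    print defined $bytes
        ? "mappable\n"
        : "not mappable, needs another approach\n";
}
```

Working on a copy keeps the caller's string untouched, so the original name is still available for a fallback such as the wide-character API.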