hexcoder has asked for the wisdom of the Perl Monks concerning the following question:

Hello dear fellow monks!

As I found out the hard way on Windows, file-system-related functions like stat or the file test operators do not work in general with file path strings that contain non-ASCII characters and have the UTF8 flag set. Of course, if you compare such a variable with the same literal string without the UTF8 flag, they are reported as equal, but the effect with e.g. -f is different!
The only documentation I could find was in utf8, which I only found after I had analyzed the variable with Devel::Peek::Dump and then realized it was a UTF8 related effect.

My solution was to utf8::downgrade the string variable before using it with the above operators, which fixed my original problem.
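
To make the effect concrete, here is a minimal sketch of the situation described above. The file name is made up for illustration; only the core utf8:: functions are used.

```perl
use strict;
use warnings;

# Two strings with the same characters but different internal storage.
my $plain   = "caf\xE9.txt";   # hypothetical name, one byte per character
my $flagged = $plain;
utf8::upgrade($flagged);       # same characters, now stored as UTF-8 internally

printf "equal as strings:     %s\n", ($plain eq $flagged ? "yes" : "no");
printf "UTF8 flag on plain:   %s\n", (utf8::is_utf8($plain)   ? "on" : "off");
printf "UTF8 flag on flagged: %s\n", (utf8::is_utf8($flagged) ? "on" : "off");

# Downgrade a copy, so the original variable is left untouched.
my $for_fs = $flagged;
utf8::downgrade($for_fs);      # dies if a character does not fit in one byte
printf "UTF8 flag after downgrade: %s\n", (utf8::is_utf8($for_fs) ? "on" : "off");
```

The downgraded copy is what I then hand to stat or -f.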

But then I thought: wouldn't Perl be better if the file system related functions ensured this themselves by calling utf8::downgrade() on (a copy of) the file path string?
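
As a rough sketch of the idea, this is what such a step could look like when written in user code. The helper names fs_path and is_plain_file are hypothetical and not part of Perl; the FAIL_OK argument of utf8::downgrade is used so that an unmappable path yields undef instead of an exception.

```perl
use strict;
use warnings;

# Hypothetical helper: return a byte-string copy of $path, or undef if the
# path contains a character above 0xFF that cannot be downgraded.
sub fs_path {
    my ($path) = @_;
    my $copy = $path;
    # The second argument (FAIL_OK) makes downgrade return false instead of dying.
    return utf8::downgrade($copy, 1) ? $copy : undef;
}

# Hypothetical wrapper around -f that works on the downgraded copy.
# Returns 1 or 0, or undef when the path cannot be downgraded at all.
sub is_plain_file {
    my ($path) = @_;
    my $bytes = fs_path($path);
    return undef unless defined $bytes;
    return (-f $bytes) ? 1 : 0;
}
```

The undef case is the open question: a path with a character above 0xFF has no downgraded form, so any built-in version of this would have to decide what to do there.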

Before I suggest this as an improvement to the Perl porters, I would like to know whether there are any points, arguments, or use cases against that suggestion.

Thanks for any enlightenment!

Re: utf8::downgrade() and file system operators
by hippo (Archbishop) on Feb 16, 2024 at 13:46 UTC

    The BUGS section of utf8 to which you linked begins with the qualification

    Some filesystems may not support ...

    I take that to mean that the behaviour you have encountered may well be dependent upon the filesystem type. It is probably worth verifying that before suggesting a fix which itself might break other things (eg. on other filesystems). If you have an SSCCE to illustrate the problem, I'm sure other monks (me included) would be happy to run your test case on other filesystems to see how they behave.


    🦛

      Thanks for the sensible suggestion!

      I put an SSCCE together, which produces this output with strawberry-perl v5.38.0 under Windows 10 (default file system type, I guess):

          perl -w .\TestFileOpsWithUTF8_Names.t
          1..2
          ok 1 - check -f with non-UTF8 file name
          not ok 2 - check -f with UTF8 file name
          #   Failed test 'check -f with UTF8 file name'
          #   at .\TestFileOpsWithUTF8_Names.t line 18.
          # Looks like you failed 1 test of 2.

        You forgot to create the file with the other name.
            open my $fh2, q{>}, $fnameUTF8 or die "could not create file $fnameUTF8:$!";
            close $fh2 or die "could not close file $fnameUTF8";

        You might want to unlink it at the end, too.
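
        Put together, a complete test could look roughly like the sketch below. The original script is not shown in this thread, so the temporary directory, the file name and the variable $fname are assumptions; only $fnameUTF8 and the two test descriptions are taken from the posts above. The results depend on platform and filesystem.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Test::More;

# Work in a scratch directory that is removed when the script exits.
my $dir = tempdir(CLEANUP => 1);

my $fname     = "$dir/caf\xE9.txt";   # assumed name; byte string, UTF8 flag off
my $fnameUTF8 = $fname;
utf8::upgrade($fnameUTF8);            # same characters, UTF8 flag on

# Create a file under each variant of the name.
for my $name ($fname, $fnameUTF8) {
    open my $fh, q{>}, $name or die "could not create file: $!";
    close $fh or die "could not close file: $!";
}

ok(-f $fname,     'check -f with non-UTF8 file name');
ok(-f $fnameUTF8, 'check -f with UTF8 file name');

# Clean up both names; they may refer to one file or to two.
unlink $fname, $fnameUTF8;

done_testing();
```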

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: utf8::downgrade() and file system operators
by NERDVANA (Priest) on Feb 17, 2024 at 04:37 UTC

    Perl compiled for Windows unfortunately uses the 8-bit variants of the filesystem functions, and (except for a recent change in Win10 described below) is generally unable to use characters outside of your 8-bit codepage, whatever that happens to be. In short, while downgrade might work for the particular character you are running into here, it won't generally work.

    Windows of course has a second filesystem API that uses UTF-16, and if perl used that it would have full Unicode support. But perl can't really switch to it either, because that would break backward compatibility and perl's APIs would start behaving differently between Linux and Windows. As dasgar mentions, you can access this API using Win32::LongPath. The downside is that now your script is using a funny API and is less portable.

    The "recent change in Win10" is that there is now an option in the executable properties/metadata where you can set a custom codepage on the application itself (as opposed to just the codepage of its terminal) and one of those codepages is UTF-8! Giving perl.exe a codepage of UTF-8 causes all its normal filesystem functions to suddenly just start working, because the wide characters get decomposed to UTF-8 sequences when passed to a filesystem API and now Windows understands those sequences and so it all just works. This is a very recent change to Win10 and (AFAIK) strawberry perl does not yet compile this as the default codepage for perl.exe, so you have to set it yourself.

    If you'd like to become a force of positive change, the right thing to ask for is for perl porters to change the Windows build settings to set the UTF-8 codepage on the executable by default. I don't know how to do this, and there's a good chance they also don't know how to do this, so if you did the research for them or submitted a patch, it would help everyone out.

    It should be noted that this change would break scripts that were using upper-ascii in the local codepage! For example, if a windows perl script wrote mkdir("\x{A9}") (Latin-1 copyright symbol) as a single byte, perl would not know that it needed to be encoded as UTF-8 before passing it to the mkdir() function. You would need to utf8::upgrade() or utf8::encode() it first. Or, use a unicode text editor to write the character literally in the string and declare use utf8; at the top of the file. Then, along with the patch, ask the maintainers of Strawberry to start releasing two versions of perl.exe, one with the UTF-8 codepage set, and one without.
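
    The two functions mentioned above do different things to the string, which a small sketch can show. It only inspects the strings and does not touch the filesystem; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $name = "\x{A9}";        # one character, stored as the single byte 0xA9

my $upgraded = $name;
utf8::upgrade($upgraded);   # still one character; only the internal storage changes

my $encoded = $name;
utf8::encode($encoded);     # now two characters, the UTF-8 bytes 0xC2 0xA9

for my $pair (['original', $name], ['upgraded', $upgraded], ['encoded', $encoded]) {
    my ($label, $str) = @$pair;
    printf "%-8s length %d, UTF8 flag %s\n",
        $label, length($str), (utf8::is_utf8($str) ? "on" : "off");
}
```

    So utf8::encode() changes what the string contains, while utf8::upgrade() leaves the characters alone and relies on perl handing its internal UTF-8 buffer to the OS call.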

    (I'm not currently using Windows, or an active user of this new feature. I'm just relaying information I've gathered in other threads around here)

        Yeah, that's what I was talking about. "Since 2019" wow, time flies. I would have guessed 2021.

        The deeper problem is that Perl doesn't know what character encoding *any* scalar is using and can't make fully intelligent choices about when to encode as UTF-8 or when to downgrade. This corresponds to the more general Unix problem of never knowing whether a user wants their Unix byte-oriented paths and environment to have UTF-8 or some other encoding. You can *kind of* guess based on whether their $ENV{LC_ALL} =~ /utf-8/ but there doesn't seem to really be any official "all things in my system should be unicode" setting.
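
        The kind of guess meant here could be sketched as follows. The helper name env_looks_utf8 is hypothetical, and checking LC_CTYPE and LANG after LC_ALL is an assumption based on the usual POSIX locale precedence. It stays a heuristic, not an official setting.

```perl
use strict;
use warnings;

# Hypothetical helper: guess from the usual locale variables whether the
# user's environment is UTF-8. Takes a hash reference such as \%ENV.
sub env_looks_utf8 {
    my ($env) = @_;
    # The first variable that is set and non-empty decides.
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;   # nothing set: assume not UTF-8
}

print "environment looks like UTF-8: ",
    (env_looks_utf8(\%ENV) ? "yes" : "no"), "\n";
```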

        Windows (NT onward at least) has always had an understanding of which codepage it was operating under, and official ways to exchange unicode outside of that codepage. Perl doesn't have any way to generically tap into this knowledge without a matching understanding on Unix (or IBM AS/400 or VMS or all the other places where perl might run). So... you're just stuck always manually preparing the correct encoding of filenames on your own. It takes an unfortunate amount of education for people to get it right, though.

        I also wrote up a Meditation about unicode filenames in general.

Re: utf8::downgrade() and file system operators
by dasgar (Priest) on Feb 17, 2024 at 00:54 UTC

    I'm not a subject matter expert on this and I'm going from my memory, so I apologize for using incorrect terminology or other unintentional inaccuracies.

    Windows has two different APIs for the filesystem. The default API that most programs use (including Perl) limits paths to about 260 characters and does not support Unicode characters. The other API supports both longer paths and Unicode characters.

    Although it sounds like you found a workaround that appears to work with the paths that you tested, I personally would be concerned that there could be some situations (or corner cases) where that might not work. If you want to leverage the alternate filesystem API, you can take a look at Win32::LongPath, which "provides replacement functions for most of the native Perl file functions". From your post, you mention stat and file test operators. The Win32::LongPath alternates are statL and testL.
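
    A rough, untested sketch of how testL could be wrapped so that a script still runs elsewhere. The wrapper name is_file_any is hypothetical, and the call testL('f', $path) reflects my reading of the Win32::LongPath documentation, so treat it as an assumption. On other platforms, or when the module is not installed, it falls back to the built-in -f.

```perl
use strict;
use warnings;

# Hypothetical wrapper: use Win32::LongPath's testL on Windows when the
# module can be loaded, otherwise fall back to the built-in -f operator.
sub is_file_any {
    my ($path) = @_;
    if ($^O eq 'MSWin32' && eval { require Win32::LongPath; 1 }) {
        # Assumed call shape: test type first, then the path.
        return Win32::LongPath::testL('f', $path) ? 1 : 0;
    }
    return (-f $path) ? 1 : 0;
}

print "this script is a plain file: ", (is_file_any($0) ? "yes" : "no"), "\n";
```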

    I'll refrain from commenting on what should or should not be recommended to Perl porters on this topic because I don't think I understand things enough to speak intelligently on the topic.

      Thanks for bringing that second API to my attention!

      I agree that utf8::downgrade() is not a general solution, because it depends on the characters being mappable. They are in my use cases for now.

      I am interested in a more general solution without sacrificing portability, but if everything else fails, it is good to know that such a way out exists.