in reply to utf8::downgrade() and file system operators

Perl compiled for Windows unfortunately uses the 8-bit variants of the filesystem functions, and (except for a recent change in Win10 described below) is generally unable to use characters outside of your 8-bit codepage, whatever that happens to be. In short, while downgrade might work for the particular character you are running into here, it won't generally work.

Windows of course has a second filesystem API that uses UTF-16, and if perl used that it would have full Unicode support. It can't really switch, though, because that would break backward compatibility and perl's APIs would start behaving differently between Linux and Windows. As dasgar mentions, you can access this API using Win32::LongPath. The downside is that your script is then using a nonstandard API and is less portable.
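To make the portability tradeoff concrete, here is a minimal sketch of a wrapper that uses Win32::LongPath's mkdirL when it is available and falls back to core mkdir elsewhere. The name make_dir_unicode is hypothetical, and the fallback assumes the non-Windows filesystem expects UTF-8 bytes, which is only a convention.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Load Win32::LongPath only on Windows, and only if it is installed.
my $have_longpath = $^O eq 'MSWin32' && eval { require Win32::LongPath; 1 };

# Hypothetical helper: create a directory whose name is a string of
# Unicode characters (not pre-encoded bytes).
sub make_dir_unicode {
    my ($path) = @_;
    if ($have_longpath) {
        # mkdirL goes through the wide-character (UTF-16) Windows API.
        return Win32::LongPath::mkdirL($path);
    }
    # Assumption: elsewhere, filenames are UTF-8 byte strings.
    my $bytes = $path;
    utf8::encode($bytes);
    return mkdir $bytes;
}

# Demo in a throwaway directory that File::Temp cleans up on exit.
my $base = tempdir(CLEANUP => 1);
make_dir_unicode("$base/caf\x{E9}_\x{263A}") or die "mkdir failed: $!";
print "created directory\n";
```

Every filesystem call in the script would need the same kind of wrapper (openL, opendirL, unlinkL and so on), which is exactly the portability cost mentioned above.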

The "recent change in Win10" is that there is now an option in the executable properties/metadata where you can set a custom codepage on the application itself (as opposed to just the codepage of its terminal) and one of those codepages is UTF-8! Giving perl.exe a codepage of UTF-8 causes all its normal filesystem functions to suddenly just start working, because the wide characters get decomposed to UTF-8 sequences when passed to a filesystem API and now Windows understands those sequences and so it all just works. This is a very recent change to Win10 and (AFAIK) strawberry perl does not yet compile this as the default codepage for perl.exe, so you have to set it yourself.
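If you want a script to find out whether it is running under such a perl.exe, one option is to ask Windows for the process's ANSI code page. This is a sketch, assuming Win32::GetACP() reflects the codepage set on the executable and that 65001 (the Windows identifier for UTF-8) is the value to look for. The helper names are made up for illustration.

```perl
use strict;
use warnings;

# Windows identifier for the UTF-8 code page.
use constant CP_UTF8 => 65001;

# Returns the ANSI code page number on Windows, or undef elsewhere.
# Explicit "return undef" so the result is safe in list context too.
sub ansi_codepage {
    return undef unless $^O eq 'MSWin32';
    require Win32;
    return Win32::GetACP();
}

# True only when running on Windows with UTF-8 as the active code page.
sub codepage_is_utf8 {
    my $cp = ansi_codepage();
    return defined $cp && $cp == CP_UTF8 ? 1 : 0;
}

printf "ANSI code page: %s, UTF-8 active: %s\n",
    ansi_codepage() // 'n/a (not Windows)',
    codepage_is_utf8() ? 'yes' : 'no';
```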

If you'd like to become a force of positive change, the right thing to ask for is for perl porters to change the Windows build settings to set the UTF-8 codepage on the executable by default. I don't know how to do this, and there's a good chance they also don't know how to do this, so if you did the research for them or submitted a patch, it would help everyone out.

It should be noted that this change would break scripts that were using upper-ASCII characters in the local codepage! For example, if a Windows perl script wrote mkdir("\x{A9}") (Latin-1 copyright symbol) as a single byte, perl would not know that it needed to be encoded as UTF-8 before passing it to the mkdir() function. You would need to utf8::upgrade() or utf8::encode() it first. Or, use a Unicode text editor to write the character literally in the string and declare use utf8; at the top of the file. Then, along with the patch, ask the maintainers of Strawberry to start releasing two versions of perl.exe, one with the UTF-8 codepage set, and one without.
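The difference between the single byte and the UTF-8 sequence is easy to see without touching the filesystem. The sketch below dumps the bytes held in each scalar's buffer, which is what a filesystem call ends up receiving. buffer_hex is a helper invented for this illustration.

```perl
use strict;
use warnings;

# Hex dump of the bytes stored in a scalar's buffer.
sub buffer_hex {
    my ($s) = @_;
    # An upgraded scalar stores its characters as UTF-8 internally, so
    # encoding a copy reproduces the bytes in its buffer.
    utf8::encode($s) if utf8::is_utf8($s);
    return unpack("H*", $s);
}

my $raw = "\x{A9}";                                # one Latin-1 byte
my $upgraded = "\x{A9}"; utf8::upgrade($upgraded); # same character, UTF-8 inside
my $encoded  = "\x{A9}"; utf8::encode($encoded);   # two separate bytes

printf "%-9s chars=%d bytes=%s\n", 'raw',      length $raw,      buffer_hex($raw);
printf "%-9s chars=%d bytes=%s\n", 'upgraded', length $upgraded, buffer_hex($upgraded);
printf "%-9s chars=%d bytes=%s\n", 'encoded',  length $encoded,  buffer_hex($encoded);
# raw       chars=1 bytes=a9
# upgraded  chars=1 bytes=c2a9
# encoded   chars=2 bytes=c2a9
```

Note that $upgraded still compares equal to $raw as a string, while $encoded does not. Only the stored bytes differ in the first case, whereas utf8::encode() changes the string's content.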

(I'm not currently using Windows, or an active user of this new feature. I'm just relaying information I've gathered in other threads around here)

Re^2: utf8::downgrade() and file system operators
by hexcoder (Curate) on Feb 17, 2024 at 12:45 UTC
      Yeah, that's what I was talking about. "Since 2019" wow, time flies. I would have guessed 2021.

      The deeper problem is that Perl doesn't know what character encoding *any* scalar is using and can't make fully intelligent choices about when to encode as UTF-8 or when to downgrade. This corresponds to the more general Unix problem of never knowing whether a user wants their byte-oriented paths and environment to be UTF-8 or some other encoding. You can *kind of* guess based on whether their $ENV{LC_ALL} =~ /utf-?8/i (the match needs to be case-insensitive, since the usual spelling is something like en_US.UTF-8) but there doesn't seem to be any official "all things in my system should be unicode" setting.
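      A slightly more careful version of that guess, as a sketch: it checks LC_ALL, LC_CTYPE and LANG in the usual POSIX precedence order and stops at the first one that is set. The function name is made up, and the result is still only a heuristic, for the reasons above.

```perl
use strict;
use warnings;

# Guess whether the environment asks for UTF-8, following the POSIX
# precedence LC_ALL > LC_CTYPE > LANG. Takes a hash ref so it can be
# tested without touching the real %ENV.
sub env_wants_utf8 {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        # Unset or empty variables are skipped, as the locale system does.
        next unless defined $value && length $value;
        # The first variable that is set decides; accept "UTF-8" and "utf8".
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

print "UTF-8 locale: ",
    (env_wants_utf8(\%ENV) ? "probably" : "no, or unknown"), "\n";
```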

      Windows (NT onward at least) has always had an understanding of which codepage it was operating under, and official ways to exchange Unicode outside of that codepage. Perl doesn't have any way to generically tap into this knowledge without a matching understanding on Unix (or IBM AS/400 or VMS or all the other places where perl might run). So... you're just stuck always manually preparing the correct encoding of filenames on your own. It takes an unfortunate amount of education for people to get it right, though.

      I also wrote up a Meditation about unicode filenames in general.