Re^2: utf8::downgrade() and file system operators

Thanks very much for this detailed description of the current situation; it is much appreciated!

I googled for codepage setting and came up with this entry: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page. Is that the API you meant?

From the page linked above I gather that a manifest file (setting the UTF-8 code page) can be attached to the executable (strawberry-perl) with mt.exe. I naïvely tried that with perl.exe, but it did not work, which could be my fault.

I will proceed with looking into the Strawberry code base as time permits. Edit: I found this related issue from ikegami (from 2020): https://github.com/StrawberryPerl/Perl-Dist-Strawberry/issues/18.

Edit2: There is also an issue in the upstream perl repo: https://github.com/Perl/perl5/issues/17094.

Thanks again for setting me on the right track.

Replies are listed 'Best First'.
Re^3: utf8::downgrade() and file system operators
by NERDVANA (Priest) on Feb 18, 2024 at 03:38 UTC

Yeah, that's what I was talking about. "Since 2019", wow, time flies. I would have guessed 2021.

The deeper problem is that Perl doesn't know what character encoding any scalar is using and can't make fully intelligent choices about when to encode as UTF-8 or when to downgrade. This corresponds to the more general Unix problem of never knowing whether a user wants their byte-oriented paths and environment to be UTF-8 or some other encoding. You can kind of guess based on whether their $ENV{LC_ALL} =~ /utf-8/i, but there doesn't seem to be any official "all things in my system should be Unicode" setting.

Windows (NT onward at least) has always had an understanding of which codepage it was operating under, and official ways to exchange Unicode outside of that codepage. Perl doesn't have any way to generically tap into this knowledge without a matching understanding on Unix (or IBM AS/400 or VMS or all the other places where perl might run). So... you're just stuck always manually preparing the correct encoding of filenames on your own. It takes an unfortunate amount of education for people to get it right, though.

I also wrote up a Meditation about Unicode filenames in general.
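
To make "manually preparing the correct encoding" a bit more concrete, here is a minimal sketch of guessing from the locale and encoding by hand. The helper names locale_looks_utf8 and filename_bytes are made up for illustration, checking LC_CTYPE and LANG in addition to LC_ALL is my assumption about what a reasonable guess looks like, and the ISO-8859-1 fallback is purely a placeholder. It only uses Encode::encode from core, and it is a guess at the user's intent, not an authoritative answer.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the locale variables suggest a UTF-8 environment.
# Checks LC_ALL, LC_CTYPE and LANG in that order; the first non-empty one wins.
sub locale_looks_utf8 {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;    # nothing set, so there is no basis for a guess
}

# Hypothetical helper: turn a character string into the bytes that get
# handed to open(), -e, unlink() and friends. Croaks if a character
# cannot be represented in the chosen encoding.
sub filename_bytes {
    my ($name, $encoding) = @_;
    return Encode::encode($encoding, $name,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $name = "caf\x{e9}.txt";    # "café.txt" as characters

# Assumption: fall back to ISO-8859-1 when the locale does not look like UTF-8.
my $encoding = locale_looks_utf8(\%ENV) ? 'UTF-8' : 'ISO-8859-1';
my $bytes    = filename_bytes($name, $encoding);

# Show which encoding was picked and the resulting bytes in hex.
print "$encoding: ", join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

The point of going through explicit bytes is that the file operators then see the same thing regardless of whether the scalar happened to be upgraded or downgraded internally.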
