in reply to how are ARGV and filename strings represented?

In Windows, @ARGV will contain the arguments encoded using the Active Code Page.

Elsewhere, @ARGV will be an exact copy of the string passed to exec or whatever.
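
Since those arguments arrive as raw bytes, a program that wants text has to decode them itself. A minimal sketch, assuming the caller passed UTF-8 (an assumption; the bytes could be in any encoding, or none) and using a hypothetical helper named decode_args:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode raw argument bytes into text.
# Assumes the arguments were encoded using UTF-8.
sub decode_args {
    my @raw = @_;
    return map { decode( 'UTF-8', $_ ) } @raw;
}

# @args now holds strings of characters rather than bytes.
my @args = decode_args( @ARGV );
```

If the arguments are file names that will only be passed back to open, leaving them undecoded avoids a round trip.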


open and Perl's other file operators suffer from The Unicode Bug. They will use the internal representation of the string provided. In effect, they transform the string as follows:

    sub transform {
        my $s = shift;

        # Encode using UTF-8 if the string is stored using the
        # upgraded/wide/UTF8=1 internal storage format.
        utf8::encode( $s ) if utf8::is_utf8( $s );

        return $s;
    }

On Windows, the string is expected to be the file name encoded using the Active Code Page.

Re^2: how are ARGV and filename strings represented?
by almr (Beadle) on May 01, 2024 at 17:31 UTC
       utf8::encode( $s ) if utf8::is_utf8( $s );

    did you mean "encode() unless is_utf8()"?

    file operators suffer from The Unicode Bug
    What does this mean in this context? I wasn't able to trigger incorrect filenames with any ARGV; but the following did, and seems to be what you're referring to:
    IO::File->new( chr (0xbb), q(>))
    So what should I do -- always encode args to open()?

    As for Windows, is there a "best practices" prelude somewhere? I've seen a lot of confusing answers, and I wasn't even attempting to tackle it, but now that you've mentioned it I'd like to know.

      did you mean "encode() unless is_utf8()"?

      No. is_utf8 returns true if the string is stored using the upgraded/wide/UTF8=1 internal storage format. So the sub encodes the string using UTF-8 if it's stored in that format.

      utf8::downgrade( my $d = "\xE3" );
      printf "%vX\n", $d;                            # E3
      printf "%d\n", utf8::is_utf8( $d ) ? 1 : 0;    # 0
      printf "%vX\n", transform( $d );               # E3

      utf8::upgrade( my $u = "\xE3" );
      printf "%vX\n", $u;                            # E3
      printf "%d\n", utf8::is_utf8( $u ) ? 1 : 0;    # 1
      printf "%vX\n", transform( $u );               # C3.A3

      utf8::upgrade( my $w = "\x{2660}" );
      printf "%vX\n", $w;                            # 2660
      printf "%d\n", utf8::is_utf8( $w ) ? 1 : 0;    # 1
      printf "%vX\n", transform( $w );               # E2.99.A0

      IO::File->new( chr (0xbb), q(>))

      chr(0xbb) returns a string using the downgraded/8bit/UTF8=0 internal storage format.

      Outside of Windows, that will create a file whose name consists of the byte BB.

      In Windows, that will create a file whose name is the result of decode( "cp".Win32::GetACP(), chr( 0xBB ) ).

      So what should I do -- always encode args to open()?

      File names in unix are just a sequence of bytes (which may not contain bytes 00 and 2F). It's best if they're text encoded using the current locale, but you could obtain file names which aren't.

      For creating a file? Yes. encode returns a string using the downgraded/8bit/UTF8=0 internal storage format, so it won't be mangled by open.
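
      A minimal sketch of encoding a name before creating a file, assuming the file system expects UTF-8 names (an assumption; use the locale's encoding otherwise). The helper encode_file_name is hypothetical, and the open call is left commented out so the snippet has no side effects:

      ```perl
      use strict;
      use warnings;
      use Encode qw( encode );

      # Hypothetical helper: turn a text file name into the bytes given to open.
      # Assumes the file system expects names encoded using UTF-8.
      sub encode_file_name {
          my ( $name ) = @_;
          return encode( 'UTF-8', $name );
      }

      my $qfn = encode_file_name( "\x{2660}.txt" );
      printf "%vX\n", $qfn;                             # E2.99.A0.2E.74.78.74
      printf "%d\n", utf8::is_utf8( $qfn ) ? 1 : 0;     # 0

      # open( my $fh, '>', $qfn ) or die "Can't create file: $!";
      ```

      Because the result is stored in the downgraded format, open passes those bytes through unchanged.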

      file operators suffer from The Unicode Bug
      What does this mean in this context?

      It's not a "bug" per-se, just unspecified behavior that people constantly run into. Perl is designed to run on a wide variety of systems, and there is no one-size-fits-all solution for character encoding, so perl just doesn't solve it. The result is that people on Unix have to explicitly encode and decode the bytes for their file names, environment variables, ARGV, and stdin/stdout/stderr. On Windows, things are just kind of broken because the Windows 8-bit APIs use a code page and don't provide any Unicode workaround. (Only recently did Windows gain a UTF-8 code page that behaves like Unix, but you have to configure it in the manifest of the .exe file, and it only works on recent-ish Win10 versions and newer.)

      See Also:

      If you need Unicode filesystem or shell support on Windows, I recommend Cygwin's perl until such time as Strawberry releases a perl.exe with the UTF-8 codepage configured.

        It's not a "bug" per-se

        It is. And that's the official name for code that behaves differently depending on the internal representation of a string.

        just unspecified behavior that people constantly run into.

        The issue is not that the behaviour is unspecified. A lot of Perl behaviour is unspecified. That's not really an issue because there's only one Perl interpreter. The interpreter is the language, so to speak.

        The issue is that two equal strings produce different results. That is most definitely a bug.

        $d eq $u is true in the snippet I provided earlier, so open should do the same for both. But it doesn't. That's a bug.
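
        A minimal sketch of that point, reusing the transform sub from earlier in the thread as a stand-in for what open effectively does to a file name:

        ```perl
        use strict;
        use warnings;

        # Same transform as earlier in the thread: encodes using UTF-8
        # only if the string is stored in the upgraded format.
        sub transform {
            my $s = shift;
            utf8::encode( $s ) if utf8::is_utf8( $s );
            return $s;
        }

        utf8::downgrade( my $d = "\xE3" );
        utf8::upgrade( my $u = "\xE3" );

        # The two strings compare equal...
        print $d eq $u ? "equal\n" : "different\n";    # equal

        # ...but the bytes that would reach the file system differ.
        printf "%vX\n", transform( $d );               # E3
        printf "%vX\n", transform( $u );               # C3.A3
        ```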

        there is no one-size-fits-all solution for character encoding,

        Not so. In this area, it's most definitely possible to provide an interface that works on all systems.

        until such time as Strawberry releases a perl.exe with the UTF-8 codepage configured.

        I did file a ticket requesting this some time ago.