how are ARGV and filename strings represented?

almr has asked for the wisdom of the Perl Monks concerning the following question:

I can see what the OS passes to perl as "char *argv[]", and also what Perl actually stores in @ARGV, with the following:

# x.pl
use v5.28; use Data::HexDump;
use Encode qw(encode decode);
my $a0 = $ARGV[0];
my $a0d = decode( q(UTF-8), $a0 );
say HexDump($a0);
say HexDump($a0d);
open my $fh, q(>), $a0;
open my $fhd, q(>), $a0d;
sleep 99;

# shell
aa=$(perl -CSDA -we 'print 1 . chr(0xb1) . chr(0x155) . 2')
perl -w x.pl "$aa" &
ls -rt
xxd /proc/`pidof perl`/cmdline

This also creates a file (or two) with a "funny" name.

Now, on my particular system, it seems that perl starts with a UTF-8 encoding of "$aa" in its /proc/<pid>/cmdline (i.e. the C "char *argv[]"), and that Perl's @ARGV corresponds verbatim to C argv[].

It also seems that for open(), a byte string and a decoded string (one Unicode code point per character, as obtained via decode) both create the same filename. Maybe Perl does an implicit utf8 decode()?

But this is just an experiment on a particular setup. I'm not really sure (1) what Perl actually does on startup (does it process its C argv, maybe according to locale), (2) what open() is supposed to take, and (3) what readlink(), readdir() etc are supposed to return. What are the rules?

EDIT: fixed second perl invocation


Re: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 01, 2024 at 16:22 UTC
In Windows, `@ARGV` will contain the arguments encoded using the Active Code Page. Elsewhere, `@ARGV` will be an exact copy of the string passed to `exec` or whatever.

`open` and Perl's other file operators suffer from The Unicode Bug. They will use the internal representation of the string provided. So in effect, they transform the string as follows:

sub transform {
    my $s = shift;

    # Encode using UTF-8 if the string uses the
    # upgraded/wide/UTF8=1 string storage format.
    utf8::encode( $s ) if utf8::is_utf8( $s );

    return $s;
}

On Windows, the string is expected to be the file name encoded using the Active Code Page.
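A minimal sketch of what this means in practice, assuming a Linux-style filesystem that accepts arbitrary bytes in file names (some filesystems, for example on macOS, may reject the lone byte E3). The helper name `names_created` is made up for illustration. Two strings that compare equal in Perl but differ in internal storage are handed to the OS as different byte sequences:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: create one file per given name inside a fresh
# temporary directory, then return the names as the OS stored them,
# formatted as dot-separated hex bytes.
sub names_created {
    my @names = @_;
    my $dir = tempdir( CLEANUP => 1 );
    for my $name (@names) {
        open( my $fh, '>', "$dir/$name" ) or die "open failed: $!";
        close $fh;
    }
    opendir( my $dh, $dir ) or die "opendir failed: $!";
    my @found = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    my @hex = sort map { sprintf '%vX', $_ } @found;
    return @hex;
}

# Same single character, two internal storage formats.
my $down = "\xE3"; utf8::downgrade($down);
my $up   = "\xE3"; utf8::upgrade($up);

print "equal in Perl\n" if $down eq $up;
print "$_\n" for names_created( $down, $up );
```

On a typical Linux system this should list two distinct names, the single byte E3 and the two bytes C3.A3, even though `$down eq $up` is true.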
Re^2: how are ARGV and filename strings represented? by almr (Beadle) on May 01, 2024 at 17:31 UTC
> `utf8::encode( $s ) if utf8::is_utf8( $s );`

Did you mean "encode() unless is_utf8()"?

> file operators suffer from The Unicode Bug

What does this mean in this context? I wasn't able to trigger incorrect filenames with any ARGV; but the following did, and seems to be what you're referring to:

`IO::File->new( chr (0xbb), q(>))`

So what should I do -- always encode args to open()?

As for Windows, is there a "best practices" prelude somewhere? I've seen a lot of confusing answers, and I wasn't even attempting to tackle it, but now that you've mentioned it I'd like to know.
Re^3: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 01, 2024 at 18:33 UTC
> did you mean "encode() unless is_utf8()"?

No. `is_utf8` returns true if the string is stored using the upgraded/wide/UTF8=1 internal storage format. So the sub encodes the string using UTF-8 if it's stored in that format.

utf8::downgrade( my $d = "\xE3" );
printf "%vX\n", $d;                              # E3
printf "%d\n", utf8::is_utf8( $d ) ? 1 : 0;      # 0
printf "%vX\n", transform( $d );                 # E3

utf8::upgrade( my $u = "\xE3" );
printf "%vX\n", $u;                              # E3
printf "%d\n", utf8::is_utf8( $u ) ? 1 : 0;      # 1
printf "%vX\n", transform( $u );                 # C3.A3

utf8::upgrade( my $w = "\x{2660}" );
printf "%vX\n", $w;                              # 2660
printf "%d\n", utf8::is_utf8( $w ) ? 1 : 0;      # 1
printf "%vX\n", transform( $w );                 # E2.99.A0

> `IO::File->new( chr (0xbb), q(>))`

`chr(0xbb)` returns a string using the downgraded/8bit/UTF8=0 internal storage format. Outside of Windows, that will create a file whose name consists of the byte BB. In Windows, that will create a file whose name is the result of `decode( "cp".Win32::GetACP(), chr( 0xBB ) )`.

> So what should I do -- always encode args to open()?

File names in unix are just a sequence of bytes (which may not contain bytes 00 and 2F). It's best if they're text encoded using the current locale, but you could obtain file names which aren't.

For creating a file? `encode` returns a string using the downgraded/8bit/UTF8=0 internal storage format, so it won't be mangled by `open`.
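One way to act on this when creating files, as a sketch: encode a decoded (text) file name explicitly before handing it to `open`. This assumes a Unix-like system where file names are conventionally UTF-8, which is an assumption rather than a guarantee. The helper name `open_for_write` and its two-argument interface are invented here.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical helper: take a directory (already a byte string) and a
# file name given as decoded text, encode the name to UTF-8 bytes, and
# open the file for writing under that byte name.
# Assumption: the filesystem convention is UTF-8.
sub open_for_write {
    my ( $dir_bytes, $text_name ) = @_;
    my $name_bytes = encode( 'UTF-8', $text_name );
    my $path       = "$dir_bytes/$name_bytes";
    open( my $fh, '>', $path ) or die "Cannot create file: $!";
    return ( $fh, $path );
}

my $dir = tempdir( CLEANUP => 1 );
my ( $fh, $path ) = open_for_write( $dir, "caf\x{E9}-\x{2660}.txt" );
close $fh;
printf "%vX\n", substr( $path, length($dir) + 1 );   # bytes of the created name
```

Because the name reaches `open` as an already-encoded byte string, its internal storage format no longer influences which bytes the OS sees.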
Re^3: how are ARGV and filename strings represented? by NERDVANA (Priest) on May 01, 2024 at 21:00 UTC
> file operators suffer from The Unicode Bug
>
> What does this mean in this context?

It's not a "bug" per se, just unspecified behavior that people constantly run into. Perl is designed to run on a wide variety of systems, and there is no one-size-fits-all solution for character encoding, so perl just doesn't solve it.

The result is that people on Unix have to explicitly encode and decode the bytes for their file names, environment variables, ARGV, and stdin/stdout/stderr. On Windows things are just kind of broken, because the Windows 8-bit APIs use a codepage and don't provide any Unicode workaround (until recently, where they give you a UTF-8 codepage that behaves like Unix, but you have to configure that in the manifest of the .exe file, and it only works on recent-ish Win10 versions and newer).

See also: Handling of Unicode File Names; What would you like to see in a Virtual Filesystem for Perl?

If you need Unicode filesystem or shell support on Windows, I recommend Cygwin's perl until such time as Strawberry releases a perl.exe with the UTF-8 codepage configured.
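The explicit decoding and encoding described here might look like the following on a Unix-like system. This is only a sketch: it assumes every external string (arguments, directory entries) is UTF-8, which nothing guarantees, and the sub names `bytes_to_text` and `text_to_bytes` are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed convention for this sketch: all external byte strings are UTF-8.
my $ENCODING = 'UTF-8';

# Decode byte strings (e.g. @ARGV or readdir results) into text.
# In Encode's default mode, malformed input is replaced with U+FFFD
# instead of raising an error.
sub bytes_to_text { return map { decode( $ENCODING, $_ ) } @_ }

# Encode text back into bytes before passing it to open(), unlink(), etc.
sub text_to_bytes { return map { encode( $ENCODING, $_ ) } @_ }

my @args = bytes_to_text(@ARGV);
binmode STDOUT, ":encoding($ENCODING)";    # encode text on output
printf "%s has %d character(s)\n", $_, length($_) for @args;
```

A file name that is not valid UTF-8 will not survive a decode/encode round trip in this mode, so code that must reopen arbitrary existing files should keep the original bytes around as well.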
Re^4: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 02, 2024 at 01:35 UTC
Re^5: how are ARGV and filename strings represented? by NERDVANA (Priest) on May 02, 2024 at 16:48 UTC
Re^5: how are ARGV and filename strings represented? by almr (Beadle) on May 02, 2024 at 18:50 UTC
Re^3: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 01, 2024 at 18:52 UTC
As for Windows, using Win32::LongPath avoids headaches.
Re: how are ARGV and filename strings represented? by harangzsolt33 (Deacon) on May 01, 2024 at 17:50 UTC
If you just type any word, @ARGV will contain that word unless it appears to be a file name with an asterisk, because in that case, you will get a list of file names. I don't know if the shell is responsible for this or if Perl does this, but here is an example:

(owner)~# perl -e ' foreach (@ARGV) { print "\n$_"; } ' hello world *

hello
world
BIN
Desktop
Documents
Downloads
HTML
JSRef
Music
PerlRef
Pictures
Public
SAFE
Scripts
temp
Templates
Videos
(owner)~#

The above one-liner will list all files in the current directory, because I included one asterisk in the argument line.

EDIT: One piece of advice I would have is try to avoid working with files that have any Unicode characters in the file name. I made a little program that renames all files on my computer to standard ASCII names. I had so much trouble with such filenames until I said, "You know what? I'm done with that. I shall never use Unicode chars in file names ever again." Why make your life difficult for no reason? Avoid trouble and stop using Unicode chars in file names. It's that simple. It's the truth. Someone had to say it.
Re^2: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 01, 2024 at 18:36 UTC
That's entirely the shell's doing.
Re^2: how are ARGV and filename strings represented? by soonix (Chancellor) on May 02, 2024 at 09:25 UTC
That depends on the shell. If that shell happens to be CMD.EXE or COMMAND.COM (I think also Powershell.exe and pwsh.exe), the asterisks you type in arguments have to be handled by Perl, because DOS/Windows has a different concept of command lines, probably inherited from CP/M or so.
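Where the shell passes wildcards through untouched, a script can expand them itself. The following is a sketch using the core `File::Glob` module; the sub name `expand_wildcards` is made up for this example, and keeping a pattern that matches nothing as a literal argument is a design choice made here, not something the thread prescribes.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Hypothetical helper: expand arguments that contain wildcard characters
# and leave all other arguments untouched. bsd_glob() is used instead of
# glob() so that a pattern containing spaces is not split into several.
sub expand_wildcards {
    my @args = @_;
    return map {
        my $arg  = $_;
        my @hits = $arg =~ /[*?\[]/ ? bsd_glob($arg) : ();
        @hits ? @hits : $arg;    # no match: keep the literal argument
    } @args;
}

# Only needed where the shell does not expand wildcards (e.g. cmd.exe).
@ARGV = expand_wildcards(@ARGV) if $^O eq 'MSWin32';
print "$_\n" for @ARGV;
```

On Unix-like shells the expansion has already happened before perl starts, which is why the sketch only applies it when `$^O` reports Windows.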
Re^2: how are ARGV and filename strings represented? by Anonymous Monk on May 02, 2024 at 09:45 UTC
This 'advice' is so bad. Someone had to say it.