Re^2: how are ARGV and filename strings represented?

Re^3: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 01, 2024 at 18:33 UTC
> did you mean "encode() unless is_utf8()"?

No. `is_utf8` returns true if the string is stored using the upgraded/wide/UTF8=1 internal storage format. So the sub encodes the string using UTF-8 if it's stored in that format.

    utf8::downgrade( my $d = "\xE3" );
    printf "%vX\n", $d;                             # E3
    printf "%d\n", utf8::is_utf8( $d ) ? 1 : 0;     # 0
    printf "%vX\n", transform( $d );                # E3

    utf8::upgrade( my $u = "\xE3" );
    printf "%vX\n", $u;                             # E3
    printf "%d\n", utf8::is_utf8( $u ) ? 1 : 0;     # 1
    printf "%vX\n", transform( $u );                # C3.A3

    utf8::upgrade( my $w = "\x{2660}" );
    printf "%vX\n", $w;                             # 2660
    printf "%d\n", utf8::is_utf8( $w ) ? 1 : 0;     # 1
    printf "%vX\n", transform( $w );                # E2.99.A0

> `IO::File->new( chr (0xbb), q(>))`

`chr(0xbb)` returns a string using the downgraded/8bit/UTF8=0 internal storage format. Outside of Windows, that will create a file whose name consists of the byte BB. In Windows, that will create a file whose name is the result of `decode( "cp".Win32::GetACP(), chr( 0xBB ) )`.

> So what should I do -- always encode args to open()?

File names in unix are just a sequence of bytes (which may not contain bytes 00 and 2F). It's best if they're text encoded using the current locale, but you could obtain file names which aren't.

For creating a file? `encode` returns a string using the downgraded/8bit/UTF8=0 internal storage format, so it won't be mangled by `open`.
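To illustrate the last point, here is a minimal sketch of encoding explicitly before handing a name to the OS. The helper `name_to_bytes` is hypothetical (it is not from this thread), and the choice of UTF-8 is an assumption; a real program would use whatever encoding its system expects.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not from the thread): always encode a text string
# to bytes before it is used as a file name. UTF-8 is an assumption here.
sub name_to_bytes {
    my ($text) = @_;
    return encode( 'UTF-8', $text );
}

utf8::downgrade( my $d = "\xE3" );    # stored in the UTF8=0 format
utf8::upgrade(   my $u = "\xE3" );    # stored in the UTF8=1 format

my $bytes_d = name_to_bytes($d);
my $bytes_u = name_to_bytes($u);

printf "%s\n", $d eq $u ? "equal" : "different";    # equal
printf "%vX %vX\n", $bytes_d, $bytes_u;             # C3.A3 C3.A3
printf "%d\n", utf8::is_utf8($bytes_d) ? 1 : 0;     # 0
```

Because both results are byte strings in the downgraded format, something like `open( my $fh, '>', $bytes_u )` should see the same bytes no matter how the original string was stored.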
Re^3: how are ARGV and filename strings represented? by NERDVANA (Priest) on May 01, 2024 at 21:00 UTC
> file operators suffer from The Unicode Bug
> What does this mean in this context?

It's not a "bug" per se, just unspecified behavior that people constantly run into. Perl is designed to run on a wide variety of systems, and there is no one-size-fits-all solution for character encoding, so perl just doesn't solve it.

The result is that people on Unix have to explicitly encode and decode the bytes for their file names and environment variables and ARGV and stdin/stdout/stderr, and on Windows things are just kind of broken because the Windows 8-bit APIs use a codepage and don't provide any Unicode workaround (until recently, where they give you a utf-8 codepage that behaves like Unix, but you have to configure that in the manifest of the .exe file and it only works on recent-ish Win10 versions and newer).

See also:
- Handling of Unicode File Names
- What would you like to see in a Virtual Filesystem for Perl?

If you need Unicode filesystem or shell support on Windows, I recommend Cygwin's perl until such time as Strawberry releases a perl.exe with the UTF-8 codepage configured.
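The explicit decoding and encoding described above might look roughly like this on a Unix-like system. This is only a sketch: `decode_args` is a hypothetical helper, and it assumes the environment uses UTF-8 rather than detecting the actual locale.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Assumption: the surrounding environment speaks UTF-8.
my $ENCODING = 'UTF-8';

# Hypothetical helper: turn raw argument bytes into text strings.
sub decode_args {
    my @raw = @_;    # copy, so the caller's data is left untouched
    return map { decode( $ENCODING, $_ ) } @raw;
}

my @args = decode_args(@ARGV);

# Encode text on the way out instead of printing internal bytes as-is.
binmode( STDOUT, ":encoding($ENCODING)" );
binmode( STDERR, ":encoding($ENCODING)" );

print "$_\n" for @args;
```

The same pattern would apply to values read from `%ENV` or from STDIN: decode once at the boundary, work with text inside the program, and encode again when talking to the OS.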
Re^4: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 02, 2024 at 01:35 UTC
> It's not a "bug" per-se

It is. And that's the official name for code that behaves differently depending on the internal representation of a string.

> just unspecified behavior that people constantly run into.

The issue is not that the behaviour is unspecified. A lot of Perl behaviour is unspecified. That's not really an issue because there's only one Perl interpreter. The interpreter is the language, so to speak.

The issue is that two equal strings produce different results. That is most definitely a bug. `$d eq $u` is true in the snippet I provided earlier, so `open` should do the same for both. But it doesn't. That's a bug.

> there is no one-size-fits-all solution for character encoding,

Not so. In this area, it's most definitely possible to provide an interface that works on all systems.

> until such time as Strawberry releases a perl.exe with the UTF-8 codepage configured.

I did file a ticket requesting this some time ago.
Re^5: how are ARGV and filename strings represented? by NERDVANA (Priest) on May 02, 2024 at 16:48 UTC
If I'm writing C, and do dumb things with pointers like:

    void function2(float *point);   /* forward declaration so this compiles */

    void function1() {
        float x, y, z;
        function2(&x);
    }

    void function2(float *point) {
        point[2] = 5;               /* writes past x, hoping to hit z */
    }

That is unspecified behavior. It doesn't generate a warning, and on some hosts it will work and on other hosts it will not work (or even crash) depending on how the compiler chose the internal representation. It isn't a "bug" in the C language, but it's certainly a footgun.

My understanding of Perl's filesystem rules is that the programmer is responsible for performing encoding on every string prior to passing it to the OS, using their knowledge of the OS. If you skip that encoding step (as most of us do, because there's no easy universal facility for us to know which encoding the host is using) then you run the risk of undefined behavior, which may manifest itself differently depending on internal details of the string.

I think it would be very nice to have a facility to know how to encode filesystem strings, and a library that performs that job so that doing it right is easier than not. I wouldn't consider that library to be working around a bug, but rather supplying a missing feature of the language.
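As a rough idea of what such a facility could look like, the sketch below guesses an encoding from the locale (or from the ANSI code page on Windows) and uses it to encode paths. Both `guess_fs_encoding` and `encode_path` are hypothetical names, the UTF-8 fallback is an assumption, and nothing guarantees that the locale's codeset really matches how existing file names were written.

```perl
use strict;
use warnings;
use Encode qw( encode find_encoding );
use I18N::Langinfo qw( langinfo CODESET );

# Hypothetical helper: guess which encoding the OS expects for file names.
sub guess_fs_encoding {
    my $name;
    if ( $^O eq 'MSWin32' ) {
        require Win32;
        $name = 'cp' . Win32::GetACP();     # ANSI code page, e.g. cp1252
    }
    else {
        $name = langinfo( CODESET() );      # codeset of the current locale
    }

    # Fall back to UTF-8 (an assumption) if Encode doesn't know the name.
    my $enc = ( defined $name && length $name ) ? find_encoding($name) : undef;
    return $enc ? $enc->name : 'UTF-8';
}

# Hypothetical helper: encode a text path into bytes for open(), unlink(), etc.
sub encode_path {
    my ($text) = @_;
    return encode( guess_fs_encoding(), $text );
}

printf "file names would be encoded as: %s\n", guess_fs_encoding();
```

A caller would then write something like `open( my $fh, '<', encode_path($text_name) )`, keeping the guesswork in one place.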
Re^6: how are ARGV and filename strings represented? by afoken (Chancellor) on May 05, 2024 at 13:17 UTC
Re^7: how are ARGV and filename strings represented? by choroba (Cardinal) on May 05, 2024 at 15:11 UTC
Re^7: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 05, 2024 at 13:33 UTC
Re^7: how are ARGV and filename strings represented? by afoken (Chancellor) on Oct 01, 2024 at 23:47 UTC
Re^6: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 03, 2024 at 18:06 UTC
Re^5: how are ARGV and filename strings represented? by almr (Beadle) on May 02, 2024 at 18:50 UTC
> `$d eq $u` is true in the snippet I provided earlier, so `open` should do the same for both. But it doesn't. That's a bug.

Yes. But you created `$d` and `$u` using "internal" Perl functions (AFAICT the programmer is not supposed to invoke `upgrade()` / `downgrade()` directly?). Now, can this bug be reproduced using "supported" operations?

You can reproduce it by concatenating the result of `readlink()` or `glob` to a codepoint-string, but that in itself means breaking the conventions (because you're not supposed to mix codepoint-strings with byte-strings).

So now I'm starting to lean toward the position that one should decode ARGV, decode STDIN, decode `readlink()` and `glob` output, and thus always work with codepoint-strings (which can be safely concatenated, trimmed etc. and then passed to `open()`, because `open()` "calls" `transform()`, which detects that it needs to encode them).
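A small sketch of the difference between mixing the two kinds of string and decoding first. The byte string below only stands in for what `readlink()` or `glob` might return under a UTF-8 locale (an assumption), and the final step encodes explicitly rather than relying on what `open()` does with an upgraded string.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Stand-in for bytes returned by readlink() or glob (assumed to be UTF-8).
my $bytes_from_os = "\xC3\xA3";

# Mixing: the bytes are treated as the characters U+C3 and U+A3.
my $mixed = $bytes_from_os . "\x{2660}.txt";
printf "%vX\n", $mixed;        # C3.A3.2660.2E.74.78.74

# Decoding first: work with codepoint-strings, encode at the boundary.
my $text      = decode( 'UTF-8', $bytes_from_os );
my $new_text  = $text . "\x{2660}.txt";
my $new_bytes = encode( 'UTF-8', $new_text );
printf "%vX\n", $new_text;     # E3.2660.2E.74.78.74
printf "%vX\n", $new_bytes;    # C3.A3.E2.99.A0.2E.74.78.74
```

In the mixed case the original two bytes no longer mean the single character they were meant to encode, which is why the convention is to never concatenate byte-strings with codepoint-strings.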
Re^6: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 03, 2024 at 18:32 UTC
Re^6: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 03, 2024 at 18:10 UTC
Re^7: how are ARGV and filename strings represented? by almr (Beadle) on May 05, 2024 at 17:20 UTC
Re^3: how are ARGV and filename strings represented? by ikegami (Patriarch) on May 01, 2024 at 18:52 UTC
As for Windows, using Win32::LongPath avoids headaches.