Re: Matching non-ASCII file contents with file name.

It is claimed that the inverted questionmark is \x00BF, which strangely is C2 BF in UTF-8 ... And those files show up in Apache2's access logs containing the escape sequence "%C2%BF" in the URL in place of the inverted question mark. Yet, shouldn't that be \x00BF all the way through?

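C2 BF is the correct UTF-8 encoding for the Unicode character U+00BF INVERTED QUESTION MARK. U+00BF is the code point, i.e. the character's number; C2 BF are the two bytes that represent that code point in UTF-8, and the "%C2%BF" in the access log is simply those two bytes percent-encoded. A single byte BF would be the Latin-1 (ISO-8859-1) encoding, not UTF-8.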
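To make the code point vs. bytes distinction concrete, here is a minimal sketch using the core Encode module. The helper hex_bytes is just a name I made up for this illustration.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use Encode qw/encode/;

# Hypothetical helper: show each byte of a byte string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $char  = "\N{U+BF}";              # one character, code point U+00BF
my $bytes = encode('UTF-8', $char);  # the same character as UTF-8 bytes

printf "characters: %d, bytes: %d\n", length($char), length($bytes);
print "UTF-8:   ", hex_bytes($bytes), "\n";                         # C2 BF
print "Latin-1: ", hex_bytes(encode('ISO-8859-1', $char)), "\n";    # BF
```

So \x{BF} "all the way through" would only be true of a Latin-1 byte stream; in UTF-8 the character always takes two bytes.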

This question is complicated by the fact that we don't know what your shell/terminal's encoding is, which is why I suspect most of the examples you showed aren't representative.

Plus, AFAIK, file name encodings are a very complicated topic, and therefore I think you might do yourself a favor by not using "¿" in filenames and URLs.

I'll take your "While keeping UTF-8" to mean you want UTF-8 everywhere, so below is one way of getting that - at least on *NIX, where many tools assume UTF-8 filenames anyway and my shell and terminal are also UTF-8. These kinds of filenames may be even more complicated on Windows, I'm not sure.

BTW, my personal preference is using Perl's \x notation only for bytes in the range 00-FF, while using \N{} for Unicode, as I do below. This is why I don't need use utf8; here: this source code is entirely ASCII (but you could add use utf8; and then write ¿ instead of \N{U+BF} if you wanted). If you wanted this code to write UTF-8 to STDOUT, for example with print "Wrote file $newname\n";, you'd need to add use open qw/:std :encoding(UTF-8)/;.
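As a small sketch of that last point, assuming a UTF-8 terminal: with the open pragma in place, the character U+00BF is written to STDOUT as the bytes C2 BF. Without it, Perl would write the single byte BF, which a UTF-8 terminal can't display properly.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
# :std applies the UTF-8 layer to STDIN, STDOUT and STDERR as well
use open qw/:std :encoding(UTF-8)/;

my $newname = "new\N{U+BF}.txt";
print "Wrote file $newname\n";
```

You can check the result with perl thisfile.pl | hexdump -C if you don't trust your terminal.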

$ cat test.pl
#!/usr/bin/env perl
use warnings;
use strict;

my $fname = 'test.txt';
my $newname = "new\N{U+BF}.txt";  # U+00BF INVERTED QUESTION MARK

# create a small input file containing an ASCII question mark
open my $fh, '>:raw:encoding(UTF-8)', $fname or die "$fname: $!";
print $fh "Hello?\n";
close $fh;

# copy it to the new file, replacing every "?" with U+00BF
open my $ofh, '>:raw:encoding(UTF-8)', $newname or die "$newname: $!";
open my $ifh, '<:raw:encoding(UTF-8)', $fname or die "$fname: $!";
while ( my $line = <$ifh> ) {
	$line =~ s/\?/\N{U+BF}/g;
	print $ofh $line;
}
close $ifh;
close $ofh;

$ perl test.pl
$ hexdump -C new¿.txt 
00000000  48 65 6c 6c 6f c2 bf 0a                           |Hello...|
00000008
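Note that the script above hands Perl a character string as the file name. AFAIK that works here because Perl passes the string's internal UTF-8 bytes to the OS, but if you'd rather not depend on that, a more explicit variant is to encode the name yourself and decode whatever readdir returns. This is only a sketch: it assumes a file system that stores names as plain bytes, and it uses a temporary directory so it doesn't touch your files.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use Encode qw/encode decode/;
use File::Temp qw/tempdir/;

my $dir  = tempdir( CLEANUP => 1 );   # scratch directory, removed at exit
my $name = "new\N{U+BF}.txt";         # file name as characters

# explicitly turn the name into UTF-8 bytes before handing it to open
my $path = "$dir/" . encode('UTF-8', $name);
open my $fh, '>:raw:encoding(UTF-8)', $path or die "$path: $!";
print $fh "Hello\N{U+BF}\n";
close $fh;

# readdir returns bytes, so decode them back into characters
opendir my $dh, $dir or die "$dir: $!";
my @found = map { decode('UTF-8', $_) } grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

print "round trip ok\n" if @found == 1 && $found[0] eq $name;
```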

Replies are listed 'Best First'.

Re^2: Matching non-ASCII file contents with file name.
by mldvx4 (Hermit) on Dec 23, 2022 at 06:39 UTC

Thanks! The detailed explanation helped and is appreciated.

"Plus, AFAIK, file name encodings are a very complicated topic, and therefore I think you might do yourself a favor by not using "¿" in filenames and URLs."

Of course. However, there are several reasons:

Use of non-ASCII characters like Ö, Ø, Ó, Ô, 月, 日, or even ¿ or ¡ is to be expected these days, even in file names and thus URLs.

The rename utility listed above deals with the renaming, and seems to match what can be produced manually via a local terminal emulator, a local console, or a remote ssh+tmux connection. So it was my script which was the odd man out and therefore needed correction.

The file names, minus the inverted question mark, are the result of using wget to scrape the output from some legacy PHP scripts which are not / cannot be maintained any more. Aside from the very long file names, the method works reasonably well for converting the whole mess to a static HTML archive. Unfortunately, that leaves a question mark in the file name, and that is not tolerated by web servers, which use it to delimit the end of the file name and the start of the query string. So a replacement character is needed and ¿ seems the least problematic semantically.
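The replacement described above, and the "%C2%BF" that then shows up in the access log, can be sketched as follows. The sub names safe_name and url_path and the sample file name are made up for this illustration, and url_path escapes everything except the RFC 3986 "unreserved" characters, which is stricter than what a browser or server will typically do.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use Encode qw/encode/;

# Hypothetical helper: replace "?" with U+00BF INVERTED QUESTION MARK.
sub safe_name {
    my ($name) = @_;
    ( my $new = $name ) =~ s/\?/\N{U+BF}/g;
    return $new;
}

# Hypothetical helper: UTF-8 encode, then percent-encode each byte
# that is not an "unreserved" character.
sub url_path {
    my ($name) = @_;
    my $bytes = encode('UTF-8', $name);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $old = 'index.php?id=3.html';   # made-up example of a wget output name
my $new = safe_name($old);
print url_path($new), "\n";        # index.php%C2%BFid%3D3.html
```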
