in reply to Matching non-ASCII file contents with file name.
It is claimed that the inverted questionmark is \x00BF, which strangely is C2 BF in UTF-8 ... And those files show up in Apache2's access logs containing the escape sequence "%C2%BF" in the URL in place of the inverted question mark. Yet, shouldn't that be \x00BF all the way through?
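C2 BF is the correct UTF-8 encoding for the Unicode character U+00BF INVERTED QUESTION MARK. UTF-8 encodes every code point from U+0080 through U+07FF as two bytes, so a lone BF byte is what you'd see in Latin-1 (ISO-8859-1), not in UTF-8.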
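To see the character/byte distinction for yourself, here is a minimal sketch using the core Encode module. The variable names are just for illustration, and the percent-escaping is done by hand here rather than with whatever Apache or your browser actually uses.

```perl
use warnings;
use strict;
use Encode qw/encode/;

my $char  = "\N{U+BF}";               # one character: INVERTED QUESTION MARK
my $bytes = encode('UTF-8', $char);   # the same character as UTF-8 bytes

# Hex dump of the encoded bytes
my $hex = uc( unpack 'H*', $bytes );

# Percent-escape every byte, the way it shows up in the access log
my $pct = join '', map { sprintf '%%%02X', ord } split //, $bytes;

printf "U+%04X: %d character, %d bytes in UTF-8: %s (URL-escaped: %s)\n",
    ord($char), length($char), length($bytes), $hex, $pct;
```

So the code point stays U+00BF "all the way through"; only its byte representation depends on the encoding.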
This question is complicated by the fact that we don't know what your shell/terminal's encoding is, which is why I suspect most of the examples you showed aren't representative.
Plus, AFAIK, file name encodings are a very complicated topic, and therefore I think you might do yourself a favor by not using "¿" in filenames and URLs.
I'll take your "While keeping UTF-8" to mean you want UTF-8 everywhere, so below is one way of getting that - at least on *NIX, where many tools assume UTF-8 filenames anyway and my shell and terminal are also UTF-8; these kinds of filenames may be even more complicated on Windows, I'm not sure.
BTW, my personal preference is to use Perl's \x notation only for bytes in the range 00-FF and \N{} for Unicode characters, as I do below. This is why I don't need use utf8; here: the source code is entirely ASCII (but you could add use utf8; and then write ¿ instead of \N{U+BF} if you wanted). If you wanted this code to write UTF-8 to STDOUT, for example with print "Wrote file $newname\n";, you'd need to add use open qw/:std :encoding(UTF-8)/;.
$ cat test.pl
#!/usr/bin/env perl
use warnings;
use strict;
my $fname = 'test.txt';
my $newname = "new\N{U+BF}.txt";
open my $fh, '>:raw:encoding(UTF-8)', $fname or die "$fname: $!";
print $fh "Hello?\n";
close $fh;
open my $ofh, '>:raw:encoding(UTF-8)', $newname or die "$newname: $!";
open my $ifh, '<:raw:encoding(UTF-8)', $fname or die "$fname: $!";
while ( my $line = <$ifh> ) {
    $line =~ s/\?/\N{U+BF}/g;
    print $ofh $line;
}
close $ifh;
close $ofh;
$ perl test.pl
$ hexdump -C new¿.txt
00000000 48 65 6c 6c 6f c2 bf 0a |Hello...|
00000008
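For comparison, here is a rough sketch of the variant mentioned above, with use utf8; and a literal ¿ in the source plus use open for STDOUT. It assumes the script file itself is saved as UTF-8 and that your terminal expects UTF-8. It only prints the name and doesn't create the file.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use utf8;                             # the source code itself contains UTF-8
use open qw/:std :encoding(UTF-8)/;   # encode STDOUT/STDERR, decode STDIN

my $newname = "new¿.txt";             # same string as "new\N{U+BF}.txt"
print "Wrote file $newname\n";        # no "Wide character" surprises here
```

Without the use open line, Perl would print that ¿ as the single Latin-1 byte BF, which a UTF-8 terminal typically shows as a replacement character.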
Re^2: Matching non-ASCII file contents with file name.
by mldvx4 (Hermit) on Dec 23, 2022 at 06:39 UTC