mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:
My goal is to use the substitution operator (s///) to replace occurrences of a question mark (?) with an inverted question mark (¿) on specific lines in a large number of files. I am having trouble with what actually gets substituted inside the file: it does not match what ends up in the file name in the file system. I am grateful for any tips or guidance on how to make what is inside the files match the various file names out in the file system. Perhaps it is a matter of encoding, again?
It is claimed that the inverted question mark is \x00BF, which strangely is C2 BF in UTF-8 according to a Unicode character table site.
In the shell (Bash) on an ext4 file system, that seems to be the case, and the Perl utility rename seems to work that way, too.
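For reference, the two values are not in conflict: 00BF is the code point of the character, while C2 BF is how that code point is serialized in UTF-8. A minimal sketch using the core Encode module (the variable names are only illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00BF is a single character (code point 0xBF) ...
my $char = "\x{BF}";

# ... but its UTF-8 encoding is two bytes.
my $bytes = encode('UTF-8', $char);

printf "code point:  U+%04X\n", ord($char);
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
# prints:
# code point:  U+00BF
# UTF-8 bytes: C2 BF
```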
    $ touch ¿
    $ ls ? > zz
    $ xxd zz
    00000000: c2bf 0a
    $ echo '¿' > yy
    $ xxd yy
    00000000: c2bf 0a
    $ touch xx
    $ rename -v 's/xx/¿/;' xx
    xx not renamed: ¿ already exists
    $ rename --version
    /usr/bin/rename using File::Rename version 1.13, File::Rename::Options version 1.10
And those files show up in Apache2's access logs containing the escape sequence "%C2%BF" in the URL in place of the inverted question mark.
Yet, shouldn't that be \x00BF all the way through? My own Perl scripts work differently:
    $ perl -e 'use utf8; print "¿\n"' > ww
    $ xxd ww
    00000000: bf0a
    $ perl -e 'use utf8; $c="¿\n"; utf8::upgrade($c); print $c' > vv
    $ xxd vv
    00000000: bf0a
Though if I leave out the use utf8 part, then I get the "right" result, but only according to xxd:
    $ perl -e 'print "¿\n"' > uu
    $ xxd uu
    00000000: c2bf 0a
    $ curl --silent --head 'http://localhost/' | grep 'Content-Type'
    Content-Type: text/html; charset=utf-8
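One possible explanation for the difference, offered as a sketch rather than a diagnosis: with use utf8 the literal becomes the single character U+00BF, and a file handle without an encoding layer writes characters below U+0100 as one byte each. Without use utf8 the literal is just the two source bytes, which pass through unchanged. The example below writes to in-memory file handles so the bytes can be inspected directly. It assumes the script itself is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # source literals are decoded into characters

my $text = "¿\n";    # two characters: U+00BF and a newline

# No encoding layer: characters below U+0100 go out as single bytes.
open my $raw, '>', \my $raw_buf or die "open: $!";
print {$raw} $text;
close $raw;

# With a UTF-8 layer: characters are encoded on output.
open my $enc, '>:encoding(UTF-8)', \my $enc_buf or die "open: $!";
print {$enc} $text;
close $enc;

print unpack('H*', $raw_buf), "\n";    # bf0a
print unpack('H*', $enc_buf), "\n";    # c2bf0a
```

For STDOUT, the equivalent of the second handle would be binmode(STDOUT, ':encoding(UTF-8)') before printing.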
While keeping UTF-8, how can I get "¿" inside the files to match the "¿" out in the file name and still look right?
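To make the goal concrete, here is a rough sketch of one way it could work. It assumes the file system stores names as UTF-8 bytes, and the content "?Hola" and the temporary directory are made up for illustration. The idea is to keep text as characters inside the program, encode file contents through an I/O layer, and encode file names by hand.

```perl
use strict;
use warnings;
use utf8;                        # the literal ¿ below is one character
use Encode qw(encode);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# File names reach the OS as bytes, so encode the name explicitly.
my $name = encode('UTF-8', '¿');     # bytes C2 BF
my $path = "$dir/$name";

# Hypothetical content: replace the first "?" on the line.
my $line = "?Hola\n";
$line =~ s/\?/¿/;

# The UTF-8 layer encodes the characters on the way out.
open my $out, '>:encoding(UTF-8)', $path or die "open $path: $!";
print {$out} $line;
close $out or die "close $path: $!";

# Read the raw bytes back to see what landed on disk.
open my $in, '<:raw', $path or die "open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

print unpack('H*', $bytes), "\n";    # c2bf486f6c610a
```

Both the file name and the file contents then carry ¿ as the bytes C2 BF.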
Re: Matching non-ASCII file contents with file name.
by Corion (Patriarch) on Dec 22, 2022 at 12:07 UTC
by mldvx4 (Hermit) on Dec 23, 2022 at 06:39 UTC
by Corion (Patriarch) on Dec 23, 2022 at 07:16 UTC
by hippo (Archbishop) on Dec 23, 2022 at 08:26 UTC
Re: Matching non-ASCII file contents with file name.
by haukex (Archbishop) on Dec 22, 2022 at 12:40 UTC
by mldvx4 (Hermit) on Dec 23, 2022 at 06:39 UTC