
It is claimed that the inverted questionmark is \x00BF, which strangely is C2 BF in UTF-8 ... And those files show up in Apache2's access logs containing the escape sequence "%C2%BF" in the URL in place of the inverted question mark. Yet, shouldn't that be \x00BF all the way through?

C2 BF is the correct UTF-8 encoding for the Unicode character U+00BF INVERTED QUESTION MARK. The %C2%BF in the access log is simply those same two bytes, percent-escaped.
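If you want to check that for yourself, here is a minimal sketch using the core Encode module. The helper names utf8_hex and url_escape are made up for this illustration, and the escaping is only a rough approximation of what a browser or server would do:

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use Encode qw/encode/;

# Hypothetical helper: hex dump of a string's UTF-8 encoding.
sub utf8_hex {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode('UTF-8', $str);
}

# Hypothetical helper: encode to UTF-8, then percent-escape every byte
# outside the "unreserved" set (letters, digits, - . _ ~).
sub url_escape {
    my ($str) = @_;
    my $bytes = encode('UTF-8', $str);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

print utf8_hex("\N{U+BF}"), "\n";            # C2 BF
print url_escape("new\N{U+BF}.txt"), "\n";   # new%C2%BF.txt
```

So U+00BF is the code point, and C2 BF is how that code point is written down as bytes in UTF-8; the two don't contradict each other.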

This question is complicated by the fact that we don't know what your shell/terminal's encoding is, which is why I suspect most of the examples you showed aren't representative.
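One way to at least see what encoding your locale claims is the sketch below, which uses the core POSIX and I18N::Langinfo modules. Note that I'm assuming your terminal emulator actually agrees with the locale settings, which isn't guaranteed:

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use POSIX qw/setlocale LC_CTYPE/;
use I18N::Langinfo qw/langinfo CODESET/;

# Adopt the locale from the environment (LANG, LC_ALL, ...),
# then ask which character encoding that locale uses.
setlocale(LC_CTYPE, '');
my $codeset = langinfo(CODESET);
print "Locale codeset: $codeset\n";
```

If this reports something other than UTF-8, then output that looks wrong in the terminal may just be a display issue rather than a problem with the files themselves.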

Plus, AFAIK, file name encodings are a very complicated topic, and therefore I think you might do yourself a favor by not using "¿" in filenames and URLs.

I'll take your "While keeping UTF-8" to mean you want UTF-8 everywhere, so below is one way of getting that, at least on *NIX, where many tools assume UTF-8 filenames anyway and my shell and terminal are also UTF-8. These kinds of filenames may be even more complicated on Windows; I'm not sure.

BTW, my personal preference is to use Perl's \x notation only for bytes in the range 00-FF, and \N{} for Unicode characters, as I do below. This is why I don't need use utf8; here: the source code is entirely ASCII (but you could add use utf8; and then write ¿ instead of \N{U+BF} if you wanted). If you wanted this code to write UTF-8 to STDOUT, for example with print "Wrote file $newname\n";, you'd need to add use open qw/:std :encoding(UTF-8)/;.

$ cat test.pl
#!/usr/bin/env perl
use warnings;
use strict;

my $fname = 'test.txt';
my $newname = "new\N{U+BF}.txt";

open my $fh, '>:raw:encoding(UTF-8)', $fname or die "$fname: $!";
print $fh "Hello?\n";
close $fh;

open my $ofh, '>:raw:encoding(UTF-8)', $newname or die "$newname: $!";
open my $ifh, '<:raw:encoding(UTF-8)', $fname or die "$fname: $!";
while ( my $line = <$ifh> ) {
	$line =~ s/\?/\N{U+BF}/g;
	print $ofh $line;
}
close $ifh;
close $ofh;

$ perl test.pl
$ hexdump -C new¿.txt 
00000000  48 65 6c 6c 6f c2 bf 0a                           |Hello...|
00000008
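As for the note above about writing UTF-8 to STDOUT, here is a minimal sketch of what that would look like. It is not part of test.pl; it just reuses the same $newname for illustration:

```perl
#!/usr/bin/env perl
use warnings;
use strict;
# Apply a UTF-8 encoding layer to STDIN, STDOUT and STDERR
# (and make it the default for open() calls in this lexical scope).
use open qw/:std :encoding(UTF-8)/;

my $newname = "new\N{U+BF}.txt";
print "Wrote file $newname\n";
```

With the pragma in place, the ¿ should come out of STDOUT as the two bytes C2 BF, which you can confirm by piping the output through hexdump -C. Without it, Perl would write the single byte BF, which a UTF-8 terminal can't display properly.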

In reply to Re: Matching non-ASCII file contents with file name. by haukex
in thread Matching non-ASCII file contents with file name. by mldvx4
