in reply to utf8 in directory and filenames

I did some playing around, and discovered that I could create a file on my Mac OS X box with UTF-8 characters in the file name (using Perl). Of course, to the Unix-based OS, such a file name is just a sequence of bytes, some with their high bits set; their interpretation as characters in a given encoding (and how many octets there may be per character) is of no consequence to the OS or to the library calls that handle file access.
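As a trivial illustration, here's a sketch of that experiment (my own reconstruction, not the exact code I used -- and on HFS+ the stored name may differ, as it turns out below):

#!/usr/bin/perl
# a sketch: the name handed to open() is just a byte string; whether the
# bytes happen to form valid UTF-8 is of no concern to the filesystem calls
use strict;
my $name = "caf\xC3\xA9.txt";   # raw bytes, which happen to be UTF-8 for "café.txt"
open( my $fh, ">", $name ) or die "$name: $!\n";
print $fh "test\n";
close $fh;
print -f $name ? "created $name\n" : "hmm, no such file?\n";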

Looking at the OP's code snippets, there may have been a typo in there, so I'd like to get some clarification. Can you post a minimal, fully functional, self-contained script that demonstrates your problem with the "-f" operator -- and have it produce some informative output that you can post along with the code? Something like this:

#!/usr/bin/perl
use strict;
binmode STDOUT, ":raw";   # make sure output is not "embellished"

my $path = ( $ARGV[0] and -d $ARGV[0] ) ? shift : ".";
opendir( D, $path ) or die "opendir $path: $!\n";
my @files = grep /[^.]/, readdir( D );   # skip "." and ".."
closedir D;

for my $file ( @files ) {
    my $type = ( -f "$path/$file" ) ? "file"
             : ( -d _ )             ? "subd"   # reuse the stat info from -f
             :                        "othr";
    print "$file ==$type\n";
}
If you pipe the output of that snippet to a hex dump tool (like "od -txC"), the result should be informative. I tried that on the directory where I had created a UTF-8 file name, and I got what I expected: a two-byte sequence in the file name that happens to work as a UTF-8 accented character. Doing nothing at all to the string returned by readdir(), I was able to get the correct stat info on that file.
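If no hex dump tool is at hand, the same information can be had directly from Perl; a quick sketch (separate from the snippet above):

#!/usr/bin/perl
# a sketch: print each directory entry's raw bytes in hex, followed by the name
use strict;
my $dir = @ARGV ? shift : ".";
opendir( my $dh, $dir ) or die "opendir $dir: $!\n";
for my $name ( grep !/^\.\.?$/, readdir $dh ) {
    print join( " ", map { sprintf "%02X", ord } split //, $name ), "  $name\n";
}
closedir $dh;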

(The apparent typo in the OP was:  $entry = shift $A; -- I hope you have "use strict;" in there somewhere so that, assuming $A is not explicitly declared, this sort of typo would be caught at compile time. Actually, that statement would cause an error anyway, since shift wants an array rather than a scalar -- but still...)

If your Linux server is anything like my BSD-based Mac, the file names with above-ASCII characters should show up as files (unless they happen to be directories), and their names should contain whatever byte sequence was delivered to the server by those Windows systems -- no matter what encoding that may have been.

Update: Okay, as Juerd has been trying to explain, things are a little more dicey, thanks to Perl 5.8's "special treatment" of bytes/UTF-8 codepoints in the 0x80-0xFF range. (The success I reported above held only when the "wide character" in the file name was above U+00FF.)
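To make that concrete, here's a minimal sketch (my illustration, using the core Encode module) of how the same three characters can reach a system call as two different byte sequences, depending on the encoding step:

use strict;
use Encode qw(encode);

my $chars  = "l\x{00e3}s";                  # 3 characters: l, a-tilde, s
my $latin1 = encode("iso-8859-1", $chars);  # 3 bytes: 6C E3 73
my $utf8   = encode("utf8", $chars);        # 4 bytes: 6C C3 A3 73

print join( " ", map { sprintf "%02X", ord } split //, $_ ), "\n"
    for $latin1, $utf8;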

When I tried the lengthy script you posted in your most recent reply, it broke at line 12 with "Invalid argument" -- for some reason, the string being passed to open() as a file name was not acceptable.

I found I could create a file name containing bytes in the 80-FF range if I did it like this:

use Encode;

$_ = encode("utf8", "l\x{00e3}s");   # make sure the utf8 flag is off!
print "creating $_\n";
open( O, ">", $_ ) or die "$_: $!\n";
print O "test\n";
close O;
But strangely -- and unfortunately -- the actual file name that appears in the directory turns out to be five bytes long instead of the expected four: "l a \xCC \x83 s".

What happened? Perl (or an underlying library?) somehow decided to break the ã character into its two distinct components: the unadorned (ASCII) "a", and the Unicode "combining tilde" (U+0303, which shows up in UTF-8 as the two-byte sequence "\xCC \x83"). Why/how is it happening? I don't know yet.
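For what it's worth, Unicode canonical decomposition (NFD) produces exactly that five-byte sequence. Here's a sketch that reproduces it with the core Unicode::Normalize module -- just a guess at the mechanism, though HFS+ is documented to store file names in a decomposed form, which would fit:

use strict;
use Unicode::Normalize qw(NFD);
use Encode qw(encode);

my $nfd   = NFD("l\x{00e3}s");     # "la\x{0303}s" -- now 4 characters
my $bytes = encode("utf8", $nfd);  # 5 bytes: 6C 61 CC 83 73
print join( " ", map { sprintf "%02X", ord } split //, $bytes ), "\n";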

Well, that's an eye-opener. I wish I knew how to circumvent that sort of behavior, especially since it seems to apply only to bytes/codepoints in the 80-FF range.
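One way to cope, at least when comparing names, might be to normalize both sides to the same form before the comparison -- a sketch, assuming the culprit really is NFD normalization:

use strict;
use Unicode::Normalize qw(NFC);
use Encode qw(decode);

my $from_disk = decode("utf8", "la\xCC\x83s");  # the decomposed name, as read back
my $expected  = "l\x{00e3}s";                   # the precomposed name we asked for
print NFC($from_disk) eq NFC($expected) ? "same name\n" : "different names\n";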

Re^2: utf8 in directory and filenames
by almut (Canon) on Nov 14, 2006 at 14:21 UTC

    No attempt to explain what happened in your case, just one more data point: I ran your code snippet (the one with the \x{00e3} char) on several Linux boxes and got a filename represented by the 4-byte sequence "l \xC3 \xA3 s" (i.e. the UTF-8 encoding, as expected). So it doesn't seem to be Perl that's doing the conversion you observe...