I did some playing around and discovered that I could create a file on my Mac OS X machine with utf8 characters in the file name (using perl). Of course, to a unix-based OS, such a file name is just a string of bytes, some of which have the high bit set; how those bytes are interpreted as characters in a given encoding (and how many octets there are per character) is of no consequence to the OS or to the library calls that handle file access.
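
To make the character/octet distinction concrete, here is a minimal sketch. The helper name name_lengths and the sample file name are made up for illustration; the only assumption is that the name gets encoded as UTF-8 before it reaches the OS.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return (character count, octet count) for a name,
# assuming the name is written out as UTF-8.
sub name_lengths {
    my ($chars) = @_;
    my $octets = encode( "UTF-8", $chars );
    return ( length($chars), length($octets) );
}

# Illustrative name only: U+00EF takes 2 octets in UTF-8, U+263A takes 3.
my $name = "na\x{ef}ve-\x{263a}.txt";
my ( $nchars, $nbytes ) = name_lengths($name);
print "characters: $nchars, octets: $nbytes\n";   # characters: 11, octets: 14
```

The OS only ever sees the 14 octets; the "11 characters" view exists only inside the program that knows which encoding was used.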

Looking at the OP's code snippets, there may have been a typo there, so I'd like to get better clarification. Can you post a minimal, fully-functional and self-contained script that demonstrates your problem with the "-f" operator -- and have it produce some informative output that you can post along with the code? Something like this:

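#!/usr/bin/perl

use strict;
use warnings;

binmode STDOUT, ":raw";  # make sure output is not "embellished"

# use the directory named on the command line, else the current one
my $path = ( $ARGV[0] and -d $ARGV[0] ) ? shift : ".";

opendir( my $dh, $path ) or die "$path: $!\n";
my @files = grep /[^.]/, readdir( $dh );  # skip "." and ".."
closedir( $dh );

for my $file ( @files ) {
    # "_" reuses the stat info fetched by the preceding -f test
    my $type = ( -f "$path/$file" ) ? "file" :
               ( -d _ ) ? "subd" : "othr";
    print "$file ==$type\n";
}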

If the output of that snippet is piped to a hex dump tool (like "od -txC"), the result would be informative. I tried that on the directory where I created a utf8 file name, and I got what I expected: a two-byte sequence in the file name that happens to work as a utf8 accented character. Doing nothing at all to the string returned by readdir(), I was able to get the correct stat info on that file.
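
If no hex dump tool is handy, the same inspection can be done from within perl. This is just a sketch: hex_octets is a made-up helper name, and it assumes the string it gets is a plain byte string, which is what readdir() returns when nothing has been decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex octets.
# Assumes $bytes holds raw octets (no characters above 0xFF).
sub hex_octets {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02x", $_ } unpack( "C*", $bytes );
}

# Sample byte string standing in for a name returned by readdir():
# "l", then the two octets C3 A3, then "s".
my $sample = "l\xc3\xa3s";
print hex_octets($sample), "\n";   # 6c c3 a3 73
```

Calling hex_octets() on each name inside the loop of the script above would show exactly which octets the directory entry contains.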

(The apparent typo in the OP was: $entry = shift $A; -- I hope you have "use strict;" in there somewhere so that, assuming $A is not explicitly declared, this sort of typo would be a compile-time error? Actually, that statement would cause an error anyway, but still...)

If your linux server is anything like my bsd-based mac, the file names with non-ascii characters should show up as files (unless they happen to be directories), and their names should contain whatever byte sequence was delivered to the server by those Windows systems -- no matter what encoding that may have been.

Update: Okay, as Juerd has been trying to explain, things are a little more dicey, thanks to Perl 5.8's "special treatment" of bytes/utf8-codepoints in the 0x80-0xff range. (The success I reported above was only true when the "wide character" in the file name was above U+00FF.)
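
Here is a small sketch of what that "special treatment" looks like from inside perl. The helper name describe is made up; utf8::is_utf8, utf8::upgrade and bytes::length are standard. The point is that one and the same string of characters in the 0x80-0xff range can be held internally in two different ways, and (as far as I understand it, so treat this as an assumption) it is the internal bytes that end up being handed to the OS as the file name.

```perl
use strict;
use warnings;
use bytes ();   # only to get bytes::length(); does not turn on byte semantics

# Hypothetical helper: (utf8 flag, length in characters, length of internal buffer)
sub describe {
    my ($s) = @_;
    return ( utf8::is_utf8($s) ? 1 : 0, length($s), bytes::length($s) );
}

my $plain    = "l\x{e3}s";   # code point below 0x100: stored one byte per character
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same characters, internal storage switched to UTF-8

printf "plain:    flag=%d chars=%d internal bytes=%d\n", describe($plain);
printf "upgraded: flag=%d chars=%d internal bytes=%d\n", describe($upgraded);
print  "equal as strings: ", ( $plain eq $upgraded ? "yes" : "no" ), "\n";
```

The two strings compare as equal, yet one is held as 3 bytes and the other as 4. Explicitly encoding the name to a byte string before passing it to open() or a file test takes that ambiguity out of the picture.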

When I tried the lengthy script you posted in your most recent reply, it broke at line 12 with "Invalid argument" -- for some reason, the string being passed to open() as a file name was not acceptable.

I found I could create a file name containing bytes in the 80-FF range if I did it like this:

use Encode;

$_=encode("utf8","l\x{00e3}s");  # make sure the utf8 flag is off!

print "creating $_\n";
open( O,">",$_ ) or die "$_:$!\n";
print O "test\n";
close O;

But strangely -- and unfortunately -- the actual file name that appears in the directory turns out to be five bytes long instead of the expected four: "l a \xCC \x83 s".

What happened? Perl (or an underlying library?) somehow decided to break the ã character into its two distinct components: the unadorned (ascii) "a", and the unicode "combining tilde" (U+0303, which shows up in utf8 as the two-byte sequence "\xCC \x83"). Why/how is it happening? I don't know yet.
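
For what it's worth, the five-byte name is exactly what you get by putting the string into Unicode's decomposed normalization form (NFD) before encoding it as utf8. Whether the file system or some library layer is actually doing that here is a guess on my part, not something I have verified; the sketch below only shows that the bytes match. The helper name utf8_hex is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: encode a character string as UTF-8 and show the octets in hex.
sub utf8_hex {
    my ($chars) = @_;
    return join " ", map { sprintf "%02x", $_ } unpack( "C*", encode( "UTF-8", $chars ) );
}

my $name = "l\x{00e3}s";
print "NFC: ", utf8_hex( NFC($name) ), "\n";   # NFC: 6c c3 a3 73
print "NFD: ", utf8_hex( NFD($name) ), "\n";   # NFD: 6c 61 cc 83 73
```

If that guess is right, comparing names after running both sides through NFC() (or both through NFD()) would be one way to match what was written against what readdir() hands back.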

Well, that's an eye-opener. I wish I knew how to circumvent that sort of behavior, especially since it seems to apply only to bytes/codepoints in the 80-FF range.

In reply to Re: utf8 in directory and filenames by graff
in thread utf8 in directory and filenames by soliplaya
