Looking at the OP code snippets, there may have been a typo there, so I'd like to get some clarification. Can you post a minimal, fully-functional and self-contained script that demonstrates your problem with the "-f" operator -- and have it produce some informative output that you can post along with the code? Something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    binmode STDOUT, ":raw";   # make sure output is not "embellished"

    my $path = ( $ARGV[0] and -d $ARGV[0] ) ? shift : ".";
    opendir( D, $path ) or die "$path: $!\n";
    my @files = grep /[^.]/, readdir( D );
    closedir( D );
    for my $file ( @files ) {
        my $type = ( -f "$path/$file" ) ? "file"
                 : ( -d _ )             ? "subd"
                 :                        "othr";
        print "$file == $type\n";
    }

If the output of that snippet is piped to a hex dump tool (like "od -txC"), the result would be informative. I tried that on the directory where I created a utf8 file name, and I got what I expected: a two-byte sequence in the file name that happens to work as a utf8 accented character. Doing nothing at all to the string returned by readdir(), I was able to get the correct stat info on that file.
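If "od" isn't handy, here's a rough equivalent in perl itself -- just my own sketch, not part of the snippet above -- that prints each name returned by readdir() together with its raw bytes in hex:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch: show each directory entry as a hex byte sequence, so you can
    # see exactly what bytes readdir() is handing back.
    my $path = ( $ARGV[0] and -d $ARGV[0] ) ? shift : ".";
    opendir( my $dh, $path ) or die "$path: $!\n";
    for my $name ( grep { $_ ne "." and $_ ne ".." } readdir($dh) ) {
        my $hex = join " ", map { sprintf "%02x", ord } split //, $name;
        print "$name => $hex\n";
    }
    closedir $dh;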
(The apparent typo in the OP was: $entry = shift $A; -- I hope you have "use strict;" in there somewhere, so that, assuming $A is not explicitly declared, this sort of typo would be caught as a compile-time error. Actually, that statement would cause an error anyway, but still...)
If your linux server is anything like my BSD-based mac, the file names with above-ascii characters should show up as files (unless they happen to be directories), and their names should contain whatever byte sequence was delivered to the server by those Windows systems -- no matter what encoding that may have been.
Update: Okay, as Juerd has been trying to explain, things are a little more dicey, thanks to Perl 5.8's "special treatment" of bytes/utf8-codepoints in the 0x80-0xff range. (The success I reported above was only true when the "wide character" in the file name was above U+00FF.)
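Just to make that concrete, here's a little demo of my own (not from anyone's posted code) of why the 0x80-0xFF range is the tricky part: the same three-character name can be stored two different ways inside perl, and -- as I understand it -- a filesystem call gets handed whichever internal byte buffer the scalar happens to have:

    use strict;
    use warnings;

    # Sketch (mine): two scalars that are equal as strings, but whose
    # internal byte buffers differ, because the character involved is in
    # the 0x80-0xff range.
    my $as_bytes = "l\xe3s";       # stored as 3 bytes: 6c e3 73 (utf8 flag off)
    my $as_chars = "l\xe3s";
    utf8::upgrade($as_chars);      # same 3 characters, now stored as 6c c3 a3 73 (flag on)

    printf "equal as strings? %s\n", ( $as_bytes eq $as_chars ) ? "yes" : "no";  # yes
    printf "utf8 flags: %d vs %d\n",
        utf8::is_utf8($as_bytes) ? 1 : 0,
        utf8::is_utf8($as_chars) ? 1 : 0;                                        # 0 vs 1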
When I tried the lengthy script you posted in your most recent reply, it broke at line 12 with "Invalid argument" -- for some reason, the string being passed to open() as a file name was not acceptable.
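For what it's worth, the first thing I'd check right before that open() call is what the file-name scalar actually looks like internally. A quick diagnostic sketch of my own ($fname is just a stand-in for whatever variable your script passes to open):

    use strict;
    use warnings;
    use Devel::Peek;                 # core module; Dump() shows the flags and raw PV bytes

    # Diagnostic sketch (mine): inspect a would-be file name before open().
    my $fname = "l\x{00e3}s";        # placeholder value; substitute the real one
    printf "is_utf8=%d  length=%d\n", utf8::is_utf8($fname) ? 1 : 0, length($fname);
    Dump($fname);                    # check FLAGS for UTF8, and look at the PV bytes

Seeing the flag and the PV bytes should at least narrow down which form of the name the OS is objecting to.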
I found I could create a file name containing bytes in the 80-FF range if I did it like this:
    use Encode;
    $_ = encode( "utf8", "l\x{00e3}s" );   # make sure the utf8 flag is off!
    print "creating $_\n";
    open( O, ">", $_ ) or die "$_: $!\n";
    print O "test\n";
    close O;

But strangely -- and unfortunately -- the actual file name that appears in the directory turns out to be five bytes long instead of the expected four: "l a \xCC \x83 s".
What happened? Perl (or an underlying library?) somehow decided to break the ã character into its two distinct components: the unadorned (ascii) "a", and the unicode "combining tilde" (U+0303, which shows up in utf8 as the two-byte sequence "\xCC \x83"). Why/how is that happening? I don't know yet.
Well, that's an eye-opener. I wish I knew how to circumvent that sort of behavior, especially since it seems to apply only to bytes/codepoints in the 80-FF range.
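Another update: if what's going on is standard Unicode decomposition (which is what the mac's HFS+ does to file names -- I can't say whether your linux filesystem does anything similar), then Unicode::Normalize, which is core as of 5.8, can at least recompose the names you read back so they compare equal to what you meant to create. A sketch of mine, not tested against your setup:

    use strict;
    use warnings;
    use Encode qw(decode);
    use Unicode::Normalize qw(NFC);   # core module as of perl 5.8

    # Sketch (mine): recompose a decomposed name read back from the
    # directory, so it matches the precomposed string we started from.
    my $want = "l\x{00e3}s";                      # what we meant: l, a-tilde, s
    my $got  = decode( "utf8", "la\xcc\x83s" );   # what the directory handed back (decomposed)

    printf "before NFC: matches? %s\n", ( $got      eq $want ) ? "yes" : "no";  # no
    printf "after  NFC: matches? %s\n", ( NFC($got) eq $want ) ? "yes" : "no";  # yes

Of course that doesn't change the bytes that end up on disk; it just gives you a consistent form to compare against.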
In reply to Re: utf8 in directory and filenames
by graff
in thread utf8 in directory and filenames
by soliplaya