
Unfortunately, I realise that I don't know how to refer to a previous thread.

But I did find the problem, illustrated by the attached showcase. It is a bit contrived, but it is hard to force Perl to explicitly show what happens. And I believe the problem is rather sneaky. This is not to say that, once you understand the problem, you cannot find a good explanation in the Perl Unicode docs.
In my original problematic program, the name of the directory which I am reading originates in some other area of the program, and is passed as an argument to the sub that does the directory scanning. That is presumably how it ended up utf8-flagged internally, and why, when I concatenated it with the current directory entry, I got a utf8 string which (sometimes) failed to work in -f and open().
I say it's sneaky because the following scenario gives perplexing results:
- suppose you have a variable $path="/abcd", which for some reason has been sneakily utf8-marked internally by Perl
- suppose you use this variable as the name of a directory which you scan with readdir()
- suppose that directory contains 2 files, "josef.txt" and "andré.txt"
- suppose you read the entries one by one into $name, concatenate each with the directory name (as in $full="$path/$name"), and attempt to open the corresponding file $full
Then open(F,"<$full") will work for "josef.txt", and fail for "andré.txt" with a "No such file" error. The pure-ASCII name is stored as the same bytes whether or not the string is utf8-flagged; the accented name is not.
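To make the scenario above visible without touching the file system, here is a minimal sketch. It assumes that -f and open() hand the string's internal buffer to the OS unchanged, which is my understanding of the Perl Unicode docs. The helper internal_hex() is my own name, not part of any module, and utf8::upgrade() merely stands in for whatever marked $path as utf8 in the real program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the bytes in a string's internal buffer.
# For a utf8-flagged string those are its UTF-8 octets; for an unflagged
# string they are the characters themselves.
sub internal_hex {
    my ($s) = @_;                              # work on a copy
    utf8::encode($s) if utf8::is_utf8($s);     # expose the UTF-8 octets
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $s);
}

my $path = "/abcd";
utf8::upgrade($path);    # stand-in for "sneakily utf8-marked"

# Names as readdir() would return them from an iso-8859-1 file system:
# plain bytes, with 0xE9 for the accented e.
for my $name ("josef.txt", "andr\xE9.txt") {
    my $full = "$path/$name";          # concatenation flags the whole string
    my $disk = "/abcd/$name";          # what actually exists on disk
    printf "%s\n  flagged : %s\n  on disk : %s\n  passed  : %s\n",
        internal_hex($name),
        (utf8::is_utf8($full) ? "yes" : "no"),
        internal_hex($disk),
        internal_hex($full);
}
```

For the ASCII-only name the "on disk" and "passed" lines are identical; for the accented name the single byte e9 becomes the pair c3 a9, so the OS is asked for a file that does not exist.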

The attached program can be run in any empty directory. It will start by creating 2 directory entries (files), with the same name but using 2 different encodings. Then it re-reads the directory entries, appends the path and tests the combination.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# At the Beginning, there was an iso-8859-1 name string..
my $testname = "Presentación.txt";
print "starting string [$testname] " . (Encode::is_utf8($testname) ? "(utf8)" : "(bytes)") . "\n";

my $fname_iso = $testname; # simply copying leaves it iso bytes
print "  creating file1 [$fname_iso] " . (Encode::is_utf8($fname_iso) ? "(utf8)" : "(bytes)") . "\n";
open(F1,'>:raw',$fname_iso) or die "cannot open F1 : $!";
print F1 "Hello 1\n";
close F1;

my $fname_utf8 = decode('iso-8859-1',$testname); # force internal utf8
print "  creating file2 [$fname_utf8] " . (Encode::is_utf8($fname_utf8) ? "(utf8)" : "(bytes)") . "\n";
open(F2,'>:raw',$fname_utf8) or die "cannot open F2 : $!";
print F2 "Hello 2\n";
close F2;

my $dir = "."; # that's iso bytes too by default

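opendir(DIR,$dir) or die "cannot open directory $dir : $!";
my @entries = readdir DIR;
closedir DIR; # directory handles are closed with closedir, not close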
foreach (@entries) {
    next if $_ =~ /^\./;
    next if $_ =~ /\.pl$/; # skip myself too
    print "entry [$_] " . (Encode::is_utf8($_) ? "(utf8)" : "(bytes)") . "\n";

    print "  first try :\n";
    if (-f "$dir/$_") { # like this, leaves it as bytes
        print "    passes the -f test,";
        unless (open(F1,'<',"$dir/$_") ) {
            print "  but cannot be opened : $!\n";
        } else {
            print "  and can be opened !\n";
            close F1;
        }
    } else {
        print "    fails the -f test\n";
    }

    print "  2d try :\n";
    my $fullpath = "${dir}/${_}"; # leaves it as bytes also
    print "  trying [$fullpath] " . (Encode::is_utf8($fullpath) ? "(utf8)" : "(bytes)") . "\n";
    if (-f $fullpath) {
        print "    passes the -f test,";
        unless (open(F1,'<',$fullpath) ) {
            print "  but cannot be opened : $!\n";
        } else {
            print "  and can be opened !\n";
            close F1;
        }
    } else {
        print "    fails the -f test\n";
    }

    print "  3d try :\n";
    my $dir_utf = decode('iso-8859-1',$dir); # force internal utf8
    my $fullpath2 = "${dir_utf}/${_}"; # concatenate forces utf8 flag on the whole
    print "  trying [$fullpath2] " . (Encode::is_utf8($fullpath2) ? "(utf8)" : "(bytes)") . "\n";
    if (-f $fullpath2) {
        print "    passes the -f test,";
        unless (open(F1,'<',$fullpath2) ) {
            print "  but cannot be opened : $!\n";
        } else {
            print "  and can be opened !\n";
            close F1;
        }
    } else {
        print "    fails the -f test,";
        unless (open(F1,'<',$fullpath2) ) {
            print "  and fails the open() : $!\n";
        }

    }

}

exit 0;
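As a possible way around the failing third try, here is a hedged sketch that turns the concatenated path back into bytes before it reaches -f or open(). It assumes the names on disk are iso-8859-1, as they are in my case; on a file system with utf8 names a different conversion would be needed. path_as_bytes() is a hypothetical helper name of mine; utf8::downgrade() is the core function it relies on.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $path stored as one byte per
# character (utf8 flag off), or undef if it holds characters above 0xFF
# and so cannot be a byte string at all.
sub path_as_bytes {
    my ($path) = @_;                               # work on a copy
    return utf8::downgrade($path, 1) ? $path : undef;
}

my $dir = ".";
utf8::upgrade($dir);                  # simulate the utf8-flagged directory name
my $name = "Presentaci\xF3n.txt";     # iso-8859-1 bytes, as readdir() returns them

my $flagged = "$dir/$name";           # utf8-flagged after the concatenation
my $bytes   = path_as_bytes($flagged);

print "concatenated : ", (utf8::is_utf8($flagged) ? "(utf8)" : "(bytes)"), "\n";
print "downgraded   : ", (utf8::is_utf8($bytes)   ? "(utf8)" : "(bytes)"), "\n";
```

In the showcase, one would then test -f and open() on the downgraded string rather than on $fullpath2. The second argument to utf8::downgrade() makes it return false instead of dying when the string cannot be represented as bytes.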

P.S. In the meantime, I still don't know how the original user managed to actually upload a file on my DAV server, from his Windows PC, and have the filename on the Linux server be utf8-encoded. All my attempts have resulted in iso-8859-1 names. I strongly suspect that his station was Windows XP Home (while mine is a Pro), and that the DAV client is not the same.

In reply to Re^6: utf8 in directory and filenames by soliplaya
in thread utf8 in directory and filenames by soliplaya
