http://qs1969.pair.com?node_id=1199723

ruimelo73 has asked for the wisdom of the Perl Monks concerning the following question:

I'm Portuguese and, like so many people who live in countries with Latin-derived languages (Portuguese, Spanish, French, Italian, etc.), I have to deal with accented file names. Other languages surely have the same problem (German, Dutch, etc.). The context here is Windows with NTFS drives, which store file names in Unicode. I'm using the latest Perl version, which supports Unicode.

For example, I have a directory/folder in "c:\users\someuser\documents" named "documentação" ("documentation" in English). The full path is "c:\users\someuser\documents\documentação". Now, if I do this:

use strict;

our $dp;
$dp = "c:\\Users\\someuser\\Documents\\documentação";
if (-d $dp) {
    print "ok\n";
} else {
    print "nope\n";
}

It will return "nope"...
If I change the text to "documenta\x{00E7}\x{00E3}o", it returns "ok"...
Printing the string variable will show the same thing...
If I use opendir/readdir in the "c:\users\someuser\documents" directory it will read "documentação" perfectly and -d will work fine...
The -d simply does not work with the direct text on the string variable...
If I set the variable from a command-line argument in a DOS console instead, it also returns "ok".
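A small diagnostic can make the symptoms above easier to interpret: list the code points Perl actually stored for the literal. This is only a sketch, assuming the script file is saved as UTF-8; the `codepoints` helper is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points Perl stored for a string.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $s;
}

# Without "use utf8", each byte of the UTF-8 source becomes one character.
my $as_bytes = do { no utf8; "ção" };

# With "use utf8", the literal is decoded into real characters.
my $as_chars = do { use utf8; "ção" };

# The \x{...} escapes always produce exactly these characters.
my $escaped = "\x{00E7}\x{00E3}o";

printf "no utf8:  %d chars: %s\n", length $as_bytes, codepoints($as_bytes);
printf "use utf8: %d chars: %s\n", length $as_chars, codepoints($as_chars);
printf "escaped:  %d chars: %s\n", length $escaped,  codepoints($escaped);
```

If the "no utf8" line reports five characters while the other two report three, the literal in the script is a sequence of UTF-8 bytes rather than the characters the file system call expects.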

I wasted hours reading Unicode and Perl documentation and trying different methods (utf8, encoding, decoding, locale, etc.) to correct this, but nothing works. It seems to be a problem with the way Perl encodes the string internally. I suppose some Perl command-line option might solve the issue, but that is not the right way to resolve it.

(post edited meanwhile, the solution I have found did not work)

Unicode is a wonderful thing, but reading about its evolution you start to think that Unicode has now reached the same level of confusion as the old codepages... I hope that someone teaches me a lesson, or that this sort of weirdness can be solved in future versions of Perl.

Thank you / Obrigado.

Re: Accent file names issue
by hippo (Bishop) on Sep 20, 2017 at 11:07 UTC
    > I wasted hours reading Unicode and Perl documentation and trying different methods (utf8, encoding, decoding, locale, etc.) to correct this, but nothing works.

    The following works fine for me on Linux. I have used UTF-8 throughout, including the use utf8; pragma in the code. If you are using some non-standard MS encoding at any point then this will surely fail.

    $ mkdir documentação
    $ cat dircode.t
    use strict;
    use warnings;
    use utf8;
    use Encode qw/encode decode/;
    use Test::More tests => 4;

    my $dir_with_codes = "documenta\x{00E7}\x{00E3}o";
    my $dir_without_codes = 'documentação';
    ok (-d encode ('UTF-8', $dir_with_codes), "With codes");
    ok (-d encode ('UTF-8', $dir_without_codes), "Without codes");
    my ($globbed) = <docu*>;
    $globbed = decode ('UTF-8', $globbed);
    is ($globbed, $dir_with_codes, "Glob matches with codes");
    is ($globbed, $dir_without_codes, "Glob matches without codes");
    $ perl dircode.t
    1..4
    ok 1 - With codes
    ok 2 - Without codes
    ok 3 - Glob matches with codes
    ok 4 - Glob matches without codes
    $
      FWIW: MS likes to use UTF-16 for its Unicode encoding, and, in general, perl does not compile against MS' "wide character" API. Using the Win32 family of modules may be useful here.
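To make the encoding point concrete, here is a minimal sketch that encodes the same name three ways and prints the resulting octets. The choice of cp1252 is an assumption (a common ANSI code page on Western European Windows installations); the actual code page of a given machine may differ.

```perl
use strict;
use warnings;
use utf8;                      # the literal below is decoded into characters
use Encode qw(encode);

my $name = 'documentação';

# Encode the same 12 characters three ways and compare the octets.
for my $encoding ('UTF-8', 'cp1252', 'UTF-16LE') {
    my $octets = encode($encoding, $name);
    printf "%-8s %2d bytes: %s\n", $encoding, length $octets,
        join ' ', map { sprintf '%02x', ord } split //, $octets;
}
```

For the single character ç the three encodings give c3 a7, e7 and e7 00 respectively, so a byte string that is correct for one API is wrong for the others.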

      The other interesting caveat is normalization. MS enforces no normalization at all, while Apple has a proprietary normalization. See Unicode::Normalize::Mac
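The normalization caveat can be illustrated with the core Unicode::Normalize module. This sketch only compares strings in memory; it assumes nothing about what any particular file system does with them.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Precomposed form: one code point each for c-cedilla and a-tilde.
my $composed = "documenta\x{00E7}\x{00E3}o";

# Decomposed form: base letter followed by a combining mark.
my $decomposed = NFD($composed);

printf "composed:   %d code points\n", length $composed;
printf "decomposed: %d code points\n", length $decomposed;
print  "equal as-is:     ", ($composed eq $decomposed      ? "yes" : "no"), "\n";
print  "equal after NFC: ", ($composed eq NFC($decomposed) ? "yes" : "no"), "\n";
```

Two names that look identical on screen can therefore fail a plain string comparison unless both are normalized to the same form first.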

      TJD

      Hi,

      It looks like the -X EXPR functions do not use the right representation of the UTF-8 octets.

      Try this tiny test script and you will see.

      (However, I do not really like this -d decode_u8($dir_without_codes) statement, but it works on Windows)

      use strict;
      use warnings;
      use utf8;
      use feature 'unicode_strings';
      use charnames ':full';
      use Test::More tests => 8;

      my $dir_with_codes = "documenta\x{00E7}\x{00E3}o";
      my $dir_without_codes = "documentação";
      my $intrnl_with_codes = "documenta\347\343o";

      print "looking for directory (dir_without_codes): $dir_without_codes\n";

      ok (-d $dir_with_codes, "With codes (1)");
      ok (-d $dir_without_codes, "Without codes (2)"); ### <--- Not OK
      ok (-d decode_u8($dir_without_codes), "Without codes (3)");

      ### Or you can do:
      my $new_dir_without_codes = $dir_without_codes;
      my $success = utf8::decode($new_dir_without_codes);
      ok (-d $new_dir_without_codes, "Without codes (4)");

      my ($globbed) = <docu*>;
      is ($globbed, $dir_with_codes, "Glob matches with codes (5)");
      is ($globbed, $dir_without_codes, "Glob matches without codes (6)");
      is ($globbed, decode_u8($dir_without_codes), "Glob matches without codes (7)");

      ok (-e decode_u8($dir_without_codes), "Without codes (8)");

      sub encode_u8 { my $s = shift; utf8::encode($s); $s };
      sub decode_u8 { my $s = shift; utf8::decode($s); $s };
        > Try this tiny test script and you will see.

        I see that every one of those tests fails:

        $ perl 1199749.pl
        1..8
        Malformed UTF-8 character (unexpected non-continuation byte 0xe3, immediately after start byte 0xe7) at 1199749.pl line 12.
        Malformed UTF-8 character (unexpected non-continuation byte 0x6f, immediately after start byte 0xe3) at 1199749.pl line 12.
        looking for directory (dir_without_codes): documentao
        not ok 1 - With codes (1)
        #   Failed test 'With codes (1)'
        #   at 1199749.pl line 17.
        not ok 2 - Without codes (2)
        #   Failed test 'Without codes (2)'
        #   at 1199749.pl line 18.
        not ok 3 - Without codes (3)
        #   Failed test 'Without codes (3)'
        #   at 1199749.pl line 19.
        not ok 4 - Without codes (4)
        #   Failed test 'Without codes (4)'
        #   at 1199749.pl line 24.
        not ok 5 - Glob matches with codes (5)
        #   Failed test 'Glob matches with codes (5)'
        #   at 1199749.pl line 28.
        #          got: 'documentação'
        #     expected: 'documentação'
        not ok 6 - Glob matches without codes (6)
        #   Failed test 'Glob matches without codes (6)'
        #   at 1199749.pl line 29.
        #          got: 'documentação'
        #     expected: 'documentao'
        not ok 7 - Glob matches without codes (7)
        #   Failed test 'Glob matches without codes (7)'
        #   at 1199749.pl line 30.
        #          got: 'documentação'
        #     expected: 'documentao'
        not ok 8 - Without codes (8)
        #   Failed test 'Without codes (8)'
        #   at 1199749.pl line 32.
        # Looks like you failed 8 tests of 8.

        Unfortunately it appears that it isn't portable. I take it that it runs better on Windows?

      Thank you for your reply. Your solution did not work; I had already tried it while searching for solutions. Linux, Windows, or any other OS should behave the same on this issue, since Unicode was designed to be used widely and not to be OS-dependent. Every time I have had a problem, I found a solution and then developed a routine or library to use everywhere for that context. In this case I'm going mad. I will look at the other contributions for a solution, but my theory is still that the cause is the internal encoding of strings in Perl.

        In countries like yours where non-ASCII characters are rather to be expected, UTF-16 encoding of Unicode is probably more to be expected than UTF-8.
Re: Accent file names issue
by jahero (Pilgrim) on Sep 20, 2017 at 12:55 UTC
Re: Accent file names issue
by vr (Curate) on Sep 20, 2017 at 13:18 UTC

    To add to the link jahero provided, there's a "language for non-Unicode programs" setting in the Control Panel UI. If your paths use only characters belonging to the "code page" chosen there (as is probably the case for most people), try this:

    use strict;
    use warnings;
    use feature 'say';
    use utf8;
    use Win32;
    use Encode qw/ encode decode /;
    use File::Spec::Functions;

    my $parent = canonpath 'c:/Users/someuser/Documents';
    my $folder = 'documentação';
    my $path = catdir $parent, $folder;

    say Win32::GetACP;   # 'ANSI Code Page'
    say Win32::GetOEMCP; # 'OEM Code Page'

    say 'ok' if -d encode('CP'. Win32::GetACP, $path);
    say 'ok' if decode('CP'. Win32::GetOEMCP, qx(dir $parent)) =~ /$folder/;

    Decode from OEMCP, what Windows commands return ('dir', etc.), if you ever need their output.

    Decode from ACP what Perl's commands ('readdir', etc.) return. And encode to ACP, as above, to reach out from Perl and Unicode to Windows and "non-Unicode programs", e.g. with file tests, file access, copying, etc.

    Things get more messy if your paths use characters outside of said "code page".
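Whether a given name falls outside the code page can be checked before touching the file system. The sketch below uses Encode's FB_CROAK check to detect characters that have no mapping; the `fits_codepage` helper is hypothetical, and cp1252 stands in for whatever Win32::GetACP reports on the machine in question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if every character of $name exists in $codepage.
sub fits_codepage {
    my ($name, $codepage) = @_;
    my $copy = $name;   # encode() may modify its input when a CHECK is given
    return eval { encode($codepage, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $portuguese = "documenta\x{00E7}\x{00E3}o";   # c-cedilla and a-tilde
my $greek      = "\x{03AD}\x{03B3}\x{03B3}\x{03C1}\x{03B1}\x{03C6}\x{03BF}";   # Greek letters

for my $name ($portuguese, $greek) {
    printf "%d characters, fits cp1252: %s\n",
        length $name, fits_codepage($name, 'cp1252') ? 'yes' : 'no';
}
```

A name that does not fit cannot be passed to the file tests by encoding to the code page, which is where modules built on the wide-character API come in.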

    > If I use opendir/readdir in the "c:\users\someuser\documents" directory it will read "documentação" perfectly

    No. What it returns is not a Unicode string (no utf8 flag). It's encoded in the 'ANSI Code Page'. That's why "-d will work fine".

    Edit: minor clarifications. + P.S. So, first you encode a Unicode path to ACP to use it as an argument to e.g. opendir, and then you decode from ACP each element of readdir's return list, so that you work with normal Unicode strings inside Perl.
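The encode-on-the-way-out, decode-on-the-way-in pattern can be wrapped in two small helpers. This is a sketch under stated assumptions, not a complete solution: `is_dir_u` and `list_dir_u` are hypothetical names, the code page is taken from Win32::GetACP on Windows, and on other systems the sketch simply assumes UTF-8 file names.

```perl
use strict;
use warnings;
use utf8;    # literal paths written in this file are character strings
use Encode qw(encode decode);

# Assumption: on Windows the file APIs Perl calls use the ANSI code page;
# elsewhere this sketch assumes UTF-8 file names.
my $fs_encoding = 'UTF-8';
if ($^O eq 'MSWin32') {
    require Win32;
    $fs_encoding = 'cp' . Win32::GetACP();
}

# Hypothetical helper: -d for a path given as a Perl character string.
sub is_dir_u {
    my ($path) = @_;
    return -d encode($fs_encoding, $path);
}

# Hypothetical helper: readdir that returns Perl character strings.
sub list_dir_u {
    my ($path) = @_;
    opendir(my $dh, encode($fs_encoding, $path))
        or die "Cannot open directory: $!";
    my @names = map  { decode($fs_encoding, $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}

my @entries = list_dir_u('.');
printf "%d entries in the current directory\n", scalar @entries;
```

Keeping the conversion in one place means the rest of the script can compare and print names as ordinary Unicode strings.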

    P.P.S. Oh, dir $parent must be encoded, too, if non-ASCII characters are involved. Working out which 'code page' to use is left as an exercise for the reader :).

      Thank you for your reply. If you look at all these "tricks", you start to think that Perl's Unicode support (at least in the Windows universe) is going the wrong way. In the old days of codepages, people knew what was going on from the OS itself; Perl did not have much to do with it. With all this Unicode handling now inside Perl's string internals, people have lost control and are unable to move on with simple solutions. I have never encountered such an annoying problem, and this is not what Unicode was created for.

      Look at the pieces of code that people are publishing here... it is madness... simple scripts now have to include weird code like "utf8", "Encode", "Decode", etc. (like a secret project) just to handle string variables... I understand the utf8 and other requirements posted here, but this is not the way, really... this is not the old Perl glamour I once fell in love with... ç, ã and the other characters of Latin-derived languages are used by thousands of millions of people worldwide; it's a huge problem, and I have not yet found a simple and elegant solution for handling file names. Future development of Perl should change this radically, or people in countries with Latin-derived languages will rapidly get fed up with Perl. Unicode handling is difficult, we all know this, but in Perl it is going nuts.

      Sorry if I am exaggerating, but I am stuck on some projects because of this ridiculous problem. I'm wasting hours searching for tricks instead of working on code.

        Hello ruimelo73,

        my warmest welcome to the monastery!!

        > it is madness... Unicode handling is dificult.. ridiculous problem..

        welcome to the post Babel Tower era!

        I'm with you: it is difficult, but it is reality that is difficult, not the Perl way.

        I suggest a very informative read: tchrist about Perl and Unicode: No magic bullet (SO)

        You must be patient and laborious to get it right; it's a narrow path but with perl it's possible.

        Many monks here are skilled at this kind of problem (not me) and you can learn a lot from them.

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
        perl6 -e '$_ = "/bäçelor"; mkdir $_ or die $!; say .IO.d && .IO.e'
        True


        holli

        You can lead your users to water, but alas, you cannot drown them.
Re: Accent file names issue
by haukex (Archbishop) on Sep 21, 2017 at 07:48 UTC

    It's been a while since I did battle with non-ASCII filenames on Windows, but from what I remember, Windows has some "quirks" in how it returns non-ASCII filenames to Perl, which is probably the root of the problem. While I haven't used these modules directly, I have seen them recommended in various other threads on similar topics (1113737, 1162804, 1170561, 1195063), and so I just wanted to highlight the AM's suggestion of Win32::Unicode. Also two of those threads mention Win32::LongPath as an alternative.

    Update: More references: 1220286, 1210032, 1078587

Re: Accent file names issue (Win32::Unicode )
by Anonymous Monk on Sep 20, 2017 at 15:14 UTC
Re: Accent file names issue
by swl (Parson) on Sep 20, 2017 at 21:38 UTC
Re: Accent file names issue
by ruimelo73 (Novice) on Sep 21, 2017 at 21:47 UTC

    I used the suggestions given by "Anonymous Monk" and I managed to find a solution to handle correctly string variables with direct text. Here's the code I created to test some common stuff:

    use strict;
    use warnings;
    use utf8;
    use feature 'unicode_strings';
    use charnames ':full';

    our $base = "c:\\users\\someuser\\documents";
    our $dp = "$base\\documentação";

    {
        my $new_dp = $dp;
        my $success = utf8::decode($new_dp);

        print "isutf8 test 1: ";
        if (utf8::is_utf8($dp)) { print "ok\n"; } else { print "nope\n"; }
        print "isutf8 test 2: ";
        if (utf8::is_utf8($new_dp)) { print "ok\n"; } else { print "nope\n"; }

        print "-d test 1: ";
        if (-d $dp) { print "ok\n"; } else { print "nope\n"; };
        print "-d test 2: ";
        if (-d $new_dp) { print "ok\n"; } else { print "nope\n"; };

        # Directory handles are closed with closedir, not close.
        my $dh;
        print "opendir test 1: ";
        if (opendir($dh, $dp)) { print "ok\n"; closedir($dh); } else { print "nope\n"; }
        print "opendir test 2: ";
        if (opendir($dh, $new_dp)) { print "ok\n"; closedir($dh); } else { print "nope\n"; }

        # chomp removes only a trailing newline; chop would remove any last character.
        my $buf;
        print "dir test 1: ";
        $buf = `dir /b $dp 2> nul`;
        chomp($buf);
        if ($buf ne "") { print "ok\n"; } else { print "nope\n"; }
        print "dir test 2: ";
        $buf = `dir /b $new_dp 2> nul`;
        chomp($buf);
        if ($buf ne "") { print "ok\n"; } else { print "nope\n"; }

        my $r;
        print "dir test 1: ";
        $r = system("dir $dp > nul 2> nul");
        if ($r == 0) { print "ok\n"; } else { print "nope\n"; }
        print "dir test 2: ";
        $r = system("dir $new_dp > nul 2> nul");
        if ($r == 0) { print "ok\n"; } else { print "nope\n"; }
    }

    The output was:

    C:\...>perl perlmonks2.pl
    isutf8 test 1: ok
    isutf8 test 2: nope
    -d test 1: nope
    -d test 2: ok
    opendir test 1: nope
    opendir test 2: ok
    dir test 1: nope
    dir test 2: ok
    dir test 1: nope
    dir test 2: ok

    C:\...>

    I will have to do some tests on other computers and with UNC file paths, but it seems that this is it. Now that it is working I can even live with it, but I really hope that future development of Unicode and Perl can reduce the need for obscure lines of code to do such simple things.

    You guys are great. I wish all the best to everyone who tried to help.

    Experience is the mother of all things, from it we know radically the truth -- Duarte Pacheco Pereira, Portuguese adventurer and the secret discoverer of America

      Note: you only succeed with what you did because Portuguese fits in Latin1, and because utf8::decode modifies its argument even on failure (which is surprising).

      utf8::decode expects octets, but you are feeding it a proper Unicode string. Of course it fails (check the return value), but nevertheless, for whatever vague reason, it converts the utf8 string to a latin1-encoded string, and it seems to do so only when such a conversion is possible.

      use strict;
      use warnings;
      use utf8;

      my $x1 = my $x2 = 'ç';
      die if utf8::decode($x2);

      use Devel::Peek;
      Dump $x1;
      Dump $x2;

      >perl test170921.pl
      SV = PV(0xe1af88) at 0xd23560
        REFCNT = 1
        FLAGS = (POK,IsCOW,pPOK,UTF8)
        PV = 0xd6b458 "\303\247"\0 [UTF8 "\x{e7}"]
        CUR = 2
        LEN = 10
        COW_REFCNT = 1
      SV = PV(0xe1af58) at 0xd22fc0
        REFCNT = 1
        FLAGS = (POK,pPOK)
        PV = 0xd6b5a8 "\347"\0
        CUR = 1
        LEN = 10

      So this encoding from utf8 to latin1 (even though you did it inadvertently with call to decode) is just a special narrow case of what I wrote in this thread about encoding to your system codepage when reaching outside from Perl, and decoding whatever you fetch back.
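For comparison, here is a minimal sketch that puts the accidental route next to an explicit one. It assumes cp1252 is the ANSI code page of the machine (on a real system Win32::GetACP would supply the number), and it only inspects strings in memory rather than touching the file system.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $name = 'documentação';

# The accidental route: utf8::decode fails on this input, yet the thread
# above shows it leaving the string without the utf8 flag.
my $tricked    = $name;
my $decoded_ok = utf8::decode($tricked);

# The explicit route: ask for the octets of a named encoding.
# Assumption: cp1252 is the ANSI code page of the machine.
my $octets = encode('cp1252', $name);

printf "utf8::decode succeeded: %s\n", $decoded_ok ? 'yes' : 'no';
printf "utf8 flag on original:  %s\n", utf8::is_utf8($name)    ? 'on' : 'off';
printf "utf8 flag on tricked:   %s\n", utf8::is_utf8($tricked) ? 'on' : 'off';
printf "utf8 flag on octets:    %s\n", utf8::is_utf8($octets)  ? 'on' : 'off';
printf "explicit octets: %s\n",
    join ' ', map { sprintf '%02x', ord } split //, $octets;
```

The explicit call states which encoding is intended and can be combined with a CHECK argument to fail loudly on names that do not fit, which the utf8::decode side effect cannot do.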