Accent file names issue

ruimelo73 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Accent file names issue by hippo (Bishop) on Sep 20, 2017 at 11:07 UTC
> I wasted hours reading Unicode and Perl documentation, and trying different methods (utf8, encoding, decoding, locale, etc.) for correcting this, but nothing works.

The following works fine for me on Linux. I have used UTF-8 throughout, including the `use utf8;` pragma in the code. If you are using some non-standard MS encoding at any point then this will surely fail.

    $ mkdir documentação
    $ cat dircode.t
    use strict;
    use warnings;
    use utf8;
    use Encode qw/encode decode/;
    use Test::More tests => 4;

    my $dir_with_codes = "documenta\x{00E7}\x{00E3}o";
    my $dir_without_codes = 'documentação';
    ok (-d encode ('UTF-8', $dir_with_codes), "With codes");
    ok (-d encode ('UTF-8', $dir_without_codes), "Without codes");
    my ($globbed) = <docu*>;
    $globbed = decode ('UTF-8', $globbed);
    is ($globbed, $dir_with_codes, "Glob matches with codes");
    is ($globbed, $dir_without_codes, "Glob matches without codes");
    $ perl dircode.t
    1..4
    ok 1 - With codes
    ok 2 - Without codes
    ok 3 - Glob matches with codes
    ok 4 - Glob matches without codes
    $
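The encode-on-the-way-out, decode-on-the-way-in pattern from the test above can be wrapped in two small helpers. This is only a sketch under the same assumption as the test, namely a filesystem whose names are UTF-8 octets (typical on Linux). The names `to_fs`, `from_fs` and `list_dir` are hypothetical and not part of any module, and the scratch directory is assumed to have an ASCII-only path.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Hypothetical helpers: convert between Perl character strings and the
# UTF-8 octets that (by assumption) the filesystem stores.
sub to_fs   { my ($chars)  = @_; return encode('UTF-8', $chars) }
sub from_fs { my ($octets) = @_; return decode('UTF-8', $octets) }

# Read a directory and return its entries as character strings.
sub list_dir {
    my ($dir) = @_;
    opendir(my $dh, to_fs($dir)) or die "opendir failed: $!";
    my @names = map { from_fs($_) } grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return sort @names;
}

my $tmp  = tempdir(CLEANUP => 1);    # scratch directory, assumed ASCII-only path
my $name = 'documentação';
mkdir(to_fs("$tmp/$name")) or die "mkdir failed: $!";

my @found = list_dir($tmp);
binmode STDOUT, ':encoding(UTF-8)';
print "found: $_\n" for @found;
```

Keeping the conversion in one place means the rest of the script only ever sees character strings, which is why the `is` comparisons in the test succeed for both spellings of the name.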
Re^2: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 12:57 UTC
FWIW: MS likes to use UTF-16 for its Unicode encoding, and, in general, perl does not compile against MS's "wide character" API. Using the Win32 family of modules may be useful here. The other interesting caveat is normalization. MS enforces no normalization at all, while Apple has a proprietary normalization. See Unicode::Normalize::Mac.

TJD
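To make the normalization caveat concrete, here is a minimal sketch using the standard NFC/NFD forms from the core Unicode::Normalize module. It does not reproduce the Apple-specific variant that Unicode::Normalize::Mac targets; it only shows why two visually identical names can compare unequal. `same_name` is a hypothetical helper.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Precomposed form: c-cedilla and a-tilde are single code points.
my $composed = "documenta\x{00E7}\x{00E3}o";
# Decomposed form: base letters followed by combining marks.
my $decomposed = NFD($composed);

# Hypothetical helper: compare two names after normalizing both to NFC.
sub same_name {
    my ($x, $y) = @_;
    return NFC($x) eq NFC($y) ? 1 : 0;
}

printf "composed:   %d characters\n", length $composed;
printf "decomposed: %d characters\n", length $decomposed;
printf "raw eq: %s, normalized eq: %s\n",
    ($composed eq $decomposed ? 'yes' : 'no'),
    (same_name($composed, $decomposed) ? 'yes' : 'no');
```

Normalizing both sides before comparing is one way to make a name fetched from disk match a literal in the source, whichever form the filesystem happened to store.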
Re^3: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 20:08 UTC
Win32::Unicode or Win32::Unicode::Native is what you want.
Re^2: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 14:45 UTC
Hi, it looks like the -X EXPR functions do not use the right representation of the UTF-8 octets. Try this tiny test script and you will see. (I do not really like the -d decode_u8($dir_without_codes) statement, but it works on Windows.)

    use strict;
    use warnings;
    use utf8;
    use feature 'unicode_strings';
    use charnames ':full';
    use Test::More tests => 8;

    my $dir_with_codes = "documenta\x{00E7}\x{00E3}o";
    my $dir_without_codes = "documentação";
    my $intrnl_with_codes = "documenta\347\343o";

    print "looking for directory (dir_without_codes): $dir_without_codes\n";

    ok (-d $dir_with_codes, "With codes (1)");
    ok (-d $dir_without_codes, "Without codes (2)"); ### <--- Not OK
    ok (-d decode_u8($dir_without_codes), "Without codes (3)");

    ### Or you can do:
    my $new_dir_without_codes = $dir_without_codes;
    my $success = utf8::decode($new_dir_without_codes);
    ok (-d $new_dir_without_codes, "Without codes (4)");

    my ($globbed) = <docu*>;
    is ($globbed, $dir_with_codes, "Glob matches with codes (5)");
    is ($globbed, $dir_without_codes, "Glob matches without codes (6)");
    is ($globbed, decode_u8($dir_without_codes), "Glob matches without codes (7)");

    ok (-e decode_u8($dir_without_codes), "Without codes (8)");

    sub encode_u8 { my $s = shift; utf8::encode($s); $s };
    sub decode_u8 { my $s = shift; utf8::decode($s); $s };
Re^3: Accent file names issue by hippo (Bishop) on Sep 20, 2017 at 14:57 UTC
> Try this tiny test script and you will see.

I see that every one of those tests fails:

    $ perl 1199749.pl
    1..8
    Malformed UTF-8 character (unexpected non-continuation byte 0xe3, immediately after start byte 0xe7) at 1199749.pl line 12.
    Malformed UTF-8 character (unexpected non-continuation byte 0x6f, immediately after start byte 0xe3) at 1199749.pl line 12.
    looking for directory (dir_without_codes): documentao
    not ok 1 - With codes (1)
    #   Failed test 'With codes (1)'
    #   at 1199749.pl line 17.
    not ok 2 - Without codes (2)
    #   Failed test 'Without codes (2)'
    #   at 1199749.pl line 18.
    not ok 3 - Without codes (3)
    #   Failed test 'Without codes (3)'
    #   at 1199749.pl line 19.
    not ok 4 - Without codes (4)
    #   Failed test 'Without codes (4)'
    #   at 1199749.pl line 24.
    not ok 5 - Glob matches with codes (5)
    #   Failed test 'Glob matches with codes (5)'
    #   at 1199749.pl line 28.
    #          got: 'documentação'
    #     expected: 'documentação'
    not ok 6 - Glob matches without codes (6)
    #   Failed test 'Glob matches without codes (6)'
    #   at 1199749.pl line 29.
    #          got: 'documentação'
    #     expected: 'documentao'
    not ok 7 - Glob matches without codes (7)
    #   Failed test 'Glob matches without codes (7)'
    #   at 1199749.pl line 30.
    #          got: 'documentação'
    #     expected: 'documentao'
    not ok 8 - Without codes (8)
    #   Failed test 'Without codes (8)'
    #   at 1199749.pl line 32.
    # Looks like you failed 8 tests of 8.

Unfortunately it appears that it isn't portable. I take it that it runs better on Windows?
Re^4: Accent file names issue by Anonymous Monk on Sep 21, 2017 at 14:19 UTC
Re^2: Accent file names issue by ruimelo73 (Novice) on Sep 20, 2017 at 17:33 UTC
Thank you for your reply. Your solution did not work; I had already tried it while searching for solutions. Linux, Windows, or any other OS should behave the same on this issue, since Unicode was designed to be used widely and not to be OS dependent. Every time I had a problem I found a solution and then developed a routine or library to use everywhere for that context. In this case I'm going mad. I will look at the other contributions for a solution, but my theory still points to the internal encoding of strings in Perl.
Re^3: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 21:12 UTC
In countries like yours where non-ASCII characters are rather to be expected, UTF-16 encoding of Unicode is probably more to be expected than UTF-8.
Re: Accent file names issue by jahero (Pilgrim) on Sep 20, 2017 at 12:55 UTC
A short search reveals this thread: Perl on Windows: file names with accented characters, UTF-8 and -e. It seems to me there is useful "stuff" there.
Re: Accent file names issue by vr (Curate) on Sep 20, 2017 at 13:18 UTC
To add to the link jahero provided, there's the "language for non-Unicode programs" setting in the Control Panel UI. If your paths use only characters belonging to the "code page" chosen there (as is probably the case for most people), try this:

    use strict;
    use warnings;
    use feature 'say';
    use utf8;
    use Win32;
    use Encode qw/ encode decode /;
    use File::Spec::Functions;

    my $parent = canonpath 'c:/Users/someuser/Documents';
    my $folder = 'documentação';
    my $path = catdir $parent, $folder;

    say Win32::GetACP;   # 'ANSI Code Page'
    say Win32::GetOEMCP; # 'OEM Code Page'

    say 'ok' if -d encode('CP'. Win32::GetACP, $path);
    say 'ok' if decode('CP'. Win32::GetOEMCP, qx(dir $parent)) =~ /$folder/;

Decode from OEMCP what Windows commands ('dir', etc.) return, if you ever need their output. Decode from ACP what Perl's commands ('readdir', etc.) return. And encode to ACP, as above, to reach out from Perl and Unicode to Windows and "non-Unicode programs", e.g. with file tests, file access, copying, etc.

Things get more messy if your paths use characters outside of said "code page".

> If I use opendir/readdir in the "c:\users\someuser\documents" directory it will read "documentação" perfectly

No. What it returns is not a Unicode string (no utf8 flag). It's encoded in the 'ANSI Code Page'. That's why "-d will work fine".

Edit: minor clarifications.

P.S. So, first you encode a UTF-8 path to ACP to use as the argument to e.g. `opendir`, and then decode from ACP each element of `readdir`'s return list, to work in Perl with normal Unicode strings.

P.P.S. Oh, `dir $parent` must be encoded, too, if non-ASCII characters are involved. Let it be an exercise for the reader to work out to which 'code page' :).
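The encode-to-ACP and decode-from-ACP round trip can be tried without a Windows machine. The sketch below hard-codes `cp1252` as a stand-in for `'CP' . Win32::GetACP`, which is an assumption since the real value depends on the machine's "language for non-Unicode programs" setting. The helpers `to_acp`, `from_acp` and `fits_acp` are hypothetical names.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: cp1252 stands in for 'CP' . Win32::GetACP on a
# Western European Windows installation.
my $ACP = 'cp1252';

# Character string -> octets in the ANSI code page; dies if a character
# has no mapping in that code page.
sub to_acp {
    my ($chars) = @_;
    return encode($ACP, $chars, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Octets in the ANSI code page -> character string.
sub from_acp {
    my ($octets) = @_;
    return decode($ACP, $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# True if every character of the name exists in the code page.
sub fits_acp {
    my ($chars) = @_;
    return eval { to_acp($chars); 1 } ? 1 : 0;
}

my $folder = "documenta\x{00E7}\x{00E3}o";
my $octets = to_acp($folder);
print join(' ', map { sprintf '%02X', ord } split //, $octets), "\n";
print "round trip ok\n" if from_acp($octets) eq $folder;
print "CJK name fits: ", (fits_acp("\x{4E2D}\x{6587}") ? 'yes' : 'no'), "\n";
```

The `fits_acp` check corresponds to the "things get more messy" case: once a name contains characters outside the code page, this approach has no octets to hand to the file tests.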
Re^2: Accent file names issue by ruimelo73 (Novice) on Sep 20, 2017 at 18:19 UTC
Thank you for your reply. If you look at all these "tricks" you start thinking that Perl's Unicode support (at least for the Windows universe) is going the wrong way. In the old days of code pages, people knew what was going on from the OS itself; Perl did not have much to do with it. With all this Unicode stuff going into Perl's string internals, people lost control and are unable to move on with simple solutions. I have never come across such an annoying problem; this is not what Unicode was created for.

Look at the pieces of code that people are publishing here... it is madness... simple scripts now have to include weird code like "utf8", "Encode", "Decode", etc. (like a secret project) just to handle string variables... I understand the utf8 and other requirements posted here, but this is not the way, really... this is not the old Perl glamour I once fell in love with... the ç, ã and other characters of Latin languages are used by thousands of millions worldwide; it's a huge problem and I can't yet find a simple and elegant solution for handling file names. Future development of Perl should change this radically, or people in countries with Latin languages will rapidly get fed up with Perl. Unicode handling is difficult, we all know this, but in Perl it is going nuts. Sorry if I am exaggerating, but I am stuck on some projects because of this ridiculous problem. I'm wasting hours searching for tricks instead of working on code.
Re^3: Accent file names issue -- Babel Tower by Discipulus (Canon) on Sep 20, 2017 at 19:49 UTC
Hello ruimelo73, my warmest welcome to the monastery!!

> it is madness... Unicode handling is difficult.. ridiculous problem..

Welcome to the post-Babel-Tower era! I'm with you: it is difficult, but it is reality that is difficult, not the Perl way. I suggest a very informative read: tchrist about Perl and Unicode: No magic bullet (SO). You must be patient and laborious to get it right; it's a narrow path, but with Perl it's possible. Many monks here are skilled at this kind of problem (not me) and you can learn a lot from them.

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re^3: Accent file names issue by holli (Abbot) on Sep 20, 2017 at 19:05 UTC
    perl6 -e '$_ = "/bäçelor"; mkdir $_ or die $!; say .IO.d && .IO.e'
    True

holli

You can lead your users to water, but alas, you cannot drown them.
Re^4: Accent file names issue by ruimelo73 (Novice) on Sep 21, 2017 at 20:28 UTC
Re: Accent file names issue by haukex (Archbishop) on Sep 21, 2017 at 07:48 UTC
It's been a while since I did battle with non-ASCII filenames on Windows, but from what I remember, Windows has some "quirks" in how it returns non-ASCII filenames to Perl, which is probably the root of the problem. While I haven't used these modules directly, I have seen them recommended in various other threads on similar topics (1113737, 1162804, 1170561, 1195063), and so I just wanted to highlight the AM's suggestion of Win32::Unicode. Also two of those threads mention Win32::LongPath as an alternative. Update: More references: 1220286, 1210032, 1078587
Re: Accent file names issue (Win32::Unicode ) by Anonymous Monk on Sep 20, 2017 at 15:14 UTC
See my Win32::Unicode example and the Win32::Unicode module docs. readdir... doesn't do Unicode.
Re: Accent file names issue by swl (Parson) on Sep 20, 2017 at 21:38 UTC
See also https://www.nu42.com/2017/02/perl-unicode-windows-trilogy-one.html and https://www.nu42.com/2017/02/unicode-windows-command-line.html
Re: Accent file names issue by ruimelo73 (Novice) on Sep 21, 2017 at 21:47 UTC
I used the suggestions given by "Anonymous Monk" and managed to find a solution that correctly handles string variables containing literal text. Here's the code I created to test some common operations:

    use strict;
    use warnings;
    use utf8;
    use feature 'unicode_strings';
    use charnames ':full';

    our $base = "c:\\users\\someuser\\documents";
    our $dp = "$base\\documentação";
    {
        my $new_dp = $dp;
        my $success = utf8::decode($new_dp);

        print "isutf8 test 1: ";
        if (utf8::is_utf8($dp)) { print "ok\n"; } else { print "nope\n"; }
        print "isutf8 test 2: ";
        if (utf8::is_utf8($new_dp)) { print "ok\n"; } else { print "nope\n"; }

        print "-d test 1: ";
        if (-d $dp) { print "ok\n"; } else { print "nope\n"; };
        print "-d test 2: ";
        if (-d $new_dp) { print "ok\n"; } else { print "nope\n"; };

        # Directory handles are closed with closedir, not close.
        my $dh;
        print "opendir test 1: ";
        if (opendir($dh, $dp)) { print "ok\n"; closedir($dh); } else { print "nope\n"; }
        print "opendir test 2: ";
        if (opendir($dh, $new_dp)) { print "ok\n"; closedir($dh); } else { print "nope\n"; }

        # 'dir' through backticks: non-empty output means the path was found.
        my $buf;
        print "dir test 1: ";
        $buf = `dir /b $dp 2> nul`; chop($buf);
        if ($buf ne "") { print "ok\n"; } else { print "nope\n"; }
        print "dir test 2: ";
        $buf = `dir /b $new_dp 2> nul`; chop($buf);
        if ($buf ne "") { print "ok\n"; } else { print "nope\n"; }

        # 'dir' through system: exit status 0 means the path was found.
        my $r;
        print "dir test 1: ";
        $r = system("dir $dp > nul 2> nul");
        if ($r == 0) { print "ok\n"; } else { print "nope\n"; }
        print "dir test 2: ";
        $r = system("dir $new_dp > nul 2> nul");
        if ($r == 0) { print "ok\n"; } else { print "nope\n"; }
    }

The output was:

    C:\...>perl perlmonks2.pl
    isutf8 test 1: ok
    isutf8 test 2: nope
    -d test 1: nope
    -d test 2: ok
    opendir test 1: nope
    opendir test 2: ok
    dir test 1: nope
    dir test 2: ok
    dir test 1: nope
    dir test 2: ok
    C:\...>

I will have to do some tests on other computers and with UNC file paths, but it seems that this is it.
Now that it is working I can even live with it, but I really hope that future development of Unicode support in Perl can reduce the need for obscure lines of code to do such simple things. You guys are great; I wish all the best to whoever tries to help.

Experience is the mother of all things, from it we know radically the truth -- Duarte Pacheco Pereira, Portuguese adventurer and the secret discoverer of America
Re^2: Accent file names issue by vr (Curate) on Sep 22, 2017 at 00:40 UTC
Note: you only ~~get away~~ succeed with what you did because Portuguese uses Latin1, and because `utf8::decode` modifies its argument even on failure (which is surprising). `utf8::decode` expects octets, but you're trying to feed it proper Unicode. Of course it fails (check the return value), but nevertheless, for whatever vague reason, it converts the utf8 string to a latin1-encoded string, -- only when, it seems, such a conversion is possible.

    use strict;
    use warnings;
    use utf8;
    my $x1 = my $x2 = 'ç';
    die if utf8::decode($x2);
    use Devel::Peek;
    Dump $x1;
    Dump $x2;

    >perl test170921.pl
    SV = PV(0xe1af88) at 0xd23560
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0xd6b458 "\303\247"\0 [UTF8 "\x{e7}"]
      CUR = 2
      LEN = 10
      COW_REFCNT = 1
    SV = PV(0xe1af58) at 0xd22fc0
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0xd6b5a8 "\347"\0
      CUR = 1
      LEN = 10

So this encoding from utf8 to latin1 (even though you did it inadvertently with a call to `decode`) is just a special narrow case of what I wrote in this thread about encoding to your system codepage when reaching outside from Perl, and decoding whatever you fetch back.
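The side effect described above looks like an in-place downgrade of the string's internal representation, which is an assumption based on the Devel::Peek dump rather than on documented behaviour of `utf8::decode`. If that reading is right, the same result can be requested explicitly with `utf8::downgrade` and its FAIL_OK argument, as in this minimal sketch. `latin1_octets` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy whose internal buffer holds one
# Latin-1 octet per character, or undef if some character is above U+00FF.
sub latin1_octets {
    my ($chars) = @_;
    # Second argument is FAIL_OK: return false instead of dying.
    return utf8::downgrade($chars, 1) ? $chars : undef;
}

my $c = "\x{E7}";
utf8::upgrade($c);    # force the UTF8-flagged form a literal has under `use utf8`

my $octets = latin1_octets($c);
printf "flag before: %s, after: %s\n",
    (utf8::is_utf8($c)      ? 'on' : 'off'),
    (utf8::is_utf8($octets) ? 'on' : 'off');
print "wide character cannot be downgraded\n"
    unless defined latin1_octets("\x{4E2D}");
```

This only covers names that fit in Latin-1; encoding explicitly to the system code page with Encode states the intent more clearly and reports characters that cannot be represented.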

