Re: Accent file names issue

in reply to Accent file names issue

I wasted hours reading Unicode and Perl documentation, and trying different methods (utf8, encoding, decoding, locale, etc.) to correct this, but nothing works.

The following works fine for me on Linux. I have used UTF-8 throughout, including the use utf8; pragma in the code. If you are using some non-standard MS encoding at any point then this will surely fail.

$ mkdir documentação
$ cat dircode.t 
use strict;
use warnings;
use utf8;
use Encode qw/encode decode/;

use Test::More tests => 4;

my $dir_with_codes = "documenta\x{00E7}\x{00E3}o";
my $dir_without_codes = 'documentação';

ok (-d encode ('UTF-8', $dir_with_codes), "With codes");
ok (-d encode ('UTF-8', $dir_without_codes), "Without codes"); 

my ($globbed) = <docu*>;
$globbed = decode ('UTF-8', $globbed);
is ($globbed, $dir_with_codes, "Glob matches with codes");
is ($globbed, $dir_without_codes, "Glob matches without codes");
$ perl dircode.t
1..4
ok 1 - With codes
ok 2 - Without codes
ok 3 - Glob matches with codes
ok 4 - Glob matches without codes
$
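
For readers wondering why the encode/decode calls in the test above are needed, here is a minimal sketch, assuming a file system that stores names as UTF-8 octets (typical on Linux, not guaranteed elsewhere). The variable names are illustrative only. Perl holds the name as 12 characters, while the directory entry on disk is 14 octets, so the string has to be encoded before it is handed to -d and decoded after it comes back from glob.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# Character string: escapes are used so the result does not depend on
# how this source file itself is saved.
my $name  = "documenta\x{00E7}\x{00E3}o";
# Octets as a UTF-8 file system would store them (assumption, see above).
my $bytes = encode('UTF-8', $name);

printf "characters: %d\n", length $name;    # 12
printf "octets:     %d\n", length $bytes;   # 14

# Decoding the octets gives back the original character string.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ", ($round_trip eq $name ? "matches" : "differs"), "\n";
```

The two non-ASCII letters each take two octets in UTF-8, which accounts for the difference in length.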

Replies are listed 'Best First'.
Re^2: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 12:57 UTC
FWIW: MS likes to use UTF-16 for its Unicode encoding and, in general, perl is not built against MS's "wide character" API. Using the Win32 family of modules may be useful here. The other interesting caveat is normalization. MS enforces no normalization at all, while Apple has a proprietary normalization. See Unicode::Normalize::Mac.

TJD
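
To illustrate the normalization caveat, here is a small sketch using the core Unicode::Normalize module. It only shows the general composed (NFC) versus decomposed (NFD) difference; it does not reproduce any particular operating system's rules, and the variable names are made up for the example. The assumption is that a name coming back from a file system in decomposed form would fail a plain string comparison against a composed literal.

```perl
use strict;
use warnings;
use Unicode::Normalize qw/NFC NFD/;

my $composed   = "documenta\x{00E7}\x{00E3}o";   # precomposed c-cedilla and a-tilde
my $decomposed = NFD($composed);                 # c + U+0327, a + U+0303

printf "NFC length: %d\n", length $composed;     # 12
printf "NFD length: %d\n", length $decomposed;   # 14

# The two strings look identical when rendered but are not equal as strings.
print "raw comparison: ",
    ($composed eq $decomposed ? "equal" : "different"), "\n";

# Normalizing both sides to the same form makes them comparable.
print "after NFC:      ",
    (NFC($decomposed) eq $composed ? "equal" : "different"), "\n";
```

Normalizing both the literal and the file name to one form before comparing is one way to sidestep the mismatch.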
Re^3: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 20:08 UTC
Win32::Unicode or Win32::Unicode::Native is what you want.
Re^2: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 14:45 UTC
Hi,

Looks like the -X EXPR functions do not use the right representation of the UTF-8 octets. Try this tiny test script and you will see. (However, I do not really like this -d decode_u8($dir_without_codes) statement, but it works on Windows.)

use strict;
use warnings;
use utf8;
use feature 'unicode_strings';
use charnames ':full';

use Test::More tests => 8;

my $dir_with_codes = "documenta\x{00E7}\x{00E3}o";
my $dir_without_codes = "documentação";
my $intrnl_with_codes = "documenta\347\343o";

print "looking for directory (dir_without_codes): $dir_without_codes\n";

ok (-d $dir_with_codes, "With codes (1)");
ok (-d $dir_without_codes, "Without codes (2)"); ### <--- Not OK
ok (-d decode_u8($dir_without_codes), "Without codes (3)");

### Or you can do:
my $new_dir_without_codes = $dir_without_codes;
my $success = utf8::decode($new_dir_without_codes);
ok (-d $new_dir_without_codes, "Without codes (4)");

my ($globbed) = <docu*>;

is ($globbed, $dir_with_codes, "Glob matches with codes (5)");
is ($globbed, $dir_without_codes, "Glob matches without codes (6)");
is ($globbed, decode_u8($dir_without_codes), "Glob matches without codes (7)");

ok (-e decode_u8($dir_without_codes), "Without codes (8)");

sub encode_u8 { my $s = shift; utf8::encode($s); $s };
sub decode_u8 { my $s = shift; utf8::decode($s); $s };
Re^3: Accent file names issue by hippo (Bishop) on Sep 20, 2017 at 14:57 UTC
Try this tiny test script and you will see.

I see that every one of those tests fails:

$ perl 1199749.pl
1..8
Malformed UTF-8 character (unexpected non-continuation byte 0xe3, immediately after start byte 0xe7) at 1199749.pl line 12.
Malformed UTF-8 character (unexpected non-continuation byte 0x6f, immediately after start byte 0xe3) at 1199749.pl line 12.
looking for directory (dir_without_codes): documentao
not ok 1 - With codes (1)
#   Failed test 'With codes (1)'
#   at 1199749.pl line 17.
not ok 2 - Without codes (2)
#   Failed test 'Without codes (2)'
#   at 1199749.pl line 18.
not ok 3 - Without codes (3)
#   Failed test 'Without codes (3)'
#   at 1199749.pl line 19.
not ok 4 - Without codes (4)
#   Failed test 'Without codes (4)'
#   at 1199749.pl line 24.
not ok 5 - Glob matches with codes (5)
#   Failed test 'Glob matches with codes (5)'
#   at 1199749.pl line 28.
#          got: 'documentação'
#     expected: 'documentação'
not ok 6 - Glob matches without codes (6)
#   Failed test 'Glob matches without codes (6)'
#   at 1199749.pl line 29.
#          got: 'documentação'
#     expected: 'documentao'
not ok 7 - Glob matches without codes (7)
#   Failed test 'Glob matches without codes (7)'
#   at 1199749.pl line 30.
#          got: 'documentação'
#     expected: 'documentao'
not ok 8 - Without codes (8)
#   Failed test 'Without codes (8)'
#   at 1199749.pl line 32.
# Looks like you failed 8 tests of 8.

Unfortunately it appears that it isn't portable. I take it that it runs better on Windows?
Re^4: Accent file names issue by Anonymous Monk on Sep 21, 2017 at 14:19 UTC
As stated, this only works on Windows; also, the OP asked for Windows support. But on Windows I myself would prefer to use Win32::LongPath, because it supports real Unicode directory names and also file names up to 32000 characters.

Have a look at the following script (you have to create these directories first):

use strict;
use warnings;
use utf8;
use feature 'unicode_strings';
use charnames ':full';

binmode(STDOUT, ":unix:utf8");

my @strange_dirs = (
    'documentação',
    'AC_RAÍZ_CERTICÁMARA_S',
    'ÐšÐ°Ñ‚ÑŽÑˆÐ°',
    'Катюша',
    'москва',
    'ελληνικά-русский',
);

for my $dir (@strange_dirs) {
    print "Looking for directory: [$dir]\n";
    if (-d decode_u8($dir)) {
        print "Directory found: [$dir]\n";
    }
    else {
        print "---> Error: Directory [$dir] not found <---\n";
    }
}

print "---- now the same with Win32::LongPath ----\n";

use Win32::LongPath;

for my $dir (@strange_dirs) {
    print "Looking for directory: [$dir]\n";
    if (testL ('d', $dir)) { # same as -d $dir
        print "Directory found: [$dir]\n";
    }
    else {
        print "---> Error: Directory [$dir] not found <---\n";
    }
}

sub encode_u8 { my $s = shift; utf8::encode($s); $s };
sub decode_u8 { my $s = shift; utf8::decode($s); $s };

You will see that Win32::LongPath correctly finds all these directories, but -d decode_u8($dir) doesn't.

My test environment is: Windows 10 Version 10.0.14393, Perl 64-bit (revision 5 version 22 subversion 2).
Re^2: Accent file names issue by ruimelo73 (Novice) on Sep 20, 2017 at 17:33 UTC
Thank you for your reply. Your solution did not work; I had already tried it while searching for solutions. Linux, Windows or any other OS should behave the same on this issue, since Unicode was defined to be used widely and not to be OS dependent. Every time I have had a problem, I found a solution and then developed a routine or library to use everywhere in that context. In this case I'm going mad. I will look at the other contributions for a solution, but my theory is still that the problem lies in Perl's internal encoding of strings.
Re^3: Accent file names issue by Anonymous Monk on Sep 20, 2017 at 21:12 UTC
In countries like yours where non-ASCII characters are rather to be expected, UTF-16 encoding of Unicode is probably more to be expected than UTF-8.
