http://qs1969.pair.com?node_id=1199725


in reply to Accent file names issue

I wasted hours reading Unicode and Perl documentation, and trying different methods (utf8, encoding, decoding, locale, etc.) to correct this, but nothing works.

The following works fine for me on Linux. I have used UTF-8 throughout, including the use utf8; pragma in the code. If you are using some non-standard MS encoding at any point then this will surely fail.

$ mkdir documentação
$ cat dircode.t
use strict;
use warnings;
use utf8;
use Encode qw/encode decode/;
use Test::More tests => 4;

my $dir_with_codes    = "documenta\x{00E7}\x{00E3}o";
my $dir_without_codes = 'documentação';

ok (-d encode ('UTF-8', $dir_with_codes), "With codes");
ok (-d encode ('UTF-8', $dir_without_codes), "Without codes");

my ($globbed) = <docu*>;
$globbed = decode ('UTF-8', $globbed);
is ($globbed, $dir_with_codes, "Glob matches with codes");
is ($globbed, $dir_without_codes, "Glob matches without codes");
$ perl dircode.t
1..4
ok 1 - With codes
ok 2 - Without codes
ok 3 - Glob matches with codes
ok 4 - Glob matches without codes
$

Re^2: Accent file names issue
by Anonymous Monk on Sep 20, 2017 at 12:57 UTC
    FWIW: MS likes to use UTF-16 for its Unicode encoding, and, in general, perl does not compile against MS' "wide character" API. Using the Win32 family of modules may be useful here.

    The other interesting caveat is normalization. MS enforces no normalization at all, while Apple has a proprietary normalization. See Unicode::Normalize::Mac

    TJD
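To illustrate the normalization caveat, here is a minimal sketch using the core Unicode::Normalize module. It assumes the file names have already been decoded into character strings; the helper name same_name is made up for this example and is not part of any module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Precomposed form: c-cedilla and a-tilde are single code points
# (U+00E7 and U+00E3).
my $composed = "documenta\x{00E7}\x{00E3}o";

# Decomposed form: base letters followed by combining marks
# (U+0327 COMBINING CEDILLA, U+0303 COMBINING TILDE).
my $decomposed = "documentac\x{0327}a\x{0303}o";

# The two strings render the same but are different code point sequences.
my $raw_equal = $composed eq $decomposed ? 1 : 0;

# Normalizing both sides to the same form makes them comparable.
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Hypothetical helper: compare two decoded file names regardless of
# which normalization form the file system handed back.
sub same_name {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

printf "raw equal: %d, NFC equal: %d, lengths: %d vs %d\n",
    $raw_equal, $nfc_equal, length($composed), length($decomposed);
```

If a name read from disk never matches a literal in the source, comparing through something like same_name is a cheap way to rule normalization in or out.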
Re^2: Accent file names issue
by Anonymous Monk on Sep 20, 2017 at 14:45 UTC

    Hi,

    Looks like the -X EXPR functions do not use the right representation of the utf8 octets

    Try this tiny test script and you will see.

    (However, I do not really like this -d decode_u8($dir_without_codes) statement, but it works on Windows)

    use strict;
    use warnings;

    use utf8;
    use feature 'unicode_strings';
    use charnames ':full';

    use Test::More tests => 8;


    my $dir_with_codes    = "documenta\x{00E7}\x{00E3}o";
    my $dir_without_codes = "documentação";
    my $intrnl_with_codes = "documenta\347\343o";

    print "looking for directory (dir_without_codes): $dir_without_codes\n";

    ok (-d $dir_with_codes, "With codes (1)");
    ok (-d $dir_without_codes, "Without codes (2)");    ### <--- Not OK
    ok (-d decode_u8($dir_without_codes), "Without codes (3)");

    ### Or you can do:
    my $new_dir_without_codes = $dir_without_codes;
    my $success = utf8::decode($new_dir_without_codes);
    ok (-d $new_dir_without_codes, "Without codes (4)");


    my ($globbed) = <docu*>;
    is ($globbed, $dir_with_codes, "Glob matches with codes (5)");
    is ($globbed, $dir_without_codes, "Glob matches without codes (6)");
    is ($globbed, decode_u8($dir_without_codes), "Glob matches without codes (7)");

    ok (-e decode_u8($dir_without_codes), "Without codes (8)");

    sub encode_u8 { my $s = shift; utf8::encode($s); $s };
    sub decode_u8 { my $s = shift; utf8::decode($s); $s };

      Try this tiny test script and you will see.

      I see that every one of those tests fails:

      $ perl 1199749.pl
      1..8
      Malformed UTF-8 character (unexpected non-continuation byte 0xe3, immediately after start byte 0xe7) at 1199749.pl line 12.
      Malformed UTF-8 character (unexpected non-continuation byte 0x6f, immediately after start byte 0xe3) at 1199749.pl line 12.
      looking for directory (dir_without_codes): documentao
      not ok 1 - With codes (1)
      #   Failed test 'With codes (1)'
      #   at 1199749.pl line 17.
      not ok 2 - Without codes (2)
      #   Failed test 'Without codes (2)'
      #   at 1199749.pl line 18.
      not ok 3 - Without codes (3)
      #   Failed test 'Without codes (3)'
      #   at 1199749.pl line 19.
      not ok 4 - Without codes (4)
      #   Failed test 'Without codes (4)'
      #   at 1199749.pl line 24.
      not ok 5 - Glob matches with codes (5)
      #   Failed test 'Glob matches with codes (5)'
      #   at 1199749.pl line 28.
      #          got: 'documentação'
      #     expected: 'documentação'
      not ok 6 - Glob matches without codes (6)
      #   Failed test 'Glob matches without codes (6)'
      #   at 1199749.pl line 29.
      #          got: 'documentação'
      #     expected: 'documentao'
      not ok 7 - Glob matches without codes (7)
      #   Failed test 'Glob matches without codes (7)'
      #   at 1199749.pl line 30.
      #          got: 'documentação'
      #     expected: 'documentao'
      not ok 8 - Without codes (8)
      #   Failed test 'Without codes (8)'
      #   at 1199749.pl line 32.
      # Looks like you failed 8 tests of 8.

      Unfortunately it appears that it isn't portable. I take it that it runs better on Windows?

        As stated, this only works on Windows. The OP also asked specifically for Windows support.

        On Windows, though, I would prefer Win32::LongPath myself, because it supports real Unicode directory names and also file names of up to 32000 characters.

        Have a look at the following script. (You have to create these directories first.)

        use strict;
        use warnings;
        
        use utf8;
        use feature 'unicode_strings';
        use charnames ':full';
        
        binmode(STDOUT, ":unix:utf8");
        
        
        my @strange_dirs = (
            'documentação',
            'AC_RAÍZ_CERTICÁMARA_S',
            'Катюша',
            'Катюша',
            'москва',
            'ελληνικά-русский',
        );
        
        for my $dir (@strange_dirs)	{
        	print "Looking for directory: [$dir]\n";
        
        	if (-d decode_u8($dir)) {
        		print "Directory found: [$dir]\n";
        	}
        	else	{
        		print "---> Error: Directory [$dir] not found <---\n";
        	}
        }
        
        print "---- now the same with Win32::LongPath ----\n";
        
        use Win32::LongPath;
         
        for my $dir (@strange_dirs)	{
        	print "Looking for directory: [$dir]\n";
        
        	if (testL ('d', $dir)) {	# same as -d $dir
        		print "Directory found: [$dir]\n";
        	}
        	else	{
        		print "---> Error: Directory [$dir] not found <---\n";
        	}
        }

        sub encode_u8 { my $s = shift; utf8::encode($s); $s };
        sub decode_u8 { my $s = shift; utf8::decode($s); $s };

        You will see that Win32::LongPath finds all of these directories, but -d decode_u8($dir) does not.

        My test environment is Windows 10 version 10.0.14393 with 64-bit Perl (revision 5 version 22 subversion 2).

Re^2: Accent file names issue
by ruimelo73 (Novice) on Sep 20, 2017 at 17:33 UTC

    Thank you for your reply. Your solution did not work; I had already tried it while searching for solutions. Linux, Windows or any other OS should behave the same on this issue, since Unicode was designed to be used everywhere and not to be OS dependent. Every time I have had a problem like this, I found a solution and then developed a routine or library to reuse everywhere in that context. In this case I'm going mad. I will look at the other contributions for a solution, but my theory is still that the cause lies in the internal encoding of strings in Perl.
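On the theory about Perl's internal string encoding, a small sketch may help separate the two things that tend to get mixed up: a character string versus its encoded octets, and the internal storage format versus the string's value. It uses only the core Encode module and the built-in utf8:: functions; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: 12 characters, two of them above ASCII.
my $chars = "documenta\x{00E7}\x{00E3}o";

# The same name as UTF-8 octets.
my $bytes = encode('UTF-8', $chars);

# Each accented character takes two octets in UTF-8, so the lengths differ.
my $char_len = length $chars;
my $byte_len = length $bytes;

# utf8::upgrade changes only the internal storage, not the string's value.
my $upgraded = $chars;
utf8::upgrade($upgraded);
my $same_value  = $upgraded eq $chars      ? 1 : 0;
my $flag_before = utf8::is_utf8($chars)    ? 1 : 0;
my $flag_after  = utf8::is_utf8($upgraded) ? 1 : 0;

# Decoding the octets gives back the original character string.
my $round_trip = decode('UTF-8', $bytes) eq $chars ? 1 : 0;

print "characters: $char_len, octets: $byte_len\n";
print "same value after upgrade: $same_value ",
      "(internal flag $flag_before -> $flag_after)\n";
print "round trip ok: $round_trip\n";
```

As far as I understand it, the file test operators hand the string's internal buffer to the OS, so two strings that compare equal can still name different files. That is why encoding explicitly before -d, as in the Linux test above, is the safer habit.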

      In countries like yours where non-ASCII characters are rather to be expected, UTF-16 encoding of Unicode is probably more to be expected than UTF-8.
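To make the UTF-8 versus UTF-16 difference concrete, here is a rough sketch that encodes the same directory name both ways with the core Encode module and dumps the octets. The helper hex_octets is invented for this example. UTF-16LE is used on the assumption that it is the byte order Windows works with.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $name = "documenta\x{00E7}\x{00E3}o";

# The same 12 characters in two different encodings.
my $utf8    = encode('UTF-8',    $name);
my $utf16le = encode('UTF-16LE', $name);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', unpack('(H2)*', $octets);
}

printf "UTF-8    (%d octets): %s\n", length($utf8),    hex_octets($utf8);
printf "UTF-16LE (%d octets): %s\n", length($utf16le), hex_octets($utf16le);
```

The two byte sequences have nothing in common beyond the ASCII letters, so octets meant for one API will not be understood by the other.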