in reply to UTF-8 lexicographic string sort

To implement "UTF-8 lexicographic sorting", you merely have to decode the filenames from UTF-8 into Perl character strings (for example, when reading them from the filesystem via File::Find, use Encode::decode). Note that the filesystem APIs know nothing about UTF-8 or any other filename encoding; they take and return plain octets, so you will have to encode the filenames again when talking to the filesystem. Perl will do the rest when you sort them. For example, the following code should do what you describe:

  use strict;
  use warnings;
  use File::Find;
  use Encode qw(decode encode);

  my @found_files;
  File::Find::find(sub {
      # File::Find hands back raw octets; decode them into Perl characters
      push @found_files, decode('UTF-8', $File::Find::name);
  }, '.');

  @found_files = sort @found_files;

  for my $file (@found_files) {
      # Encode back to octets before talking to the filesystem
      my $fs_name = encode('UTF-8', $file);
      open my $fh, '<', $fs_name
          or die "Couldn't open '$file': $!";
  }
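As a small check of the sorting step: Perl's default sort compares decoded strings by code point, and since UTF-8 preserves code point order, sorting the raw UTF-8 octets should give the same order. The sketch below uses made-up filenames (they are not from the original question) and assumes that `use locale` is not in effect.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical filenames, written as the UTF-8 octets a filesystem would return.
# "\xC3\xA9" is U+00E9 (e with acute), "\xE2\x82\xAC" is U+20AC (euro sign).
my @octets = ("z.txt", "\xC3\xA9.txt", "a.txt", "\xE2\x82\xAC.txt");

# Decode to character strings, then sort by code point (the default sort).
my @decoded      = map { decode('UTF-8', $_) } @octets;
my @by_codepoint = sort @decoded;

# Sort the undecoded octets byte by byte for comparison.
my @by_byte = sort @octets;

# Encode the code point order back to octets so both lists are comparable.
my @reencoded = map { encode('UTF-8', $_) } @by_codepoint;

print join(",", @reencoded) eq join(",", @by_byte)
    ? "same order\n"
    : "different order\n";
```

Decoding still matters for anything beyond plain ordering, such as length, regular expressions, or case folding on the names.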

Re^2: UTF-8 lexicographic string sort
by rdiez (Acolyte) on Apr 23, 2020 at 12:02 UTC

    I am not sure that your code is correct.

    Let us look at this snippet you suggested:

      decode('UTF-8', $File::Find::name)

    The documentation for Encode::decode says:

    This function returns the string that results from decoding the scalar value OCTETS, assumed to be a sequence of octets in ENCODING, into Perl's internal form.

    Your code is therefore assuming that $File::Find::name is in UTF-8, but this may not be correct.
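To make this concern concrete, here is a minimal sketch of what happens when the octets are not UTF-8. The filename is a hypothetical example ("café.txt" stored as Latin-1 octets). By default, decode substitutes U+FFFD for the malformed byte; passing Encode::FB_CROAK makes it die instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical filename "café.txt" as Latin-1 octets (é is the single byte 0xE9).
my $latin1_octets = "caf\xE9.txt";

# Lenient (default) decoding: the invalid byte is replaced by U+FFFD.
my $lenient = decode('UTF-8', $latin1_octets);
printf "lenient: contains U+FFFD? %s\n",
    ($lenient =~ /\x{FFFD}/ ? "yes" : "no");

# Strict decoding: FB_CROAK dies on malformed input.
# LEAVE_SRC keeps decode from modifying $latin1_octets in place.
my $strict = eval {
    decode('UTF-8', $latin1_octets,
           Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "strict: ", (defined $strict ? "decoded\n" : "rejected: $@");
```

With the lenient form, the damaged name can no longer be encoded back to the original octets, so a later open on it would presumably fail.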

      Finding the correct encoding for the filesystem is up to you.

      I'm not aware of any good way to find/know the encoding of the names in a filesystem, so you will have to apply your own knowledge there.
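One way to apply that knowledge is sketched below, as a guess rather than a reliable detection. It assumes the filenames were created under the current locale and takes the encoding name from I18N::Langinfo. The FILENAME_ENCODING environment variable and the decode_filename helper are made up for this illustration, and UTF-8 is used as a last resort.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);

# Assumption: names on disk use the current locale's character encoding.
my $codeset = langinfo(CODESET());   # e.g. "UTF-8" or "ISO-8859-1"

# FILENAME_ENCODING is a hypothetical manual override, not a Perl convention.
my $wanted = $ENV{FILENAME_ENCODING} || $codeset || 'UTF-8';

# Fall back to UTF-8 if Encode does not recognise the name.
my $enc = find_encoding($wanted) || find_encoding('UTF-8');

# Hypothetical helper: turn filesystem octets into a Perl character string.
sub decode_filename {
    my ($octets) = @_;
    return $enc->decode($octets);
}

printf "Assuming filenames are encoded as %s\n", $enc->name;
```

The locale is only a hint: nothing stops a directory from holding names written under a different encoding, so the override remains necessary.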

        I do not understand why the encoding used by the filesystem is relevant. I assume that File::Find, and ultimately the Perl runtime, will abstract all that knowledge and give me a Perl string that my script can safely work with. It does not matter if the filesystem underneath is Windows NTFS and its encoding is UCS-2. The Perl string with the filename will certainly never have UCS-2 encoding.