in reply to Sorting Vietnamese text

"...Vietnamese text file that I would like to sort..."

Could you show an excerpt from the text file, along with a small example of the result you expect?

Regards, Karl

«The Crux of the Biscuit is the Apostrophe»

Re^2: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 22, 2013 at 20:03 UTC
    Here's an example of a short list of words and definitions that I want to sort in the order described above:

    ầm : loud, noisy

    ãm : to carry in the arms

    ấm chè : teapot

    ám số : password, code

    should be

    ám số : password, code

    ấm chè : teapot

    ầm : loud, noisy

    ãm : to carry in the arms

      Correction: should be

      ám số : password, code

      ãm : to carry in the arms

      ấm chè : teapot

      ầm : loud, noisy

        I think that getting Unicode::Collate to work would be the best approach, but here's a hand-rolled one that seems to work the way you want it:

        use utf8;
        use 5.014;
        use warnings;
        use List::Util qw/min/;
        binmode STDOUT, ':encoding(UTF-8)';
        
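        my %order;
        {
            # Collation order: each character's rank is its position in this string.
            my $source = join '', 'aáàảãạăắ',
                         'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
                         'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
                         'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
                         'vwxyýỳỷỹỵz';
            my $cnt = 0;
            $order{$_} = ++$cnt for split //, $source;

            # Compare character by character; unlisted characters (e.g. space) rank as 0.
            sub vcmp($$) {
                my ($x, $y) = @_;    # not $a/$b, to avoid shadowing sort's globals
                for (0 .. min(length($x), length($y)) - 1) {
                    my $cmp = ($order{ substr $x, $_, 1 } // 0)
                              <=> ($order{ substr $y, $_, 1 } // 0);
                    return $cmp if $cmp != 0;
                }
                # One string is a prefix of the other: the shorter one sorts first.
                return length($x) <=> length($y);
            }
        }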
        
        say for sort { vcmp($a, $b) } ('ầm', 'ãm', 'ấm chè', 'ám số');
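
        For a longer word list, a variant of the same idea is to compute a sort key once per line instead of comparing character by character on every call. This is only a sketch: sort_key and %rank are names made up for illustration, it assumes the file uses precomposed (NFC) characters, and unlisted characters such as spaces rank as 0, just as in vcmp.

```perl
use utf8;
use 5.014;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Same alphabet as in the hand-rolled comparison; rank = position in the string.
my $alphabet = join '', 'aáàảãạăắ',
               'ằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễ',
               'ệfghiíìỉĩịjklmnoóòỏõọôốồổ',
               'ỗộơớờởỡợpqrstuúùủũụưứừửữự',
               'vwxyýỳỷỹỵz';
my %rank;
my $n = 0;
$rank{$_} = ++$n for split //, $alphabet;

# Hypothetical helper: encode each character's rank as a 16-bit big-endian
# integer, so that a plain byte-wise "cmp" on the keys follows the alphabet.
sub sort_key {
    my ($str) = @_;
    return pack 'n*', map { $rank{$_} // 0 } split //, $str;
}

my @lines = (
    'ầm : loud, noisy',
    'ãm : to carry in the arms',
    'ấm chè : teapot',
    'ám số : password, code',
);

# Decorate each line with its key, sort on the key, then strip the key again.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ sort_key($_), $_ ] } @lines;

say for @sorted;
```

        With the four sample lines this should print them in the corrected order (ám số, ãm, ấm chè, ầm). The whole line is used as the key here, so the definitions only matter as a tie-breaker between identical headwords.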