Sorting Vietnamese text

by pdenisowski (Acolyte)
on Dec 22, 2013 at 17:08 UTC ( [id://1068103]=perlquestion )

pdenisowski has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all,

I have a very large UTF8 Vietnamese text file that I would like to sort in alphabetical order. The problem is that it seems almost every word processor, utility, etc. out there does not use what I would consider to be the "normal" Vietnamese alphabetical order, usually because it ignores the tone marks (dấu) or puts them in the wrong/random order.

For example, for the first 3 letters of the Vietnamese alphabet I would like to use this sort order:

aáàảãạăắằẳẵặâấầẩẫậ

I've looked at all the different modules, etc. but none of them seem to do this "correctly" (the way most printed dictionaries do). I've also looked at dozens of web pages and can't make any of those examples work properly either.

Any ideas? I've struggled with this for years and would be eternally grateful to anyone who could figure this out.

Thanks,

Paul

(Here is the complete list of letters in the order in which I wish to order them)

aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz

Replies are listed 'Best First'.
Re: Sorting Vietnamese text
by farang (Chaplain) on Dec 22, 2013 at 18:27 UTC

    Update: Sorry, some errors in the code below. In particular, the constructor for the collator should be this.

    my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
    Then the sort method will work as intended. Try it with actual Vietnamese words.

    Unicode::Collate::Locale ought to help. Example code below not using code tags due to display bug with utf8 text.

    #!/usr/bin/env perl
    use v5.14;
    use warnings;
    use utf8::all;
    
    use Unicode::Collate::Locale;
    my $Collator = Unicode::Collate::Locale->new('vi');  # wrong: should be new(locale => 'vi'), see the update above
    
    my @unsorted = qw(
                      a..7
                      ả..3
                      à..9
                      ạ..5
                      ã..4
                      á..1
                      ă..6
                      à..2
                      á..8
                     );
    
    my @sorted = $Collator->sort(@unsorted);
    
    say "unsorted\n@unsorted";
    say "sorted\n@sorted";
    
    Output is as follows.
    unsorted
    a..7 ả..3 à..9 ạ..5 ã..4 á..1 ă..6 à..2 á..8
    sorted
    á..1 à..2 ả..3 ã..4 ạ..5 ă..6 a..7 á..8 à..9
    

    Update #2: The code below actually is a correct example.

    #!/usr/bin/env perl
    use v5.14;
    use warnings;
    use utf8::all;
    use Unicode::Collate::Locale;
    
    my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
    
    my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
    my @sorted = $Collator->sort(@unsorted);
    
    say "unsorted\n@unsorted";
    say "sorted\n@sorted";
    
    Giving the output:
    unsorted
    á ả ã à ậ ă ạ ẫ a ẩ
    sorted
    a à ả ã á ạ ă ẩ ẫ ậ
    

      Giving the output:

      unsorted á ả ã à ậ ă ạ ẫ a ẩ

      sorted a à ả ã á ạ ă ẩ ẫ ậ

      Thanks, but there are two issues: (1) that's still not the correct sort order (á should come before à), and (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.

        (1) that's still not the correct sort order (á should come before à)
        I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
        (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
        Perhaps it has to do with normalization. I still get the same sort order when using it.
        #!/usr/bin/env perl
        use v5.14;
        use warnings;
        use utf8::all;
        use Unicode::Collate::Locale;
        use Unicode::Normalize;
        
        my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
        my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
        @unsorted = map { NFD($_) } @unsorted;
        my @sorted = $Collator->sort(@unsorted);
        
        say NFC("unsorted\n@unsorted");
        say NFC("sorted\n@sorted");
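A minimal sketch of why normalization can matter here, assuming the input file mixes precomposed and decomposed forms of the same letter (the thread does not confirm that it does). The two variables below are purely illustrative: both render as ậ, but they are different strings until they are normalized.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two encodings of the same visible letter (illustrative values only).
my $precomposed = "\x{1EAD}";            # ậ as a single code point
my $decomposed  = "a\x{0323}\x{0302}";   # a + combining dot below + combining circumflex

# A plain string comparison sees two different strings...
printf "equal as typed:  %s\n", $precomposed eq $decomposed ? 'yes' : 'no';

# ...but they become identical once both are brought to the same normal form.
printf "equal after NFC: %s\n", NFC($precomposed) eq NFC($decomposed) ? 'yes' : 'no';

# NFD splits the letter into its base character and its combining marks.
printf "length in NFD:   %d\n", length NFD($precomposed);
```

If two machines give different results for the same script, checking whether the input strings are in the same normal form is one thing worth ruling out.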
        

Re: Sorting Vietnamese text
by Atacama (Sexton) on Dec 22, 2013 at 18:48 UTC
    Given the file is large, you can try using Sort::External supplied with the following sortsub:

        sub { $index{$Sort::External::a} <=> $index{$Sort::External::b} }

    where %index is a hash with the letters of the Vietnamese alphabet as its keys and their corresponding positions in the alphabet as its values.
      Thanks - could you post a small code example? I'm not quite sure how to create and reference the index.
        moritz basically implemented a similar idea (that I formulated incorrectly in my message, btw). I think it will be better to use the unicode collation module advised above.
        #!/usr/bin/env perl
        use warnings;
        use strict;
        use Sort::External;
        use Unicode::Collate::Locale;

        my $in  = shift // 'large-unsorted.txt';
        my $out = shift // 'sorted.txt';

        my $comparator = Unicode::Collate::Locale->new(locale => 'vi');
        my $sorter = Sort::External->new(
            sortsub => sub { $comparator->cmp($Sort::External::a, $Sort::External::b) }
        );

        open my $unsorted, '<', $in or die $!;
        $sorter->feed($_) while <$unsorted>;
        $sorter->finish(outfile => $out);
Re: Sorting Vietnamese text
by karlgoethebier (Abbot) on Dec 22, 2013 at 18:20 UTC
    "...Vietnamese text file that I would like to sort..."

    Perhaps you can show (something from) the text file as well as an intuitive example what you expect/like to do...?

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      Here's an example of a short list of words and definitions that I want to sort in the order described above

      ầm : loud, noisy

      ãm : to carry in the arms

      ấm chè : teapot

      ám số : password, code

      should be

      ám số : password, code

      ấm chè : teapot

      ầm : loud, noisy

      ãm : to carry in the arms

        Correction: should be

        ám số : password, code

        ãm : to carry in the arms

        ấm chè : teapot

        ầm : loud, noisy
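A minimal sketch of one way to get exactly this order, assuming the rule is "compare character by character, using each letter's position in the alphabet list from the question". The names `%index` and `sort_key` are hypothetical, and every character not in the list (spaces, punctuation) is simply ranked before 'a', which is an assumption rather than something the thread specifies.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                                   # this file contains UTF-8 literals
use Unicode::Normalize qw(NFC);
binmode STDOUT, ':encoding(UTF-8)';

# The requested alphabet, one precomposed character per position.
my $alphabet = 'aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmn'
             . 'oóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz';

# Map each letter to its position; 0 is reserved for unlisted characters.
my %index;
my $pos = 1;
$index{$_} = $pos++ for split //, NFC($alphabet);

# Turn a word into a string whose plain "cmp" order follows the alphabet above.
sub sort_key {
    my ($text) = @_;
    return join '', map { chr($index{$_} // 0) } split //, NFC(lc $text);
}

my @words  = ('ầm', 'ãm', 'ấm chè', 'ám số');

# Compute each key once, sort on it, then recover the original words.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ sort_key($_), $_ ] } @words;

print "$_\n" for @sorted;
```

For the four headwords above this prints ám số, ãm, ấm chè, ầm, one per line. Because the key is an ordinary string, the same `sort_key` could in principle be fed to an external sorter for a large file.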

Re: Sorting Vietnamese text
by taint (Chaplain) on Dec 22, 2013 at 20:49 UTC
Re: Sorting Vietnamese text
by Anonymous Monk on Dec 23, 2013 at 06:20 UTC
    Note: take what I say here with a grain of salt since I know no Vietnamese.

    Here's the Vietnamese alphabet sort order. And here's how to read that chart:

    • First column (darkest colour) has the letter in question
    • The other columns have the glyphs that sort under that letter
    • Therefore, ấ and Ầ and ậ sort under â (will be found in the dictionary under the heading 'â')
    • In the case where the two words are otherwise 100% equivalent (except for the diacritics), sort in the left-to-right order given in the chart.

    Here's how I handled Japanese sorting (hiragana only) based on a similar chart for Japanese:

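      And here's what I came up with:
      use strict;
      use warnings;
      use utf8;                            # the tr/// lists below are UTF-8 literals
      binmode STDOUT, ':encoding(UTF-8)';  # print the words as UTF-8

      # Replace every letter with one ASCII character per base letter,
      # so that a plain cmp puts the base letters in alphabetical order.
      sub make_sort_order {
      	my $str = shift;
      	$str =~
      	tr(aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz)
      	  (0000001111112222223456777777888888abcddddddefghijjjjjjkkkkkkllllllmnopqrrrrrrsssssstuvwwwwwwx)d;

      	return $str;
      }

      my @words = ('ầm', 'ãm', 'ấm chè', 'ám số');

      # Sort on the transliterated key first, then on the original string.
      print $_->[1], "\n" for
      	sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] }
      	map  { [ make_sort_order($_), $_ ] } @words;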
      

      It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical); it should not be difficult to add once someone figures out a suitable transliteration that sorts asciibetically.
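A rough sketch of how such a secondary sort might be added, assuming the chart reading given above: compare base letters first (ă, â, đ, ê, ô, ơ, ư treated as letters of their own) and fall back to the tone marks only when the base letters are identical. The tone order used here (none, acute, grave, hook above, tilde, dot below) is taken from the alphabet list in the question; `two_level_key`, `%letter_rank` and `%tone_rank` are hypothetical names.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                                   # this file contains UTF-8 literals
use Unicode::Normalize qw(NFC NFD);
binmode STDOUT, ':encoding(UTF-8)';

# Base letters in dictionary order; unlisted characters get rank 0.
my %letter_rank;
my $rank = 1;
$letter_rank{$_} = $rank++ for split //, NFC('aăâbcdđeêfghijklmnoôơpqrstuưvwxyz');

# Tone marks in the order used by the question's alphabet list.
my %tone_rank = (
    "\x{0301}" => 1,   # acute
    "\x{0300}" => 2,   # grave
    "\x{0309}" => 3,   # hook above
    "\x{0303}" => 4,   # tilde
    "\x{0323}" => 5,   # dot below
);

# Return [primary, secondary]: the letters without tones, then the tones alone.
sub two_level_key {
    my ($word) = @_;
    my ($primary, $secondary) = ('', '');
    for my $char (split //, NFC(lc $word)) {
        my $nfd    = NFD($char);
        my ($mark) = $nfd =~ /([\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}])/;
        my $tone   = defined $mark ? $tone_rank{$mark} : 0;
        # Delete the tone mark but keep breve, circumflex and horn on the letter.
        (my $base = $nfd) =~ tr/\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}//d;
        $primary   .= chr($letter_rank{ NFC($base) } // 0);
        $secondary .= $tone;
    }
    return [ $primary, $secondary ];
}

my @words  = ('ầm', 'ãm', 'ấm chè', 'ám số', 'am', 'ăn');
my @sorted = map  { $_->[1] }
             sort { $a->[0][0] cmp $b->[0][0] || $a->[0][1] cmp $b->[0][1] }
             map  { [ two_level_key($_), $_ ] } @words;

print "$_\n" for @sorted;
```

With these six words the result is am, ãm, ám số, ăn, ầm, ấm chè. Note that this puts ãm before ám số, because the tones are only consulted once the toneless spellings tie; a strict character-by-character order would put ám số first.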

        It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical);

        (laughs) That's hardly an "edge case" in Vietnamese - there are thousands of minimal pairs where the only difference between the words is the diacritical marks. While it's possible to read and understand Vietnamese typed in (7-bit) ASCII without too much ambiguity (i.e. you can figure out what word is meant from the context), this obviously wouldn't work for a dictionary.

        The other issue is that the words in the dictionary need to be sorted in the "correct" order for me to detect duplicates, etc.

        I'll try out your suggestion later today - thanks again!

Re: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 27, 2013 at 00:50 UTC
    Thanks again to all for the replies -- they've been very helpful.

    My Vietnamese-English dictionary project went live on Christmas, here's a link:

    http://www.denisowski.org/Vietnamese/Vietnamese.html

    This is the 52,000+ line UTF8 text file I'm trying to sort the way I described. :)

    Thanks again!

    Paul

      I am trying to do this very thing with a Vietnamese text. What finally worked?
