Sorting Vietnamese text

pdenisowski has asked for the wisdom of the Perl Monks concerning the following question:

Re: Sorting Vietnamese text by farang (Chaplain) on Dec 22, 2013 at 18:27 UTC
Update: Sorry, some errors in the code below. In particular, the constructor for the collator should be this: `my $Collator = Unicode::Collate::Locale->new(locale =>'vi');` Then the sort method will work as intended. Try it with actual Vietnamese words.

Unicode::Collate::Locale ought to help. Example code below not using code tags due to display bug with utf8 text.

    #!/usr/bin/env perl
    use v5.14;
    use warnings;
    use utf8::all;
    use Unicode::Collate::Locale;

    my $Collator = Unicode::Collate::Locale->new('vi');
    my @unsorted = qw( a..7 ả..3 à..9 ạ..5 ã..4 á..1 ă..6 à..2 á..8 );
    my @sorted = $Collator->sort(@unsorted);
    say "unsorted\n@unsorted";
    say "sorted\n@sorted";

Output is as follows.

    unsorted
    a..7 ả..3 à..9 ạ..5 ã..4 á..1 ă..6 à..2 á..8
    sorted
    á..1 à..2 ả..3 ã..4 ạ..5 ă..6 a..7 á..8 à..9

Update #2: The code below actually is a correct example.

    #!/usr/bin/env perl
    use v5.14;
    use warnings;
    use utf8::all;
    use Unicode::Collate::Locale;

    my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
    my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
    my @sorted = $Collator->sort(@unsorted);
    say "unsorted\n@unsorted";
    say "sorted\n@sorted";

Giving the output:

    unsorted
    á ả ã à ậ ă ạ ẫ a ẩ
    sorted
    a à ả ã á ạ ă ẩ ẫ ậ
Re^2: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 22, 2013 at 20:07 UTC
"Giving the output: unsorted á ả ã à ậ ă ạ ẫ a ẩ sorted a à ả ã á ạ ă ẩ ẫ ậ"

Thanks, but there are two issues: (1) that's still not the correct sort order (á should come before à), and (2) I actually get a different "sorted" list when I run the exact same code. This is the problem that I have: it seems the sort algorithms ignore the tone marks.
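One way tone marks can end up being ignored is a collator restricted to primary strength. The following is a minimal sketch of that effect, assuming only the core Unicode::Collate module; the variable names are illustrative, and nothing in the thread confirms that this is what happened in the run described above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Default collator: compares base letters first, then diacritics, then case.
my $full = Unicode::Collate->new;

# level => 1 compares primary weights only, so diacritics are ignored.
my $primary_only = Unicode::Collate->new(level => 1);

my $full_result    = $full->cmp('ma', 'má');            # non-zero: the accent breaks the tie
my $primary_result = $primary_only->cmp('ma', 'má');    # 0: treated as equal

print "all levels:   $full_result\n";
print "primary only: $primary_result\n";
```

If the two results differ like this on your machine, the collator itself does see the tone marks, and the problem is more likely in how the input is read or decoded.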
Re^3: Sorting Vietnamese text by farang (Chaplain) on Dec 22, 2013 at 23:48 UTC
"(1) that's still not the correct sort order (á should come before à)"

I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.

"(2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks."

Perhaps it has to do with normalization. I still get the same sort order when using it.

    #!/usr/bin/env perl
    use v5.14;
    use warnings;
    use utf8::all;
    use Unicode::Collate::Locale;
    use Unicode::Normalize;

    my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
    my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
    @unsorted = map { NFD($_) } @unsorted;
    my @sorted = $Collator->sort(@unsorted);
    say NFC("unsorted\n@unsorted");
    say NFC("sorted\n@sorted");
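To see what normalization changes, here is a small sketch, assuming only the core modules Unicode::Normalize and Unicode::Collate (the variable names are illustrative). It shows that the composed and decomposed forms of the same word are different strings to `eq`, while a collator, which normalizes its input by default, treats them as equal.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

my $word = 'ấm';
my $nfc  = NFC($word);    # precomposed: "ấ" is a single code point
my $nfd  = NFD($word);    # decomposed: "a" + circumflex + acute

my $nfc_length = length $nfc;
my $nfd_length = length $nfd;
my $plain_same = $nfc eq $nfd ? 1 : 0;

# Unicode::Collate normalizes strings before comparing them.
my $collator      = Unicode::Collate->new;
my $collated_same = $collator->eq($nfc, $nfd) ? 1 : 0;

print "NFC length: $nfc_length, NFD length: $nfd_length\n";
print "plain eq:    $plain_same\n";
print "collator eq: $collated_same\n";
```

So mixed normalization forms in the input should not by themselves change the collator's order, but they will affect anything that compares or de-duplicates the strings with plain `eq` or `cmp`.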
Re^4: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 23, 2013 at 02:37 UTC
Re^5: Sorting Vietnamese text by farang (Chaplain) on Dec 23, 2013 at 04:28 UTC
Re^5: Sorting Vietnamese text by Jim (Curate) on Dec 23, 2013 at 03:35 UTC
Re^4: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 23, 2013 at 00:08 UTC
Re: Sorting Vietnamese text by Atacama (Sexton) on Dec 22, 2013 at 18:48 UTC
Given the file is large, you can try using Sort::External supplied with the following sortsub: `sub { $index{$Sort::External::a} <=> $index{$Sort::External::b} }` where %index is a hash with letters from the Vietnamese alphabet as its keys and their corresponding positions in the alphabet as its values.
Re^2: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 22, 2013 at 20:09 UTC
Thanks - could you post a small code example? I'm not quite sure how to create and reference the index.
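A minimal sketch of such an index, assuming the 29-letter modern Vietnamese alphabet order and ignoring tone marks entirely. `base_letters` and `by_alphabet` are hypothetical helper names, not part of Sort::External, and the index is looked up letter by letter rather than once per whole line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Assumed alphabet order; each letter maps to its 1-based position.
my @alphabet = qw(a ă â b c d đ e ê g h i k l m n o ô ơ p q r s t u ư v x y);
my %index;
@index{@alphabet} = (1 .. @alphabet);

# Strip the five tone marks but keep the breve, circumflex and horn,
# so that e.g. "ấ" becomes "â".
sub base_letters {
    my ($word) = @_;
    my $decomposed = NFD(lc $word);
    $decomposed =~ tr/\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}//d;
    return NFC($decomposed);
}

# Compare two words position by position using %index.
# Characters missing from %index (spaces, f, j, w, z) rank as 0.
sub by_alphabet {
    my ($left, $right) = @_;
    my @l = map { $index{$_} // 0 } split //, base_letters($left);
    my @r = map { $index{$_} // 0 } split //, base_letters($right);
    while (@l && @r) {
        my $order = shift(@l) <=> shift(@r);
        return $order if $order;
    }
    return @l <=> @r;    # the shorter word sorts first
}

my @sorted = sort { by_alphabet($a, $b) } ('đa', 'ăn', 'da', 'âm', 'an');
print "@sorted\n";
```

Because the tone marks are stripped, words that differ only in tone compare as equal here; a second key is needed to order those.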
Re^3: Sorting Vietnamese text by Atacama (Sexton) on Dec 23, 2013 at 00:35 UTC
moritz basically implemented a similar idea (that I formulated incorrectly in my message, btw). I think it will be better to use the Unicode collation module advised above.

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Sort::External;
    use Unicode::Collate::Locale;

    my $in  = shift // 'large-unsorted.txt';
    my $out = shift // 'sorted.txt';

    my $comparator = Unicode::Collate::Locale->new(locale =>'vi');
    my $sorter = Sort::External->new (
        sortsub => sub { $comparator->cmp($Sort::External::a, $Sort::External::b) }
    );

    open my $unsorted, '<', $in or die $!;
    $sorter->feed($_) while <$unsorted>;
    $sorter->finish(outfile => $out);
Re: Sorting Vietnamese text by karlgoethebier (Abbot) on Dec 22, 2013 at 18:20 UTC
"...Vietnamese text file that I would like to sort..." Perhaps you can show (something from) the text file as well as an intuitive example what you expect/like to do...? Regards, Karl «The Crux of the Biscuit is the Apostrophe»	[reply]
Re^2: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 22, 2013 at 20:03 UTC
Here's an example of a short list of words and definitions that I want to sort in the order described above

    ầm : loud, noisy
    ãm : to carry in the arms
    ấm chè : teapot
    ám số : password, code

should be

    ám số : password, code
    ấm chè : teapot
    ầm : loud, noisy
    ãm : to carry in the arms
Re^3: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 22, 2013 at 20:11 UTC
Correction: should be

    ám số : password, code
    ãm : to carry in the arms
    ấm chè : teapot
    ầm : loud, noisy
Re^4: Sorting Vietnamese text by moritz (Cardinal) on Dec 22, 2013 at 21:04 UTC
Re^5: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 23, 2013 at 16:05 UTC
(accidental duplicate, please reap) by moritz (Cardinal) on Dec 22, 2013 at 21:03 UTC
Re: Sorting Vietnamese text by taint (Chaplain) on Dec 22, 2013 at 20:49 UTC
Greetings, pdenisowski. I'm also working on something somewhat related (How best to avoid mojibake, when attempting to automatically convert documents to utf-8?). I mention it because the Perl module Unicode::Tussle was suggested there; it has a couple of utilities that you might find helpful with this.

¡λɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ƨᴉɥʇ ədoH

--Chris
Yes. What say about me, is true.
Re: Sorting Vietnamese text by Anonymous Monk on Dec 23, 2013 at 06:20 UTC
Note: take what I say here with a grain of salt since I know no Vietnamese.

Here's the Vietnamese alphabet sort order. And here's how to read that chart:

- First column (darkest colour) has the letter in question
- The other columns have the glyphs that sort under that letter
- Therefore, ấ and Ầ and ậ sort under â (will be found in the dictionary under the heading 'â')
- In the case where the two words are otherwise 100% equivalent (except for the diacritics), sort in the left-to-right order given in the chart.

Here's how I handled Japanese sorting (hiragana only) based on a similar chart for Japanese.
Re^2: Sorting Vietnamese text by Anonymous Monk on Dec 23, 2013 at 07:31 UTC
And here's what I came up with:

    use utf8;                              # the tr lists below are literal UTF-8 characters
    binmode STDOUT, ':encoding(UTF-8)';    # print the words back out as UTF-8

    # Map every letter to a one-character code that sorts asciibetically.
    sub make_sort_order {
        my $str = shift;
        $str =~ tr(aáàảãạăaáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz)
                  (00000011111111111112222223456777777888888abcddddddefghijjjjjjkkkkkkllllllmnopqrrrrrrsssssstuvwwwwwwx)d;
        return $str;
    }

    my @words = ('ầm', 'ãm', 'ấm chè', 'ám số');
    print $_->[1], "\n"
        for sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] }
            map  { [ make_sort_order($_), $_ ] } @words;

It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical); it should not be difficult to add once someone figures out a suitable transliteration that sorts asciibetically.
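The missing secondary sort could be sketched as a per-letter tone key. This is only a sketch under a stated assumption: the tone order (unmarked, acute, grave, hook above, tilde, dot below) is taken from the order in which the letters appear in the tr list above, and `tone_key` is a hypothetical helper name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Assumed tone order: unmarked (0), acute, grave, hook above, tilde, dot below.
my %tone_rank = (
    "\x{0301}" => 1,    # acute
    "\x{0300}" => 2,    # grave
    "\x{0309}" => 3,    # hook above
    "\x{0303}" => 4,    # tilde
    "\x{0323}" => 5,    # dot below
);

# Build a string with one digit per letter describing its tone mark.
sub tone_key {
    my ($word) = @_;
    my $key = '';
    for my $letter (split //, NFC($word)) {
        my ($tone) = NFD($letter) =~ /([\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}])/;
        $key .= defined $tone ? $tone_rank{$tone} : 0;
    }
    return $key;
}

# These words are identical once the tone marks are stripped.
my @words   = ('mạ', 'mã', 'ma', 'mả', 'má', 'mà');
my @by_tone = sort { tone_key($a) cmp tone_key($b) } @words;
print "@by_tone\n";
```

In the sort block above, `tone_key($a->[1]) cmp tone_key($b->[1])` could take the place of the plain `$a->[1] cmp $b->[1]` tie-break, so that words with the same primary key are ordered by tone rather than by code point.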
Re^3: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 23, 2013 at 15:10 UTC
"It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical);"

(laughs) That's hardly an "edge case" in Vietnamese - there are thousands of minimal pairs where the only difference between the words is the diacritical marks. While it's possible to read and understand Vietnamese typed in (7-bit) ASCII without too much ambiguity (i.e. you can figure out what word is meant from the context), this obviously wouldn't work for a dictionary.

The other issue is that the words in the dictionary need to be sorted in the "correct" order for me to detect duplicates, etc.

I'll try out your suggestion later today - thanks again!
Re^4: Sorting Vietnamese text by Anonymous Monk on Dec 23, 2013 at 18:47 UTC
Re^4: Sorting Vietnamese text by Anonymous Monk on Dec 23, 2013 at 19:04 UTC
Re: Sorting Vietnamese text by pdenisowski (Acolyte) on Dec 27, 2013 at 00:50 UTC
Thanks again to all for the replies -- they've been very helpful. My Vietnamese-English dictionary project went live on Christmas, here's a link: http://www.denisowski.org/Vietnamese/Vietnamese.html This is the 52,000+ line UTF8 text file I'm trying to sort the way I described. :) Thanks again! Paul
Re^2: Sorting Vietnamese text by Anonymous Monk on Jan 17, 2014 at 07:36 UTC
I am trying to do this very thing with a Vietnamese text. What finally worked?

