Re: Sorting Vietnamese text
by farang (Chaplain) on Dec 22, 2013 at 18:27 UTC
|
Update: Sorry, there are some errors in the code below. In particular, the constructor for the collator should be:
my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
Then the sort method will work as intended. Try it with actual Vietnamese words.
Unicode::Collate::Locale ought to help. Example code below not using code tags due to display bug with utf8 text.
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;
my $Collator = Unicode::Collate::Locale->new('vi');
my @unsorted = qw(
a..7
ả..3
à..9
ạ..5
ã..4
á..1
ă..6
à..2
á..8
);
my @sorted = $Collator->sort(@unsorted);
say "unsorted\n@unsorted";
say "sorted\n@sorted";
Output is as follows.
unsorted
a..7 ả..3 à..9 ạ..5 ã..4 á..1 ă..6 à..2 á..8
sorted
á..1 à..2 ả..3 ã..4 ạ..5 ă..6 a..7 á..8 à..9
Update #2: The code below actually is a correct example.
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;
my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
my @sorted = $Collator->sort(@unsorted);
say "unsorted\n@unsorted";
say "sorted\n@sorted";
Giving the output:
unsorted
á ả ã à ậ ă ạ ẫ a ẩ
sorted
a à ả ã á ạ ă ẩ ẫ ậ
| [reply] [d/l] |
|
Giving the output:
unsorted
á ả ã à ậ ă ạ ẫ a ẩ
sorted
a à ả ã á ạ ă ẩ ẫ ậ
Thanks, but there are two issues: (1) that's still not the correct sort order (á should come before à), and (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
| [reply] |
|
(1) that's still not the correct sort order (á should come before à)
I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
(2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
Perhaps it has to do with normalization. I still get the same sort order when using it.
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8::all;
use Unicode::Collate::Locale;
use Unicode::Normalize;
my $Collator = Unicode::Collate::Locale->new(locale =>'vi');
my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ' );
@unsorted = map { NFD($_) } @unsorted;
my @sorted = $Collator->sort(@unsorted);
say NFC("unsorted\n@unsorted");
say NFC("sorted\n@sorted");
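One way a collator can appear to ignore tone marks is when it compares at primary strength only, where accents do not count as differences. Whether that is what happens in the setup described above is only a guess. The sketch below contrasts a level-1 collator with a default one, using the `level` option and `eq` method of Unicode::Collate; the word pairs are made up for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Level 1 compares base letters only; the default level also looks at accents.
my $primary = Unicode::Collate->new(level => 1);
my $full    = Unicode::Collate->new;

for my $pair (['ma', 'má'], ['má', 'mà']) {
    my ($x, $y) = @$pair;
    printf "%s vs %s: level 1 says %s, default level says %s\n", $x, $y,
        ($primary->eq($x, $y) ? 'equal' : 'different'),
        ($full->eq($x, $y)    ? 'equal' : 'different');
}
```

If a sort treats such pairs as equal, their relative order in the output depends only on input order, which could also explain getting different "sorted" lists on different machines.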
| [reply] |
Re: Sorting Vietnamese text
by Atacama (Sexton) on Dec 22, 2013 at 18:48 UTC
|
Given the file is large, you can try using Sort::External supplied with the following sortsub: sub { $index{$Sort::External::a} <=> $index{$Sort::External::b} }
Where %index is a hash with letters from the Vietnamese alphabet as its keys and their corresponding positions in the alphabet as its values. | [reply] [d/l] |
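A minimal sketch of the %index idea, assuming lowercase input and the 29-letter alphabet order listed in the code. The helper names `strip_tones` and `index_key` are made up for illustration. Because %index holds single letters, a whole word cannot be looked up in it directly, so each word is first turned into a key string with one character per letter. An in-memory sort is used here for brevity; the same keys could be compared inside a Sort::External sortsub.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Assumed alphabet order (29 letters); f, j, w and z are not included.
my @alphabet = qw(a ă â b c d đ e ê g h i k l m n o ô ơ p q r s t u ư v x y);
my %index;
@index{ map { NFC($_) } @alphabet } = (1 .. @alphabet);

# Remove the five tone marks but keep the breve, circumflex and horn.
sub strip_tones {
    my $s = NFD(shift);
    $s =~ tr/\x{300}\x{301}\x{303}\x{309}\x{323}//d;
    return NFC($s);
}

# One key character per letter; anything outside the alphabet maps to 0.
sub index_key {
    my $word = shift;
    return join '', map { chr($index{$_} // 0) } split //, strip_tones($word);
}

my @unsorted = qw(đá da ăn an ân);
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ index_key($_), $_ ] } @unsorted;
print "@sorted\n";
```

Tone marks are deliberately dropped from the key, so words that differ only in tone compare as equal here and would need a separate tie-break.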
|
Thanks - could you post a small code example? I'm not quite sure how to create and reference the index.
| [reply] |
|
moritz basically implemented a similar idea (which I formulated incorrectly in my message, btw). I think it will be better to use the Unicode collation module (Unicode::Collate::Locale) advised above.
#!/usr/bin/env perl
use warnings;
use strict;
use Sort::External;
use Unicode::Collate::Locale;
my $in = shift // 'large-unsorted.txt';
my $out = shift // 'sorted.txt';
my $comparator = Unicode::Collate::Locale->new(locale =>'vi');
# Sort::External passes the two items being compared in its own package variables.
my $sorter = Sort::External->new(
    sortsub => sub { $comparator->cmp($Sort::External::a, $Sort::External::b) },
);
open my $unsorted, '<', $in or die $!;
$sorter->feed($_) while <$unsorted>;
$sorter->finish(outfile => $out);
| [reply] [d/l] |
Re: Sorting Vietnamese text
by karlgoethebier (Abbot) on Dec 22, 2013 at 18:20 UTC
|
"...Vietnamese text file that I would like to sort..."
Perhaps you can show (something from) the text file as well as an intuitive example what you expect/like to do...?
Regards, Karl
«The Crux of the Biscuit is the Apostrophe»
| [reply] |
Re: Sorting Vietnamese text
by taint (Chaplain) on Dec 22, 2013 at 20:49 UTC
|
| [reply] |
Re: Sorting Vietnamese text
by Anonymous Monk on Dec 23, 2013 at 06:20 UTC
|
Note: take what I say here with a grain of salt since I know no Vietnamese.
Here's the Vietnamese alphabet sort order. And here's how to read that chart:
- First column (darkest colour) has the letter in question
- The other columns have the glyphs that sort under that letter
- Therefore, ấ and Ầ and ậ sort under â (will be found in the dictionary under the heading 'â')
- In the case where the two words are otherwise 100% equivalent (except for the diacritics), sort in the left-to-right order given in the chart.
Here's how I handled Japanese sorting (hiragana only) based on a similar chart for Japanese:
| [reply] |
|
And here's what I came up with:
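use strict;
use warnings;
use utf8;   # the source contains literal Vietnamese characters
binmode STDOUT, ':encoding(UTF-8)';
# Map each letter to a one-character key that sorts asciibetically in
# alphabet order; all tone variants of a letter share the same key.
sub make_sort_order {
my $str = shift;
$str =~
tr(aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz)
(0000001111112222223456777777888888abcddddddefghijjjjjjkkkkkkllllllmnopqrrrrrrsssssstuvwwwwwwx)d;
return $str;
}
my @words = ('ầm', 'ãm', 'ấm chè', 'ám số');
# Sort by the key first, then fall back to the raw word as a stopgap tie-break.
print $_->[1], "\n" for
sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] }
map { [ make_sort_order($_), $_ ] } @words;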
It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical); it should not be difficult to add once someone figures out a suitable transliteration that sorts asciibetically. | [reply] |
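One possible shape for that secondary key, as a rough sketch rather than a tested solution: decompose each letter with NFD, pick out its tone mark, and replace it with a rank digit. The ranking used here (no tone, then á à ả ã ạ) is an assumption taken from the order of the letters in the tr list above, and `strip_tones` and `tone_key` are hypothetical helper names. The primary key in this sketch is just the tone-stripped word, a stand-in for make_sort_order.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Assumed tone order: none, acute, grave, hook above, tilde, dot below.
my %tone_rank = (
    "\x{301}" => 1, "\x{300}" => 2, "\x{309}" => 3,
    "\x{303}" => 4, "\x{323}" => 5,
);

# Stand-in primary key: the word with its tone marks removed.
sub strip_tones {
    my $s = NFD(shift);
    $s =~ tr/\x{300}\x{301}\x{303}\x{309}\x{323}//d;
    return NFC($s);
}

# Secondary key: one digit per letter, 0 when the letter carries no tone.
sub tone_key {
    my $word = shift;
    my $key  = '';
    for my $char (split //, NFC($word)) {
        my ($tone) = NFD($char) =~ /([\x{300}\x{301}\x{303}\x{309}\x{323}])/;
        $key .= defined $tone ? $tone_rank{$tone} : 0;
    }
    return $key;
}

my @unsorted = ('mạ', 'mà', 'ma', 'má', 'mã', 'mả');
my @sorted = map  { $_->[2] }
             sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] }
             map  { [ strip_tones($_), tone_key($_), $_ ] } @unsorted;
print "@sorted\n";
```

Changing the numbers in %tone_rank is enough to switch to a different tone order, for example one that puts à before á.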
|
It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical);
(laughs) That's hardly an "edge case" in Vietnamese - there are thousands of minimal pairs where the only difference between the words is the diacritical marks. While it's possible to read and understand Vietnamese typed in (7-bit) ASCII without too much ambiguity (i.e. you can figure out what word is meant from the context), this obviously wouldn't work for a dictionary.
The other issue is that the words in the dictionary need to be sorted in the "correct" order for me to detect duplicates, etc.
I'll try out your suggestion later today - thanks again! | [reply] |
|
|
Re: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 27, 2013 at 00:50 UTC
|
Thanks again to all for the replies -- they've been very helpful.
My Vietnamese-English dictionary project went live on Christmas; here's a link:
http://www.denisowski.org/Vietnamese/Vietnamese.html
This is the 52,000+ line UTF8 text file I'm trying to sort the way I described. :)
Thanks again!
Paul | [reply] |
|
I am trying to do this very thing with a Vietnamese text. What finally worked?
| [reply] |