PerlMonks  

Re^2: Sorting Vietnamese text

by pdenisowski (Acolyte)
on Dec 22, 2013 at 20:07 UTC


in reply to Re: Sorting Vietnamese text
in thread Sorting Vietnamese text

Giving the output:

unsorted á ả ã à ậ ă ạ ẫ a ẩ

sorted a à ả ã á ạ ă ẩ ẫ ậ

Thanks, but there are two issues: (1) that's still not the correct sort order (á should come before à), and (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.

Re^3: Sorting Vietnamese text
by farang (Chaplain) on Dec 22, 2013 at 23:48 UTC

    (1) that's still not the correct sort order (á should come before à)
    I've no idea, but this page indicates the opposite. You may have to create a custom sort to override the default if desired.
    (2) I actually get a different "sorted" list when I run the same exact code. This is the problem that I have - it seems the sort algorithms ignore the tone marks.
    Perhaps it has to do with normalization. I still get the same sort order when using it.
    #!/usr/bin/env perl
    use v5.14;
    use warnings;
    use utf8::all;    # CPAN module: UTF-8 for source, I/O and @ARGV
    use Unicode::Collate::Locale;
    use Unicode::Normalize;

    # Collator using the Vietnamese ('vi') locale tailoring
    my $Collator = Unicode::Collate::Locale->new(locale => 'vi');
    my @unsorted = ('á', 'ả', 'ã', 'à', 'ậ', 'ă', 'ạ', 'ẫ', 'a', 'ẩ');

    # Normalize every string to NFD before sorting
    @unsorted = map { NFD($_) } @unsorted;
    my @sorted = $Collator->sort(@unsorted);

    # Recompose to NFC for display
    say NFC("unsorted\n@unsorted");
    say NFC("sorted\n@sorted");
    

      Sorry, it still doesn't work -- see below. For example, I would expect "ỳ " (note the trailing space) to come before "ỳ ạch".

      This is what I've been struggling with for a LONG time :)

      unsorted
       ỷ : (1) to be fat (said of a pig); (2) to depend on
       ỳ : inertia, state of inactivity, stay out, inert, sluggish
       ỳ ạch : to toil, labor with difficulty
       ỷ eo : reproach someone with something
       ỷ lại : to depend, rely on others
       ỷ thế : count on one’s power, one’s position, one’s influence
       yêu nhau : to love each other, be in love
       yêu quí : precious, valuable
      

      sorted
       ỷ : (1) to be fat (said of a pig); (2) to depend on
       ỳ ạch : to toil, labor with difficulty
       ỷ eo : reproach someone with something
       yêu nhau : to love each other, be in love
       yêu quí : precious, valuable
       ỳ : inertia, state of inactivity, stay out, inert, sluggish
       ỷ lại : to depend, rely on others
       ỷ thế : count on one’s power, one’s position, one’s influence

        Okay, I also get that output when using the entire lines as written. However, cutting those lines short at or before the colon ':' gives this:
        sorted
        ỳ :
        ỷ :
        ỳ ạch :
        ỷ eo :
        yêu nhau :
        yêu quí :
        ỷ lại :
        ỷ thế :
        
        What seems to be going on is that, because of the complicated syllable-based ordering rules in Vietnamese, the English translation that follows the Vietnamese is throwing off the sort.

        I'd suggest trying to separate them into a hash if possible (split on the colon, maybe) so the sort can be based only on the Vietnamese.
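
A minimal sketch of the split-on-the-colon suggestion above, assuming every entry has the form "headword : gloss". The sample entries come from the list earlier in the thread; the variable names are only illustrative. The collator compares whole strings by base letters first and only then by tone marks, so sorting on the headword alone keeps the English gloss out of the comparison.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;                                  # string literals below are UTF-8
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

my $collator = Unicode::Collate::Locale->new(locale => 'vi');

# Sample dictionary lines (shortened from the thread)
my @entries = (
    'ỷ lại : to depend, rely on others',
    'ỳ ạch : to toil, labor with difficulty',
    'yêu quí : precious, valuable',
    'ỳ : inertia, state of inactivity',
);

# Split each line once, at the first colon, and keep the headword
# next to the untouched original line.
my @keyed = map {
    my ($headword) = split /\s*:\s*/, $_, 2;
    [ $headword, $_ ];
} @entries;

# Compare headwords only, then hand back the original lines.
my @sorted = map  { $_->[1] }
             sort { $collator->cmp($a->[0], $b->[0]) } @keyed;

say for @sorted;
```

The limit of 2 on split matters here: glosses such as "(1) to be fat ...; (2) to depend on" may contain further colons, and only the first one separates the headword.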

        Are you absolutely certain your text is Unicode (UTF-8)? It's not TCVN (CP1258, ISO-2022-VN or EUC-VN), is it?

        I'm sorry if this is an "Is the power cord plugged in?" kind of question, but it just doesn't make sense that you're getting different output than farang got.

        Jim
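
One hedged way to answer the encoding question is to try a strict UTF-8 decode of the raw bytes and see whether it fails. This is only a sketch: is_valid_utf8 is a hypothetical helper name, and the two byte strings are made-up samples rather than data from the thread.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use Encode qw(decode FB_CROAK);

# Returns 1 if $bytes is well-formed UTF-8, 0 otherwise.
# (is_valid_utf8 is a hypothetical helper, not part of Encode.)
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode may modify its input
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

my $utf8_bytes   = "\xE1\xBB\xB3";    # U+1EF3 (y with grave) encoded as UTF-8
my $legacy_bytes = "\xE0 ";           # a lone high byte, as a single-byte encoding might produce

say is_valid_utf8($utf8_bytes)   ? 'valid UTF-8' : 'not UTF-8';
say is_valid_utf8($legacy_bytes) ? 'valid UTF-8' : 'not UTF-8';
```

To check a real file, read it through the ':raw' layer and pass the contents to the helper. A passing check only shows the bytes are well-formed UTF-8; it says nothing about which normalization form the text uses.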

      Thanks! I'll give it a try.

      When I learned Vietnamese, the order of the tones in every dictionary (all my older ones) was

      a á à ả ã ạ

      Some of my newer dictionaries use the order you mention above, but after twenty years of doing it one way, it's a little hard to change :)

      There are also some differences in how initial consonant clusters are handled: does "thu" come before "tu"? (In my older dictionaries "th" and "tr" are considered single "letters", kind of like c and ch in Spanish.) I figured I would let this slide for now ...

      Thanks again!
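
If the older tone order (a á à ả ã ạ) is the one wanted, a custom sort could be sketched roughly like this: strip the tone marks, compare the toneless text with the 'vi' collator, and break ties with a hand-made tone ranking. The %tone_rank table and the tone_key helper are illustrative names, not part of any module. This simplification compares the whole toneless string before looking at any tone, so treat it as a starting point rather than a full dictionary ordering.

```perl
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;                                  # string literals below are UTF-8
use Unicode::Normalize qw(NFD NFC);
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Assumed ranking for the older dictionary order: a á à ả ã ạ
my %tone_rank = (
    "\x{0301}" => 1,    # combining acute      (á)
    "\x{0300}" => 2,    # combining grave      (à)
    "\x{0309}" => 3,    # combining hook above (ả)
    "\x{0303}" => 4,    # combining tilde      (ã)
    "\x{0323}" => 5,    # combining dot below  (ạ)
);
my $tone_marks = qr/[\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}]/;

my $collator = Unicode::Collate::Locale->new(locale => 'vi');

# Returns (toneless text, one tone-rank digit per syllable; 0 = no tone mark)
sub tone_key {
    my ($text) = @_;
    my $nfd = NFD($text);
    my @ranks;
    for my $syllable (split ' ', $nfd) {
        my ($mark) = $syllable =~ /($tone_marks)/;
        push @ranks, defined $mark ? $tone_rank{$mark} : 0;
    }
    (my $toneless = $nfd) =~ s/$tone_marks//g;
    return (NFC($toneless), join('', @ranks));
}

my @words  = ('á', 'ả', 'ã', 'à', 'ạ', 'a');

# Toneless text decides first; the tone digits only break ties.
my @sorted = map  { $_->[0] }
             sort { $collator->cmp($a->[1], $b->[1]) || $a->[2] cmp $b->[2] }
             map  { [ $_, tone_key($_) ] } @words;

say "@sorted";    # a á à ả ã ạ
```

Letters such as ă and â keep their breve or circumflex after the tone mark is stripped, so the collator still treats them as distinct from plain a.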
