pijush has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

In my program I want to compare two Unicode strings. With ASCII strings this is easy, but when I try to compare two Unicode strings I run into a few issues.
Unicode defines two kinds of equivalence between characters: canonical equivalence and compatibility equivalence.
I want to cover both kinds of equivalence when I compare two Unicode strings. Can anybody please tell me whether I should use the Unicode::Normalize or the Unicode::Collate module to test two strings for equivalence?
Thanks in advance.
Regards
-Pijush

Re: Help needed to compare two unicode strings!!!
by hv (Prior) on Aug 30, 2004 at 16:49 UTC

    I'm no expert on Unicode, but I think this is precisely the problem that Unicode::Normalize sets out to solve.

    When you say you "want to cover both equivalence(s)", do you mean you want all such equivalences to compare equal? If so, based on a quick read of the slightly opaque documentation, I would guess you want to convert each string into either NFC or NFKC form. As far as I can tell, NFC is "canonical decomposition followed by canonical composition", whereas NFKC is "compatibility decomposition followed by canonical composition", so NFKC is the form that also folds compatibility variants together.

    If you're not certain, I'd suggest devising some tests that characterise the equivalences you're trying to allow, and running them through something like:

    use Unicode::Normalize;

    # Report which normalization forms make the two strings compare equal.
    sub ucompare {
        my ($s, $t) = @_;
        print "NFC-equivalent\n"  if NFC($s)  eq NFC($t);
        print "NFKC-equivalent\n" if NFKC($s) eq NFKC($t);
    }
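A few test pairs of the kind suggested above might look like the following. This is only a sketch: the `classify` helper and the choice of test cases are illustrative assumptions, and the strings are written with `\x{...}` escapes so the result does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helper: name the weakest normalization form under which
# the two strings compare equal.
sub classify {
    my ($s, $t) = @_;
    return 'NFC-equivalent'  if NFC($s)  eq NFC($t);
    return 'NFKC-equivalent' if NFKC($s) eq NFKC($t);
    return 'different';
}

# Each case: label, first string, second string.
my @cases = (
    [ 'precomposed vs. decomposed e-acute', "\x{00E9}", "e\x{0301}" ],
    [ 'fi ligature vs. plain fi',           "\x{FB01}", "fi"        ],
    [ 'full-width vs. half-width KA',       "\x{30AB}", "\x{FF76}"  ],
    [ 'unrelated letters',                  "a",        "b"         ],
);

for my $case (@cases) {
    my ($label, $s, $t) = @$case;
    printf "%-38s %s\n", $label, classify($s, $t);
}
```

The first pair differs only canonically, so NFC already makes it equal. The ligature and the katakana pair differ only by compatibility, so they need NFKC.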

    Hope this helps,

    Hugo

      Thanks for your reply.
      Actually, I want to make Unicode string comparison robust, so I need to check for both canonical equivalence and compatibility equivalence. Here is one example where the two kinds of equivalence give different results:
      Half-width and full-width katakana characters are compatibility equivalent, but they are not canonically equivalent.
      So, is it fine to test for canonical equivalence first and then for compatibility equivalence?
      TIA
      -Pijush
        "So, is it fine to test for canonical equivalence first and then for compatibility equivalence?"

        This is all about purpose. What is your purpose? You said that you wanted to make the comparison robust, but what is a "robust comparison"? This is not a concept defined in the Unicode standard; it is a term you coined for your own idea, and you have not clearly explained it.

        In general, canonical equivalence is the basic equivalence, and it is most likely good enough for you.

        Testing both kinds of equivalence does not really make the comparison robust. To me, "robust" means not exposed to errors, or exposed to fewer errors, which does not make much sense here: each kind of equivalence has its own purpose, and neither of them produces an error. To say it once more, it is about your purpose, about the kind of equivalence you want.
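To make the "kind of equivalence" an explicit choice, one option is a comparison routine that takes the desired level as a parameter. The sketch below is an assumption for illustration, not something from Unicode::Normalize itself: the name `unicode_equal` and the `$level` argument are made up here. Note that canonically equivalent strings are always compatibility equivalent as well, so testing canonical equivalence first and compatibility equivalence second gives the same yes/no answer as testing compatibility equivalence alone.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

# Hypothetical helper: compare two character strings at the requested
# level, 'canonical' (the default) or 'compatibility'.
sub unicode_equal {
    my ($s, $t, $level) = @_;
    $level = 'canonical' unless defined $level;

    # Compatibility decomposition folds width variants, ligatures, etc.
    return NFKD($s) eq NFKD($t) ? 1 : 0 if $level eq 'compatibility';

    # Canonical decomposition only unifies different encodings of the
    # same abstract character, such as precomposed vs. combining forms.
    return NFD($s) eq NFD($t) ? 1 : 0;
}

# Full-width KA (U+30AB) vs. half-width KA (U+FF76).
print "canonical:     ", unicode_equal("\x{30AB}", "\x{FF76}"), "\n";
print "compatibility: ", unicode_equal("\x{30AB}", "\x{FF76}", 'compatibility'), "\n";
```

Whether the compatibility level is appropriate depends on whether the application really wants to treat, say, half-width and full-width katakana as the same text.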

Re: Help needed to compare two unicode strings!!!
by iburrell (Chaplain) on Aug 30, 2004 at 16:46 UTC
    It looks like Unicode::Normalize is the one to use. It converts between the various normalization forms. For comparing strings, the best normalization form is probably NFC.
Re: Help needed to compare two unicode strings!!!
by pijush (Scribe) on Sep 01, 2004 at 06:53 UTC
    Hi Monks!!

    I have tried to run the following script to compare two UTF-8 encoded strings.

    use Unicode::Normalize;

    my $string1 = 'トウキョウ'; # Full-width katakana
    my $string2 = 'ﾄｳｷｮｳ';      # Half-width katakana

    print "NFC-equivalent\n"  if NFC($string1)  eq NFC($string2);
    print "NFD-equivalent\n"  if NFD($string1)  eq NFD($string2);
    print "NFKD-equivalent\n" if NFKD($string1) eq NFKD($string2);
    print "NFKC-equivalent\n" if NFKC($string1) eq NFKC($string2);
    print "End\n";
    Please select the Japanese Shift-JIS encoding in your browser to see the strings in their correct form.
    Although these two strings have the same meaning (Toukyou), every comparison fails to detect that they are the same.
    Can anybody please tell me how I can compare these two strings?
    Thanks in advance.
    Regards
    -Pijush
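One possible cause, offered only as a guess: Unicode::Normalize works on Perl character strings, and a script without `use utf8` (or data that has not been decoded) hands it raw bytes instead. The sketch below assumes the two strings arrive as UTF-8 bytes and simulates that with Encode; the variable names are made up for illustration. For Shift-JIS input the same idea would apply with a different encoding name passed to `decode`.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Unicode::Normalize qw(NFC NFKC);

# Simulate undecoded input: the UTF-8 bytes of full-width and half-width
# "Toukyou", as they would appear when read from a file without decoding.
my $bytes_full = encode('UTF-8', "\x{30C8}\x{30A6}\x{30AD}\x{30E7}\x{30A6}");
my $bytes_half = encode('UTF-8', "\x{FF84}\x{FF73}\x{FF77}\x{FF6E}\x{FF73}");

# Normalizing the raw bytes treats every byte as a separate character,
# so the katakana are never recognised.
my $raw_equal = NFKC($bytes_full) eq NFKC($bytes_half) ? 1 : 0;

# Decode to character strings first, then normalize.
my $full = decode('UTF-8', $bytes_full);
my $half = decode('UTF-8', $bytes_half);
my $nfc_equal  = NFC($full)  eq NFC($half)  ? 1 : 0;
my $nfkc_equal = NFKC($full) eq NFKC($half) ? 1 : 0;

print "raw bytes, NFKC: $raw_equal\n";
print "decoded,   NFC:  $nfc_equal\n";
print "decoded,   NFKC: $nfkc_equal\n";
```

Under these assumptions only the decoded NFKC (or NFKD) comparison should succeed, because half-width and full-width katakana are compatibility equivalent but not canonically equivalent.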