Re: Help needed to compare two unicode strings!!!

I'm no expert on Unicode, but I think this is precisely the problem that Unicode::Normalize sets out to solve.

When you say you "want to cover both equivalence(s)", do you mean you want all such equivalences to compare equal? If so, based on a quick read of the slightly opaque documentation, I would guess you want to convert each string into either NFC or NFKC form. The difference is in the decomposition step: NFC is canonical decomposition followed by canonical composition, while NFKC is compatibility decomposition followed by canonical composition. So NFKC also folds together compatibility variants that NFC keeps distinct.

If you're not certain, I'd suggest devising some tests that characterise the equivalences you're trying to allow, and running them through something like:

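  use Unicode::Normalize;   # exports NFC and NFKC by default

  # Report which normalization forms make $s and $t compare equal.
  sub ucompare {
    my($s, $t) = @_;
    print "NFC-equivalent\n" if NFC($s) eq NFC($t);
    print "NFKC-equivalent\n" if NFKC($s) eq NFKC($t);
  }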
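As a rough sketch of the kind of tests I mean, here is a variant that returns the two results instead of printing them, run over a few pairs. The helper name `equivalences` and the choice of test pairs are my own, made up for illustration; only NFC and NFKC come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

# Hypothetical helper: report which equivalences hold for a pair of strings.
sub equivalences {
    my ($s, $t) = @_;
    return {
        canonical     => (NFC($s)  eq NFC($t))  ? 1 : 0,
        compatibility => (NFKC($s) eq NFKC($t)) ? 1 : 0,
    };
}

# Illustrative test pairs (not taken from the original question).
my @pairs = (
    [ 'precomposed vs. decomposed e-acute',    "\x{E9}",   "e\x{301}" ],
    [ 'fi ligature vs. plain "fi"',            "\x{FB01}", "fi"       ],
    [ 'half-width vs. full-width katakana KA', "\x{FF76}", "\x{30AB}" ],
    [ 'unrelated letters',                     "a",        "b"        ],
);

for my $pair (@pairs) {
    my ($label, $s, $t) = @$pair;
    my $r = equivalences($s, $t);
    printf "%-40s canonical=%d compatibility=%d\n",
        $label, $r->{canonical}, $r->{compatibility};
}
```

Any pair that is canonically equivalent should also come out compatibility equivalent, but not the other way round, so the second column is the looser test.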

Hope this helps,

Hugo

Re^2: Help needed to compare two unicode strings!!! by pijush (Scribe) on Aug 30, 2004 at 17:02 UTC
Thanks for your reply. Actually, I want to make Unicode string comparison robust, so I need to handle canonical equivalence as well as compatibility equivalence. Here is one example where the two differ: the half-width and full-width katakana characters are compatibility equivalent, but they are not canonically equivalent. So, is it fine to compare for canonical equivalence first and then for compatibility equivalence? TIA -Pijush
Re^3: Help needed to compare two unicode strings!!! by pg (Canon) on Aug 31, 2004 at 01:41 UTC
"So, is it fine to compare canonical equivalence first and then compatibility equivalence?" This is all about purpose. What is your purpose? You said that you wanted to make the comparison robust, but what is a "robust comparison"? This is not a concept defined in the Unicode standards, but rather a term you created to serve your own thinking, which was not clearly expressed. In general, canonical equivalence is the basic equivalence, and is most likely good enough for you. Comparing under both equivalences does not really make the comparison robust. To me, robust means not exposed to errors, or exposed to fewer errors, which does not make much sense here: each equivalence has its own purpose, and neither of them produces an error. To say it one more time, it is about your purpose, about the kind of equivalence you want.
Re^4: Help needed to compare two unicode strings!!! by pijush (Scribe) on Aug 31, 2004 at 04:37 UTC
Thanks for your opinion. I agree with you that canonical equivalence is the basic equivalence and is most likely suitable for most applications. But suppose I want to build an application that verifies user credentials against the credentials stored in a directory server (say LDAP). In that case I need to compare two strings, one supplied by the user and one stored in the directory. The string stored in the directory server is supplied by an administrator of the application software, and it is stored UTF-8 encoded. That stored string is the same every time I fetch it from the directory server, but the credentials supplied by the user may vary: sometimes the user may choose one format and at other times another. Take the katakana example I mentioned in my previous post. The administrator supplied the user credentials in half-width katakana, and the user supplied the same thing in full-width katakana. If I compare these two strings for canonical equivalence, they are different and the application will fail to identify the user. I think this is an error. Can you please tell me what I should do in this case: stick to canonical equivalence, or check compatibility equivalence as well? TIA. -Pijush
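To make the scenario concrete, here is a minimal sketch of such a check, assuming compatibility equivalence turns out to be the right policy. The helper name `credentials_match` and the sample byte strings are hypothetical, and it assumes both values arrive as UTF-8 encoded bytes that have to be decoded before normalizing.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: both arguments are UTF-8 encoded byte strings,
# one fetched from the directory and one typed by the user.
sub credentials_match {
    my ($stored_bytes, $supplied_bytes) = @_;

    # Decode bytes into characters first; normalization works on characters.
    my $stored   = decode('UTF-8', $stored_bytes);
    my $supplied = decode('UTF-8', $supplied_bytes);

    # NFKC folds compatibility variants such as half-width katakana.
    return NFKC($stored) eq NFKC($supplied) ? 1 : 0;
}

# Half-width KA (U+FF76) as stored, full-width KA (U+30AB) as typed.
my $stored   = "\xEF\xBD\xB6";   # UTF-8 bytes for U+FF76
my $supplied = "\xE3\x82\xAB";   # UTF-8 bytes for U+30AB

print credentials_match($stored, $supplied) ? "match\n" : "no match\n";
```

Swapping NFKC for NFC in the helper gives the stricter canonical-only check, under which this pair would not match.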