in reply to Re^3: Normalizing diacritics in (regex) search
in thread Normalizing diacritics in (regex) search
But jo37's approach with NFD is IMHO better because of the "dangers of pathological characters" I mentioned...
Consider U+3374 ㍴ (SQUARE BAR): NFKD will decompose it to the three letters "bar", NFD will leave it untouched. That means that after NFKD a single symbol/character "㍴" might match (case-insensitively) in "Barbra Streisand". So if eliminating diacritics is the goal, NFD is preferable.
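A minimal sketch to check this with the core module Unicode::Normalize; the variable names and the case-insensitive match are only my choices for illustration, not anything from jo37's code:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

my $square_bar = "\x{3374}";          # U+3374 SQUARE BAR

my $nfd  = NFD($square_bar);          # canonical decomposition only
my $nfkd = NFKD($square_bar);         # compatibility decomposition

my $name = "Barbra Streisand";

# Does the normalized symbol match inside the name, ignoring case?
my $nfd_hit  = $name =~ /\Q$nfd\E/i  ? 1 : 0;
my $nfkd_hit = $name =~ /\Q$nfkd\E/i ? 1 : 0;

# %vX prints the code points of a string, joined by dots
printf "NFD : U+%vX  matches name: %d\n", $nfd,  $nfd_hit;
printf "NFKD: U+%vX  matches name: %d\n", $nfkd, $nfkd_hit;
```

Only the NFKD form should report a match, which is exactly the kind of false positive I'd like to avoid.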
For completeness: there is an example in Unicode::Collate demonstrating normalized search with a (broken°) German phrase:

    use utf8;
    use Unicode::Collate;

    my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                       # (normalization => undef) is REQUIRED.
    # sample input, assumed here for illustration
    my $str = "Ich muß studieren Perl.";
    my $sub = "MüSS";
    my $match;
    if (my($pos,$len) = $Collator->index($str, $sub)) {
        $match = substr($str, $pos, $len);
    }

This yields "muß" in $match, since "muß" is primary equal to "MüSS".
Alas, I haven't "studied" this module sufficiently to tell whether this exactly matches my requirement to ignore only diacritics.
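A small probe along these lines might help to find out. It is only a sketch, assuming the default collation table that ships with Unicode::Collate; the variable names are made up for illustration:

```perl
use strict;
use warnings;
use Unicode::Collate;

# Same settings as in the example: primary strength, no normalization
my $primary = Unicode::Collate->new( normalization => undef, level => 1 );

# eq() compares two strings at the collator's level
my $ignores_accent  = $primary->eq("u",  "\x{FC}") ? 1 : 0;   # u vs. ü
my $ignores_case    = $primary->eq("u",  "U")      ? 1 : 0;   # u vs. U
my $expands_sharp_s = $primary->eq("ss", "\x{DF}") ? 1 : 0;   # ss vs. ß

print "accent ignored: $ignores_accent\n";
print "case ignored:   $ignores_case\n";
print "ss equals ß:    $expands_sharp_s\n";
```

If all three report 1, then level => 1 ignores case and treats "ß" like "ss" in addition to ignoring diacritics, which would be more than I asked for.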
Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery
°) Ha :) ... you can almost hear an English accent with this word order, OTOH I suppose it's easier to decipher for English speakers than "Ich muss Perl studieren". (Which is still slightly off, "lernen" would be better in this case)