Re: Normalizing diacritics in (regex) search

Re^2: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 03:34 UTC
Actually, I stumbled over this code from brian d foy. It shows a way (well, actually two ways) to check a character's name for the keyword `\bWITH\b`:

    use utf8;
    use v5.32;
    use open qw(:std :utf8);
    no warnings qw(experimental::uniprop_wildcards);
    use charnames qw();

    my @letters = qw(a à á â ã ä å);
    foreach my $letter ( @letters ) {
        my $name = charnames::viacode( ord $letter );
        say "$letter ($name):", $letter =~ m<\p{Name=/\bWITH\b/}> ? 'Matched' : 'Missed';
    }

In the next step I want to speed this up by preparing the mapping list for all Latin characters beforehand, so that I can use simple character classes in the regexes. I was expecting to find a ready-made function that gives me these equivalent characters right away; prop_invmap of Unicode::UCD can probably be used for this.

Cheers Rolf
(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery)
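A minimal sketch of that precomputation, not the prop_invmap route: it walks a code point range, decomposes each character with NFD from Unicode::Normalize, and groups the precomposed variants under their base letter. The function names `build_variant_map` and `class_for` and the range U+00C0..U+024F are assumptions made for illustration.

```perl
use v5.32;
use utf8;
use open qw(:std :utf8);
use Unicode::Normalize qw(NFD);

# Build a map: base letter => list of precomposed variants in the range.
sub build_variant_map {
    my ($from, $to) = @_;
    my %variants;
    for my $cp ($from .. $to) {
        my $char = chr $cp;
        # Keep only characters that decompose into exactly one
        # Latin base character followed by one or more nonspacing marks.
        next unless NFD($char) =~ /\A(\p{Latin})\p{Mn}+\z/;
        push @{ $variants{$1} }, $char;
    }
    return \%variants;
}

# Turn the map into a compiled character class for one base letter.
# All members are letters, so no escaping is needed inside the class.
sub class_for {
    my ($variants, $base) = @_;
    my $chars = join '', $base, @{ $variants->{$base} // [] };
    return qr/[$chars]/;
}

# Hypothetical range: Latin-1 Supplement through Latin Extended-B.
my $variant_map = build_variant_map(0x00C0, 0x024F);
my $a_class     = class_for($variant_map, 'a');

say 'voilà' =~ /\Avoil$a_class\z/ ? 'Matched' : 'Missed';
```

The classes are built once, so the per-match cost is that of an ordinary bracketed class. Letters without a canonical decomposition (such as ø) are not picked up by this approach.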
Re^2: Normalizing diacritics in (regex) search by ikegami (Patriarch) on Nov 25, 2025 at 18:14 UTC
Indeed, for string comparisons and substring searches, Unicode::Collate is the way to go.

    use v5.40;
    use utf8;
    use open ':std', ':encoding(UTF-8)';
    use Unicode::Collate qw( );

    sub f { $_[0] < 0 ? "lt" : $_[0] > 0 ? "gt" : "eq" }

    my $s1 = "voilà";  # Canonical spelling
    my $s2 = "voila";  # Alternative spelling

    my $collator = Unicode::Collate->new( ignore_level2 => true );
    my $cmp = $collator->cmp( $s1, $s2 );
    say "$s1 ".( f( $collator->cmp( $s1, $s2 ) ) )." $s2";

Output:

    voilà eq voila

Unfortunately for the OP, while Perl's regex engine can ignore case, it doesn't support ignoring diacritics.
Re^2: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 15:47 UTC
> Have you considered Unicode::Collate ?

I stumbled over it, but I'm not sure how to use it in this case. Do you have a short demo? (It seems to have simplified versions of m// and s///, but without full regex syntax.)

Cheers Rolf
Re^3: Normalizing diacritics in (regex) search by Anonymous Monk on Nov 26, 2025 at 19:35 UTC
Sorry for the delay in response, but there have been distractions, I had trouble finding the code, and then dithered over whether to just put it somewhere public and point to it (I decided not to).

Assuming `$in` has been properly decoded, I was proposing something like the following:

    use Unicode::Normalize qw{ NFKD };
    ...
    my $out = NFKD( $in );
    $out =~ s/ \p{NonspacingMark}+ //smxg;

Note that this does not handle anything but diacritics. The above will change 'Köln' to 'Koln', but 'Øslo' (if it were really spelled that way) remains 'Øslo', because Unicode does not consider the stroke to be a diacritic.

I think that for comparing things Unicode::Collate is actually the way to go.
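For anyone who wants to try this, here is a self-contained version of the snippet above as a sketch. The wrapper name `strip_marks` is made up for illustration, and `\p{Mn}` is used as the short form of the nonspacing-mark category.

```perl
use v5.32;
use utf8;
use open qw(:std :utf8);
use Unicode::Normalize qw(NFKD);

# Decompose (compatibility form), then drop all nonspacing marks.
# $in is assumed to be a decoded character string, not raw bytes.
sub strip_marks {
    my ($in) = @_;
    my $out = NFKD($in);
    $out =~ s/\p{Mn}+//g;
    return $out;
}

say strip_marks('Köln');    # prints "Koln"
say strip_marks('Øslo');    # prints "Øslo": U+00D8 has no decomposition
```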
Re^4: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 26, 2025 at 21:02 UTC
Thanks, very similar to jo37's solution (and not really using Unicode::Collate like you suggested° ;-)

But jo37's approach with NFD is IMHO better because of the "dangers of pathological characters" I mentioned. Consider U+3374 ㍴: NFKD will decompose it to "bar", NFD won't. That means the symbol "㍴" might match in "Barbra Streisand". So if eliminating diacritics is the goal, NFD is preferable.

°) For completeness: there is an example in the Unicode::Collate documentation demonstrating normalized search with a (broken°) German phrase. When the content of $str is "Ich muß studieren Perl." and $sub is "MüSS", you say the following

    my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
                                     # (normalization => undef) is REQUIRED.
    my $match;
    if (my($pos,$len) = $Collator->index($str, $sub)) {
        $match = substr($str, $pos, $len);
    }

and get "muß" in $match, since "muß" is primary equal to "MüSS".

Alas, I haven't "studied" this module sufficiently to tell whether this exactly matches my requirement to ignore only diacritics.

Cheers Rolf

°) Ha :) ... you can almost hear an English accent with this word order. OTOH I suppose it's easier to decipher for English speakers than "Ich muss Perl studieren". (Which is still slightly off; "lernen" would be better in this case.)
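The NFD versus NFKD difference is easy to reproduce. The following is only a sketch; the helper name `strip_marks_with` is invented for this illustration and takes the normalization function as a code reference.

```perl
use v5.32;
use utf8;
use open qw(:std :utf8);
use Unicode::Normalize qw(NFD NFKD);

# Normalize with the given function, then drop nonspacing marks (Mn).
sub strip_marks_with {
    my ($normalize, $in) = @_;
    my $out = $normalize->($in);
    $out =~ s/\p{Mn}+//g;
    return $out;
}

my $square_bar = "\x{3374}";    # U+3374 SQUARE BAR

for my $form ( [ NFD => \&NFD ], [ NFKD => \&NFKD ] ) {
    my ($label, $normalize) = @$form;
    my $needle = strip_marks_with($normalize, $square_bar);
    # Case-insensitive search, as a diacritic-insensitive search might do.
    say "$label: ",
        'Barbra Streisand' =~ /\Q$needle\E/i ? 'false positive' : 'no match';
}
```

With NFD the symbol stays a single character and finds nothing; with NFKD it turns into the three letters "bar" and hits the name.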
Re^4: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 27, 2025 at 13:16 UTC
> Note that this does not handle anything but diacritics. The above will change 'Köln' to 'Koln', but 'Øslo' (if it were really spelled that way) remains 'Øslo', because Unicode does not consider the stroke to be a diacritic.

Weirdly enough, there are Combining Diacritical Marks listed for strokes:

̷ `U+0337 ̷ 823 Combining Short Solidus Overlay`
̸ `U+0338 ̸ 824 Combining Long Solidus Overlay`

but the effects are not the same:

ŵôr̂d̂ w̷o̷r̷d̷ w̸o̸r̸d̸

While listed as diacritics, they seem to be used only for `<strike>`-like negation.

> but 'Øslo' (if it were really spelled that way)

It isn't, but you can take smørrebrød° for the `LATIN SMALL LETTER O WITH STROKE` :-)

Cheers Rolf

°) literally "smeared bread"
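Since the stroke letters have no decomposition, one possible fallback is the name-based idea from earlier in the thread: cut " WITH ..." off the character name and look up what remains. This is only a sketch; `base_by_name` is a hypothetical helper, and it relies on charnames::viacode and charnames::vianame.

```perl
use v5.32;
use utf8;
use open qw(:std :utf8);
use charnames qw();

# Map a character to its base letter via its Unicode name.
# Returns the character unchanged if the name has no " WITH " part
# or if the shortened name does not resolve to a character.
sub base_by_name {
    my ($char) = @_;
    my $name = charnames::viacode( ord $char );
    return $char unless defined $name && $name =~ /\A(.+?) WITH /;
    my $cp = charnames::vianame($1);
    return defined $cp ? chr $cp : $char;
}

say join ' ', map { base_by_name($_) } qw(ø Ø à x);
```

Be aware that this also folds letters like ɓ (LATIN SMALL LETTER B WITH HOOK) to their base letter, which may or may not be wanted.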