Normalizing diacritics in (regex) search

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Re: Normalizing diacritics in (regex) search by hippo (Archbishop) on Nov 24, 2025 at 12:22 UTC
Have you tried Text::Unidecode? 🦛
Re^2: Normalizing diacritics in (regex) search by Corion (Patriarch) on Nov 24, 2025 at 12:53 UTC
I'm also very fond of Text::Unidecode, but it does slightly more. It also transliterates some non-Latin scripts into Latin, and it transliterates German umlauts to their German equivalents, like `ä` to `ae`. But for a quick first stab, using Text::Unidecode does 90% of what one wants.
Re^3: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 03:22 UTC
> and it transliterates German umlauts to their German equivalents, like ä to ae.
Ähm ... actually, no! Its documentation is very vocal that you have to do that yourself: https://metacpan.org/pod/Text%3A%3AUnidecode#WHEN-YOU-DON'T-LIKE-WHAT-UNIDECODE-DOES
Cheers Rolf (addicted to the Perl Programming Language :) see Wikisyntax for the Monastery)
Re^2: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 04:10 UTC
As Corion said, it does a lot more, probably too much for my use case. It's implemented with many translation tables which are (manually?) maintained by the author, and the last version is from 2016. I'd rather use Unicode properties directly to always stay up to date. Last but not least, it doesn't provide me with equivalence classes for specific Latin characters, just one function `unidecode` to "flatten" all input to Latin characters if possible.
Cheers Rolf
Re^3: Normalizing diacritics in (regex) search by hippo (Archbishop) on Nov 25, 2025 at 10:41 UTC
> last but not least, it doesn't provide me equivalent classes for specific latin characters. Just one function unidecode to "flatten" all input to latin characters if possible.
Sorry, in that case I have misunderstood your requirements: I took it that this "flattening" was what you were after when you said "Of course I could do the normalization manually and map à á ä å ... -> a and so on." Never mind. 🦛
Re^4: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 14:00 UTC
Re: Normalizing diacritics in (regex) search by Anonymous Monk on Nov 24, 2025 at 14:08 UTC
Have you considered Unicode::Collate? A more hacky way to go might be to use Unicode::Normalize to convert the string to NFD or NFKD, then use `s///` to strip off the diacriticals. I call this hacky because this only handles diacriticals: it will not, for example, make a LATIN CAPITAL LETTER O WITH STROKE into a LATIN CAPITAL LETTER O.
Re^2: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 03:34 UTC
Actually, I stumbled over this code from brian d foy; it shows a way (well, actually two) to search the character name for the keyword `\bWITH\b`:
```perl
use utf8;
use v5.32;
use open qw(:std :utf8);
no warnings qw(experimental::uniprop_wildcards);
use charnames qw();

my @letters = qw(a à á â ã ä å);
foreach my $letter ( @letters ) {
    my $name = charnames::viacode( ord $letter );
    say "$letter ($name):",
        $letter =~ m<\p{Name=/\bWITH\b/}> ? 'Matched' : 'Missed';
}
```
In the next step I want to speed this up by preparing the mapping list for all Latin characters beforehand, so that I can use simple character classes in the regexes. I was expecting to find a ready-made function which gives me these equivalent characters right away; probably prop_invmap of Unicode::UCD can be used for this.
Cheers Rolf
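The precomputed mapping described above could be sketched roughly as follows. This is only one possible approach, not the code from the thread: it uses NFD from Unicode::Normalize instead of prop_invmap, the scanned code point ranges (Latin-1 Supplement through Latin Extended-B, plus Latin Extended Additional) are an assumption, and `%class` and `diacritic_insensitive` are hypothetical names. It also assumes the text being searched is precomposed (NFC).

```perl
use v5.16;
use utf8;
use open qw(:std :utf8);
use Unicode::Normalize qw(NFD);

# Hypothetical table: ASCII base letter => string of its precomposed variants.
my %class;
for my $cp ( 0x00C0 .. 0x024F, 0x1E00 .. 0x1EFF ) {    # assumed ranges
    my $char = chr $cp;
    # Keep only code points whose canonical decomposition is
    # "ASCII letter followed by one or more nonspacing marks".
    next unless NFD($char) =~ /\A([A-Za-z])\p{Mn}+\z/;
    $class{$1} .= $char;
}

# Hypothetical helper: turn a plain word into a regex in which every
# letter also accepts its accented variants.
sub diacritic_insensitive {
    my ($word) = @_;
    my $pattern = '';
    for my $ch ( split //, $word ) {
        $pattern .= exists $class{$ch}
            ? '[' . $ch . $class{$ch} . ']'
            : quotemeta $ch;
    }
    return qr/$pattern/;
}

my $re = diacritic_insensitive('voila');
say "$_: ", ( /$re/ ? 'Matched' : 'Missed' ) for qw(voilà voila Øslo);
```

Because the table is built from canonical decompositions only, letters such as Ø, which Unicode does not decompose, fall outside every class and would need to be added by hand.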
Re^2: Normalizing diacritics in (regex) search by ikegami (Patriarch) on Nov 25, 2025 at 18:14 UTC
Indeed, for string comparisons and substring searches, Unicode::Collate is the way to go.
```perl
use v5.40;
use utf8;
use open ':std', ':encoding(UTF-8)';

use Unicode::Collate qw( );

sub f { $_[0] < 0 ? "lt" : $_[0] > 0 ? "gt" : "eq" }

my $s1 = "voilà";   # Canonical spelling
my $s2 = "voila";   # Alternative spelling

my $collator = Unicode::Collate->new( ignore_level2 => true );
my $cmp = $collator->cmp( $s1, $s2 );
say "$s1 ".( f( $collator->cmp( $s1, $s2 ) ) )." $s2";
```
```
voilà eq voila
```
Unfortunately for the OP, while Perl's regex engine can ignore case, it doesn't support ignoring diacritics.
Re^2: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 15:47 UTC
> Have you considered Unicode::Collate ?
I stumbled over it, but I'm not sure how to use it in this case. Do you have a short demo? (It seems to have simplified versions of m// and s///, but without full regex syntax.)
Cheers Rolf
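As a rough illustration of the search methods mentioned here, the sketch below follows the pattern shown in the Unicode::Collate documentation for `index`. It is not from any poster in this thread. Note the assumptions: `normalization => undef` is needed because the module refuses to search with normalization enabled, and `level => 1` ignores case as well as diacritics, which is broader than the `ignore_level2` setting used elsewhere in the thread.

```perl
use v5.16;
use utf8;
use open qw(:std :utf8);
use Unicode::Collate;

# normalization => undef is required for the searching methods;
# level => 1 compares base letters only (diacritics and case are ignored).
my $collator = Unicode::Collate->new( normalization => undef, level => 1 );

my $text = 'Et voilà tout';
my $match;

# In list context, index() returns the position and length of the match.
if ( my ( $pos, $len ) = $collator->index( $text, 'VOILA' ) ) {
    $match = substr $text, $pos, $len;
    say "found '$match' at position $pos";
}
```

This finds literal substrings only; there is no way to combine it with regex constructs such as alternation or quantifiers.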
Re^3: Normalizing diacritics in (regex) search by Anonymous Monk on Nov 26, 2025 at 19:35 UTC
Sorry for the delay in response, but there have been distractions, I had trouble finding the code, and then dithered over whether to just put it somewhere public and point to it (I decided not to). Assuming `$in` has been properly decoded, I was proposing something like the following:
```perl
use Unicode::Normalize qw{ NFKD };
...
my $out = NFKD( $in );
$out =~ s/ \p{NonspacingMark}+ //smxg;
```
Note that this does not handle anything but diacritics. The above will change 'Köln' to 'Koln', but 'Øslo' (if it were really spelled that way) remains 'Øslo', because Unicode does not consider the stroke to be a diacritic. I think that for comparing things Unicode::Collate is actually the way to go.
Re^4: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 26, 2025 at 21:02 UTC
Re^4: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 27, 2025 at 13:16 UTC
Re: Normalizing diacritics in (regex) search by jo37 (Curate) on Nov 25, 2025 at 16:55 UTC
You might convert to "fold case", perform "canonical decomposition" and drop all non-grapheme-base characters, like
```perl
use Unicode::Normalize;
say NFD(fc) =~ s/\P{GrBase}//gr for @ARGV
```
```
'Älsdjfßüsd' -> 'alsdjfssusd'
```
Greetings, 🐻 `$gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$`
Re^2: Normalizing diacritics in (regex) search by LanX (Saint) on Nov 25, 2025 at 23:26 UTC
Clever! Not sure if it's always safe with pathological characters, but nice idea! =)
Cheers Rolf
Re: Normalizing diacritics in (regex) search by dsheroh (Monsignor) on Nov 27, 2025 at 07:44 UTC
I use Text::Unaccent::PurePerl for this sort of thing and am quite happy with it. The documentation is also rather clear about what exactly it does, so you can see whether it matches what you want or not. While looking up those docs just now, I also stumbled across Text::Transliterator::Unaccent which appears to be a configurable solution using Unicode attributes, but I haven't used it, so I can't comment on how well it actually works.

UPDATES