in reply to Normalizing diacritics in (regex) search

Have you tried Text::Unidecode?


🦛

Re^2: Normalizing diacritics in (regex) search
by Corion (Patriarch) on Nov 24, 2025 at 12:53 UTC

    I'm also very fond of Text::Unidecode, but it does slightly more. It also transliterates some non-Latin scripts into Latin, and it transliterates German umlauts to their German equivalents, like ä to ae.

    But for a quick first stab, using Text::Unidecode does 90% of what one wants.

Re^2: Normalizing diacritics in (regex) search
by LanX (Saint) on Nov 25, 2025 at 04:10 UTC
    As Corion said, it does a lot more. Probably too much for my use case.

    It's also implemented as a large set of translation tables that are (manually?) maintained by the author, and the last release is from 2016.

    I'd rather use Unicode properties directly, so that I always stay up to date.

    Last but not least, it doesn't give me equivalence classes for specific Latin characters. It offers just one function, unidecode, to "flatten" all input to Latin characters where possible.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      "last but not least, it doesn't provide me equivalent classes for specific latin characters. Just one function unidecode to 'flatten' all input to latin characters if possible."

      Sorry, in that case I have misunderstood your requirements as I took it that this "flattening" is what you were after when you said "Of course I could do the normalization manually and map à á ä å ... -> a and so on." - never mind.


      🦛

        No! No need to apologize, I was asking for input.

        You just asked if I tried that module and I wanted to share my insights.*

        The unidecode mapping à á ä å ... -> a would force me to normalize all search data.
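
        A minimal sketch of that first approach, for comparison. It does not use Text::Unidecode; it assumes the core module Unicode::Normalize and simply drops combining marks after canonical decomposition. The helper name strip_marks and the sample data are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                         # this source contains literal UTF-8 characters
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonically decompose the text, then drop all
# nonspacing combining marks (\p{Mn}), so that e.g. "à" becomes "a".
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# Every record has to be normalized before it is compared with the
# (plain ASCII) search term.
my @data = ('Ångström', 'déjà vu', 'naïve');
my @hits = grep { strip_marks($_) =~ /angstrom/i } @data;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @hits;
```

        Letters without a canonical decomposition, such as ø or ß, pass through unchanged, so this is narrower than what unidecode does.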

        The reverse mapping, a -> à á ä å, lets me fix up the search term instead, by replacing every a with a character class like [àáäå], and so on.
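
        A rough sketch of how such classes could be derived from Unicode data instead of hand-written tables. The scanned range (U+00C0 to U+017F), the %variants hash and the expand_term helper are assumptions for illustration; each class also includes the base letter itself, so that plain "a" still matches.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Map each ASCII base letter to the precomposed characters that decompose
# to that letter plus combining marks.
# Assumption: only Latin-1 Supplement and Latin Extended-A are scanned.
my %variants;
for my $cp (0x00C0 .. 0x017F) {
    my $char = chr $cp;
    my $nfd  = NFD($char);
    next if $nfd eq $char;                          # no canonical decomposition
    next unless $nfd =~ /\A([A-Za-z])\p{Mn}+\z/;    # base letter + marks only
    $variants{$1} .= $char;
}

# Hypothetical helper: turn a plain search term into a regex in which every
# letter with known variants becomes a character class.
sub expand_term {
    my ($term) = @_;
    my $pattern = join '', map {
        exists $variants{$_} ? "[$_$variants{$_}]" : quotemeta $_
    } split //, $term;
    return qr/$pattern/;
}

my $re = expand_term('deja');

binmode STDOUT, ':encoding(UTF-8)';
print "match\n" if 'déjà vu' =~ $re;
```

        This only matches precomposed characters, so it assumes the search data is in NFC form; decomposed (NFD) input would need extra handling.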

        Both approaches have their pros and cons; I prefer to have the choice. :)

        Cheers Rolf

        *) reworded