I am not sure exactly where this node fits. It is too shallow to be a tutorial, but I think the answer makes a nice reference for others interested in the topic.

How to unaccent text?

Many Western languages use the Roman alphabet (a, b, ..., z) plus a handful of diacritical marks, often called accents (as in French événement, Portuguese açaí, Spanish vicuña). Languages like Polish, Czech and Serbian use even more diacritics and digraphs, but I don't have much to say about those.

The Roman alphabet is particularly favoured in the computing world because it is part of the ASCII character set. There are lots of extensions that build more representative character sets while keeping some compatibility with ASCII. But sometimes you will want to downgrade your fancy strings to the plain old [\0-\x7F] character range. The reason could be a boss who doesn't trust modern technology, compatibility issues, the ease of accent-insensitive comparison, etc.
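The target range mentioned above is easy to test for with a regex, which tells you whether a string needs unaccenting at all (or whether an unaccenting step really left nothing behind). This is only a minimal sketch using core Perl; is_plain_ascii is a hypothetical helper name, not part of any module discussed here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when every character of $text
# falls inside the [\0-\x7F] range, 0 otherwise.
sub is_plain_ascii {
    my ($text) = @_;
    return $text =~ /\A[\x00-\x7F]*\z/ ? 1 : 0;
}

# The accented word is written with \x{...} escapes so that this
# source file itself stays plain ASCII.
for my $word ("evenement", "\x{e9}v\x{e9}nement") {
    my $verdict = is_plain_ascii($word) ? "already plain" : "needs unaccenting";
    print length($word), " characters: $verdict\n";
}
```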

In Perl, how do you unaccent text?

Text::Unaccent

A very good thing about this module is its name: it is obvious that it fits the task at hand.

The Text::Unaccent distribution has to be compiled at install time because it has an XS component and depends on the iconv library.

It supports multiple character sets and is used like this:

    use Text::Unaccent;
    $unaccented = unac_string($charset, $text);

where $charset is something like "iso-8859-1", "utf-8", etc.

Text::Unidecode

Unlike the previous module, the purpose of Text::Unidecode is not specifically to remove accents from a string. Its broader objective is to provide ASCII transliterations of Unicode text.

For example, it may convert "\x{5317}\x{4EB0}" (Chinese characters for Beijing) to "Bei Jing". But what is interesting here is that transliterations of Roman characters with accents are usually the naked Roman characters (as we wanted).

The module lives in a pure Perl distribution, which makes it very portable and quick to install.

It is just as easy to use as Text::Unaccent:

    use utf8;
    use Text::Unidecode;
    $unaccented = unidecode($text);

There is no character set argument because the input must already be a Perl Unicode (character) string, not raw bytes in some other encoding.
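If your data arrives as encoded bytes, the usual approach is to decode it into characters before calling unidecode. The sketch below assumes the core Encode module for that step; the sample byte strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Raw bytes as they might arrive from a file or socket (illustrative data).
my $latin1_bytes = "a\xE7a\xED";      # "açaí" encoded as ISO-8859-1
my $utf8_bytes   = "vicu\xC3\xB1a";   # "vicuña" encoded as UTF-8

# Decode each byte string with its own encoding to get Perl characters,
# then hand the character string to unidecode().
my $from_latin1 = unidecode( decode('ISO-8859-1', $latin1_bytes) );
my $from_utf8   = unidecode( decode('UTF-8',      $utf8_bytes) );

print "$from_latin1 $from_utf8\n";
```

The point of the design is that the character set knowledge lives in the decoding step, so unidecode itself never has to be told about encodings.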

Read about the module's rationale and shortcomings in its documentation.

Text::StripAccents

This module is very lightweight but limited. It is just what you want if you're dealing only with Latin-1 strings.

    use Text::StripAccents;
    $unaccented = stripaccents($text);

(And there is an OO API as well.) This is also a pure Perl distribution.
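For comparison, here is a rough sketch of both interfaces side by side. The new constructor and strip method are written as I recall them from the module's synopsis, so treat them as an assumption and check the documentation of your installed version; the sample string is made up.

```perl
use strict;
use warnings;
use Text::StripAccents;

# "événement" as Latin-1 bytes, written with escapes so that this
# source file stays plain ASCII.
my $text = "\xE9v\xE9nement";

# Procedural interface, same call as in the snippet above.
my $plain = stripaccents($text);

# OO interface; new() and strip() are assumed from the module's synopsis.
my $stripper   = Text::StripAccents->new();
my $also_plain = $stripper->strip($text);

print "$plain\n$also_plain\n";
```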

Acknowledgments

Thanks to rhesa, Corion and Syphilis, who helped me out when I was looking for alternatives to Text::Unaccent and inspired me to write these notes for others to review.


In reply to RFC: How to unaccent text? by ferreira
