I am not sure exactly where this node fits. It is too shallow to be a tutorial, but I think the answer is a nice reference for others interested in the topic.

How to unaccent text?

Many Western languages use the Roman alphabet (a, b, ..., z) with anywhere from a few to many diacritical marks. These marks are often called accents (as in French événement, Portuguese açaí, Spanish vicuña). Languages such as Polish, Czech and Serbian use even more diacritics and digraphs, but I don't have much to say about those.

The Roman alphabet is particularly favoured in the computing world because it is part of the ASCII character set. Many extensions provide more representative character sets while keeping some compatibility with ASCII. But sometimes you will want to downgrade your fancy strings to the plain old [\0-\x7F] character range. The reason could be a boss who doesn't trust modern technology, compatibility issues, the ease of accent-insensitive comparison, etc.

In Perl, how do you unaccent text?

Text::Unaccent

A very good thing about this module is its name: it is obvious that it fits the task at hand.

The Text::Unaccent distribution must be compiled to be installed, because it uses an XS component and depends on the iconv library.

It supports multiple character sets and is used like this:

use Text::Unaccent;
$unaccented = unac_string($charset, $text);

where $charset is something like "iso-8859-1", "utf-8", etc.

Text::Unidecode

Unlike the previous module, the purpose of Text::Unidecode is not to remove accents from a string. It has a broader objective to provide ASCII transliterations of Unicode text.

For example, it may convert "\x{5317}\x{4EB0}" (Chinese characters for Beijing) to "Bei Jing". But what is interesting here is that transliterations of Roman characters with accents are usually the naked Roman characters (as we wanted).

The module lives in a pure Perl distribution (which makes it very portable and quick to install).

It is just as easy to use as Text::Unaccent:

use utf8;
use Text::Unidecode;
$unaccented = unidecode($text);

There is no character set argument because the input must already be a Perl Unicode (character) string, not raw encoded bytes.

Read about the module rationale and shortcomings in its documentation.
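
Since unidecode expects characters rather than bytes, text read from a file or socket usually has to be decoded first. Here is a minimal sketch of that step, plus the accent-insensitive comparison mentioned earlier. The helper names to_ascii and same_unaccented are made up for illustration and are not part of Text::Unidecode; the sketch also assumes the input really is valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Hypothetical helper: takes UTF-8 encoded bytes, returns plain ASCII text.
sub to_ascii {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> Perl character string
    return unidecode($chars);               # characters -> ASCII transliteration
}

# Hypothetical helper: accent-insensitive and case-insensitive equality.
sub same_unaccented {
    my ($x, $y) = @_;
    return lc(to_ascii($x)) eq lc(to_ascii($y));
}

# "événement" as raw UTF-8 bytes, the way it would arrive from a file
my $bytes = "\xC3\xA9v\xC3\xA9nement";

print to_ascii($bytes), "\n";                                          # evenement
print same_unaccented($bytes, "Evenement") ? "match\n" : "no match\n"; # match
```

Lowercasing after transliteration keeps the comparison within ASCII, so there are no locale or Unicode case-folding surprises.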

Text::StripAccents

This module is very lightweight but restricted. It is just what you want if you're dealing only with Latin-1 strings.

use Text::StripAccents;
$unaccented = stripaccents($text);

(And there is an OO API as well.) This is also a pure Perl distribution.

Acknowledgments

Thanks to rhesa, Corion and Syphilis, who helped me out when I was looking for alternatives to Text::Unaccent and inspired me to write these notes for others to review.
