PerlMonks  

RFC: How to unaccent text?

by ferreira (Chaplain)
on Apr 10, 2007 at 14:56 UTC ( [id://609166]=perlmeditation )

I am not sure exactly where this node fits. It is too shallow to be a tutorial, but I think the answer is a nice reference for others interested in the topic.

How to unaccent text?

Many Western languages use the Roman alphabet (a, b, ..., z) with a few or many diacritical marks. These marks are often called accents (as in French événement, Portuguese açaí, Spanish vicuña). Languages such as Polish, Czech, and Serbian use even more diacritics and digraphs, but I don't have much to say about them.

The Roman alphabet is particularly favoured in the computer world because it is part of the ASCII character set. There are many extensions that provide more representative character sets while keeping some compatibility with ASCII. But sometimes you will want to downgrade your fancy strings to the plain old [\0-\x7F] character range. It could be because your boss doesn't trust modern technology, because of compatibility issues, to make accent-insensitive comparison easier, etc.

In Perl, how do you unaccent text?

Text::Unaccent

A very good thing about this module is its name: it is obvious that it fits the task at hand.

The Text::Unaccent distribution must be compiled at install time because it uses an XS component and depends on the iconv library.

It supports multiple character sets and is used like this:

    use Text::Unaccent;
    $unaccented = unac_string($charset, $text);

where $charset is something like "iso-8859-1", "utf-8", etc.

Text::Unidecode

Unlike the previous module, the purpose of Text::Unidecode is not to remove accents from a string. It has a broader objective to provide ASCII transliterations of Unicode text.

For example, it may convert "\x{5317}\x{4EB0}" (Chinese characters for Beijing) to "Bei Jing". But what is interesting here is that transliterations of Roman characters with accents are usually the naked Roman characters (as we wanted).

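The module lives in a pure Perl distribution (which makes it very portable and quick to install).

It is just as easy to use as Text::Unaccent:

    use utf8;
    use Text::Unidecode;
    $unaccented = unidecode($text);

There is no character set argument because the input must be a Perl Unicode (character) string. Note that use utf8 only tells Perl that literals in the source file are UTF-8; text read from elsewhere still has to be decoded first.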
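The point about Unicode strings can be sketched as follows. This is only an illustration, assuming the input arrives as raw UTF-8 bytes (for example from a file or socket); the helper name unaccent_utf8_bytes is made up for this example. Text::Unidecode is a CPAN module rather than a core one, so the sketch loads it at run time and skips the demo when it is missing.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Text::Unidecode is not a core module, so load it at run time
# and remember whether that worked.
my $have_unidecode = eval { require Text::Unidecode; 1 };

# Hypothetical helper: takes raw UTF-8 bytes, returns plain ASCII text.
sub unaccent_utf8_bytes {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);          # bytes -> Perl characters
    return Text::Unidecode::unidecode($chars);    # characters -> ASCII
}

if ($have_unidecode) {
    # "\xC3\xA9" is the UTF-8 byte sequence for e-acute
    print unaccent_utf8_bytes("\xC3\xA9v\xC3\xA9nement"), "\n";
}
else {
    warn "Text::Unidecode is not installed; skipping the demo\n";
}
```

Passing the undecoded bytes straight to unidecode would treat each byte as a separate Latin-1 character, which is usually not what you want.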

Read about the module rationale and shortcomings in its documentation.

Text::StripAccents

This module is very lightweight, but limited. It is just what you want if you're dealing only with Latin-1 strings.

    use Text::StripAccents;
    $unaccented = stripaccents($text);

(And there is an OO API as well.) This is also a pure Perl distribution.

Acknowledgments

Thanks to rhesa, Corion and Syphilis who helped me out when I was looking for alternatives for Text::Unaccent and inspired me to write these notes for others to review.

Replies are listed 'Best First'.
Re: RFC: How to unaccent text?
by salva (Canon) on Apr 11, 2007 at 08:53 UTC
    Looking at Text::StripAccents source code, it seems quite inefficient: it splits the string in chars, loops over them replacing accented ones by their ASCII equivalent and then joins the string again.

    A regular expression substitution could do it:

    # Assumes the table keys and $str use the same encoding
    # (for example, both are Latin-1, or both are decoded under "use utf8").
    my %table = (
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'Ç' => 'C',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
        'Ñ' => 'N',
        'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O',
        'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ñ' => 'n',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
        'ß' => 'ss',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'ý' => 'y',
    );

    # Replace each non-ASCII character with its table entry, or '?' if unmapped.
    sub strip_accents {
        my $str = shift;
        $str =~ s/([^\x00-\x7F])/$table{$1} || '?'/ge;
        $str
    }

    It's so simple that it makes me wonder whether a module is actually required...

    And BTW, "unaccenting" chars is not a unique transformation; it depends on the language of the text. For instance, in German 'ü' should be mapped to 'ue' (see Lingua::DE::ASCII), but in Spanish it should be mapped to 'u'.
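To make the language dependence concrete, here is a minimal sketch with one table per language. The %tables hash, the language codes and the function name strip_accents_for are all hypothetical, and only a handful of characters are listed; it is not how Lingua::DE::ASCII or any other module is implemented.

```perl
use strict;
use warnings;
use utf8;    # the table keys below are UTF-8 literals in the source

# Hypothetical per-language tables; only a few entries are shown.
my %tables = (
    de => { 'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss' },
    es => { 'á' => 'a', 'é' => 'e', 'í' => 'i', 'ñ' => 'n',
            'ó' => 'o', 'ú' => 'u', 'ü' => 'u' },
);

# Replace each non-ASCII character using the table chosen by $lang;
# characters missing from that table become '?'.
sub strip_accents_for {
    my ($lang, $str) = @_;
    my $table = $tables{$lang} or die "no table for language '$lang'";
    $str =~ s/([^\x00-\x7F])/$table->{$1} \/\/ '?'/ge;
    return $str;
}

print strip_accents_for('de', 'Müller'),   "\n";   # Mueller
print strip_accents_for('es', 'pingüino'), "\n";   # pinguino
```

The same 'ü' comes out as 'ue' or 'u' depending on which table is selected, which is why the caller has to say which language the text is in.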

      Looking at Text::StripAccents source code, it seems quite inefficient: it splits the string in chars, loops over them replacing accented ones by their ASCII equivalent and then joins the string again.
      Ouch. That sounds to me like it could be improved, and probably without changing the API. So, it could be better in a future version... (if somebody lends the author a hand. It could be you.)
      It's so simple that it makes me wonder whether a module is actually required...
      What about the data table... Are you going to construct it by hand, every time? Or will you be using copy-and-paste?

      Make it a module, it's the perfect place for it.

      p.s. I suppose tr/// would be a lot more efficient than s///, at least for single character replacements. You might benchmark it, to compare.
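The suggested benchmark could look roughly like this. It is only a sketch: the table is cut down to six characters, the sample sentence is made up, and the sub names are hypothetical. It uses the core Benchmark module and makes no claim about which variant wins on your machine.

```perl
use strict;
use warnings;
use utf8;    # accented literals below are UTF-8 in the source
use Benchmark qw(cmpthese);

# Shortened table, single-character replacements only.
my %table = ( 'á' => 'a', 'é' => 'e', 'í' => 'i',
              'ó' => 'o', 'ú' => 'u', 'ñ' => 'n' );

# s/// with a hash lookup for every non-ASCII character.
sub strip_with_s {
    my ($str) = @_;
    $str =~ s/([^\x00-\x7F])/$table{$1} \/\/ '?'/ge;
    return $str;
}

# tr/// with the same six mappings written out positionally.
sub strip_with_tr {
    my ($str) = @_;
    $str =~ tr/áéíóúñ/aeioun/;
    return $str;
}

my $sample = 'El niño pidió un café según la canción. ' x 100;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    's///' => sub { strip_with_s($sample) },
    'tr///' => sub { strip_with_tr($sample) },
});
```

Note that tr/// can only map one character to one character, so an entry like 'ß' => 'ss' would still need s/// or a separate step.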

        What about the datatable... Are you going to construct it by hand, every time? Or will you be using copy-and-paste?

        Well, as I pointed out in my previous reply, the transformation is not unique. There could be several variations, and including the table in the code is an easy way to ensure that the right one is used.

        For instance, Text::StripAccents converts 'ß' to 'ss', something unexpected for a Spanish user like me.

        IMO, the right solution would be to create a set of language-dependent modules similar to Lingua::DE::ASCII.

Re: RFC: How to unaccent text?
by g0n (Priest) on Apr 11, 2007 at 09:55 UTC
    Two comments about Text::StripAccents:

    • at the time that I uploaded it to CPAN there was nothing else that did the job (well, it was about the same time as another module, as I discovered later), it was something I needed to do fairly frequently, and I had the simple code lying around.
    • I've learned quite a bit since then, both about coding and the standards that are expected of CPAN modules.
    Since it appears to have been at least slightly useful, if only because it doesn't require a compiler, I'll revisit it and make it more efficient and better documented.

    --------------------------------------------------------------

    "If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
    John Brunner, "The Shockwave Rider".

Re: RFC: How to unaccent text?
by wfsp (Abbot) on Apr 11, 2007 at 11:17 UTC
Re: RFC: How to unaccent text?
by bsb (Priest) on Apr 17, 2007 at 01:34 UTC

Node Type: perlmeditation [id://609166]
Approved by blokhead
Front-paged by Old_Gray_Bear