PerlMonks  

I am not sure where this node fits exactly. It is too shallow to be a tutorial, but I think the answer is a nice reference for others interested in the topic.

How to unaccent text?

Many Western languages use the Roman alphabet (a, b, ..., z) with a few, or a whole bunch of, diacritical marks. These diacritical marks are often called accents (as in French événement, Portuguese açaí, Spanish vicuña). Languages such as Polish, Czech and Serbian use even more diacritics and digraphs, but I don't have much to say about those.

The Roman alphabet is particularly favoured in the computing world, because it is part of the ASCII character set. There are lots of extensions that provide more representative character sets while keeping some compatibility with ASCII. But sometimes you will want to downgrade your fancy strings to the plain old [\0-\x7F] character range. The reason could be a boss who doesn't trust modern technology, compatibility issues, the ease of accent-insensitive comparison, etc.

In Perl, how do you unaccent text?

Text::Unaccent

A very good thing about this module is its name: it is obvious that it fits the task at hand.

The Text::Unaccent distribution must be compiled to be installed, because it uses an XS component and depends on the iconv library.

It supports multiple character sets and is used like this:

use Text::Unaccent;
$unaccented = unac_string($charset, $text);

where $charset is something like "iso-8859-1", "utf-8", etc.
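As a minimal sketch of the call above, assuming Text::Unaccent is installed and that the text arrives as UTF-8 encoded bytes (the sample word and the variable names are only illustrative):

```perl
use strict;
use warnings;
use utf8;                      # this source contains a literal accented word
use Encode qw(encode);
use Text::Unaccent;            # exports unac_string

# Assumption: unac_string expects bytes encoded in the named charset,
# so the Perl character string is encoded to UTF-8 first.
my $bytes      = encode('UTF-8', 'événement');
my $unaccented = unac_string('UTF-8', $bytes);

print "$unaccented\n";
```

The charset name passed to unac_string has to match the actual encoding of the bytes, otherwise the conversion done through iconv is likely to misread the input.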

Text::Unidecode

Unlike the previous module, the purpose of Text::Unidecode is not specifically to remove accents from a string. It has the broader objective of providing ASCII transliterations of Unicode text.

For example, it may convert "\x{5317}\x{4EB0}" (Chinese characters for Beijing) to "Bei Jing". But what is interesting here is that transliterations of Roman characters with accents are usually the naked Roman characters (as we wanted).

The module lives in a pure Perl distribution (which makes it very portable and quick to install).

It is just as easy to use as Text::Unaccent:

use utf8;
use Text::Unidecode;
$unaccented = unidecode($text);

There is no character set argument because the input must already be a Unicode (utf8) string.
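A short sketch of that usage, assuming Text::Unidecode is installed; the sample words are the ones quoted earlier in this node and the variable names are only illustrative:

```perl
use strict;
use warnings;
use utf8;                      # literal accented words appear in this source
use Text::Unidecode;           # exports unidecode

# Sample words from the introduction of this node.
my @words      = ('événement', 'açaí', 'vicuña');
my @unaccented = map { unidecode($_) } @words;

# The results are plain ASCII, so no output layer is needed here.
print join(', ', @unaccented), "\n";   # evenement, acai, vicuna
```

If the text comes from a file or a socket rather than from a literal in the source, it would first need to be decoded into a Perl character string (for example with Encode::decode) before being passed to unidecode.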

Read about the module rationale and shortcomings in its documentation.

Text::StripAccents

This module is very lightweight, but restricted. It is just what you want if you're dealing only with Latin-1 strings.

use Text::StripAccents;
$unaccented = stripaccents($text);

(And there is an OO API as well.) This is also a pure Perl distribution.
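A minimal sketch with a Latin-1 input, assuming Text::StripAccents is installed and that stripaccents is exported by default as in the line above (the sample string and variable names are only illustrative):

```perl
use strict;
use warnings;
use Text::StripAccents;        # assumed to export stripaccents by default

# "\xE9" is e-acute in Latin-1, so this is "événement" as a Latin-1 string.
my $latin1     = "\xE9v\xE9nement";
my $unaccented = stripaccents($latin1);

print "$unaccented\n";
```

Writing the accented characters as \x escapes keeps the example independent of the encoding of the source file.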

Acknowledgments

Thanks to rhesa, Corion and Syphilis, who helped me out when I was looking for alternatives to Text::Unaccent and inspired me to write these notes for others to review.


In reply to RFC: How to unaccent text? by ferreira
