PerlMonks  

Re: RFC: How to unaccent text?

by salva (Canon)
on Apr 11, 2007 at 08:53 UTC


in reply to RFC: How to unaccent text?

Looking at the Text::StripAccents source code, it seems quite inefficient: it splits the string into characters, loops over them replacing each accented one with its ASCII equivalent, and then joins the string again.

A regular expression substitution could do it:

# Map each accented Latin-1 character to its ASCII replacement.
my %table = (
    'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
    'Ç' => 'C',
    'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
    'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
    'Ñ' => 'N',
    'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O',
    'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
    'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
    'ç' => 'c',
    'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
    'ñ' => 'n',
    'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
    'ß' => 'ss',
    'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
    'ý' => 'y',
);

# Replace every non-ASCII character; anything missing from the table becomes '?'.
sub strip_accents {
    my $str = shift;
    $str =~ s/([^\x00-\x7F])/$table{$1} || '?'/ge;
    return $str;
}

It's so simple that it makes me wonder whether a module is actually required...

And BTW, "unaccenting" characters is not a unique transformation; it depends on the language of the text. For instance, in German 'ü' should be mapped to 'ue' (see Lingua::DE::ASCII), but in Spanish it should be mapped to 'u'.
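To illustrate the language-dependence point, here is a minimal sketch with one table per language. The function name strip_accents_for, the language codes and the hash layout are all made up for this example, the tables are deliberately incomplete, and it assumes the input has already been decoded to a character string:

```perl
use strict;
use warnings;
use utf8;    # the accented literals below are UTF-8 in this source file

# Mappings shared by every language in this sketch (deliberately incomplete).
my %common = (
    'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u', 'ñ' => 'n',
);

# Per-language tables; the language codes and this layout are hypothetical,
# chosen only for illustration.
my %by_lang = (
    de => { %common, 'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss' },
    es => { %common, 'ü' => 'u' },
);

# Expects $str to be a decoded character string, not raw bytes.
sub strip_accents_for {
    my ($lang, $str) = @_;
    my $table = $by_lang{$lang} or die "no table for language '$lang'";
    # Characters missing from the chosen table become '?'.
    $str =~ s/([^\x00-\x7F])/exists $table->{$1} ? $table->{$1} : '?'/ge;
    return $str;
}
```

With these tables, strip_accents_for('de', 'für') gives 'fuer', while strip_accents_for('es', 'pingüino') gives 'pinguino'. Since 'ß' is absent from the Spanish table, it comes out as '?' there.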

Re^2: RFC: How to unaccent text?
by bart (Canon) on Apr 11, 2007 at 10:32 UTC
    Looking at the Text::StripAccents source code, it seems quite inefficient: it splits the string into characters, loops over them replacing each accented one with its ASCII equivalent, and then joins the string again.
    Ouch. That sounds to me like it could be improved, and probably without changing the API. So, it could be better in a next version... (if somebody lends the author a hand. It could be you.)
    It's so simple that it makes me wonder whether a module is actually required...
    What about the datatable... Are you going to construct it by hand, every time? Or will you be using copy-and-paste?

    Make it a module, it's the perfect place for it.

    p.s. I suppose tr/// would be a lot more efficient than s///, at least for single character replacements. You might benchmark it, to compare.
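One way to try the comparison suggested in the p.s. is a rough sketch using the core Benchmark module. The sub names, the reduced table and the sample string are invented for this example, and no timings are claimed here; results will vary with the Perl version and the input.

```perl
use strict;
use warnings;
use utf8;                       # accented literals below are UTF-8 in this file
use Benchmark qw(cmpthese);

# A deliberately small table: lowercase accented vowels only.
my %table = ( 'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u' );

# s/// with a hash lookup, as in the parent post.
sub strip_with_s {
    my $str = shift;
    $str =~ s/([^\x00-\x7F])/$table{$1} || '?'/ge;
    return $str;
}

# tr/// maps one character to exactly one character, so it cannot
# produce 'ss' for 'ß'; characters not listed pass through unchanged.
sub strip_with_tr {
    my $str = shift;
    $str =~ tr/áéíóú/aeiou/;
    return $str;
}

my $sample = 'El camión pasó rápido por aquí ' x 100;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    's///'  => sub { strip_with_s($sample) },
    'tr///' => sub { strip_with_tr($sample) },
} );
```

The two variants are not equivalent: tr/// only handles one-to-one replacements, so multi-character mappings such as 'ß' to 'ss' would still need a separate s/// pass.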

      What about the datatable... Are you going to construct it by hand, every time? Or will you be using copy-and-paste?

      Well, as I pointed out in my previous reply, the transformation is not unique; there could be several variations, and including the table in the code is an easy way to ensure that the right one is used.

      For instance, Text::StripAccents converts 'ß' to 'ss', something unexpected for a Spanish user like me.

      IMO, the right solution would be to create a set of language dependent modules similar to Lingua::DE::ASCII.
