PerlMonks  

Quick way to convert to ASCII

by kettle (Beadle)
on Jul 26, 2006 at 02:35 UTC ( [id://563687]=perlquestion )

kettle has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am looking for a quick and easy way to convert UTF-8 or LATIN-1 characters to their closest ASCII equivalent. Thus an accented 'e' should be mapped to a 'regular no frills' ASCII 'e', and similarly an 'A' with a tilde over it should be mapped to a standard uppercase 'A'. I can use individual hex codes and map characters with a host of regexes, but this seems like overkill. Any clever thoughts would be appreciated!

Replies are listed 'Best First'.
Re: Quick way to convert to ASCII
by blokhead (Monsignor) on Jul 26, 2006 at 04:14 UTC
    Text::Unidecode looks like it does exactly that. It's pure Perl, but since it's essentially a giant lookup table for all of Unicode, it's not small (748k).
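
For anyone who wants to try it, here is a minimal sketch of calling the module. Text::Unidecode lives on CPAN rather than in core, so the sketch loads it defensively; the wrapper name ascii_via_unidecode is made up for illustration and is not part of the module.

```perl
use strict;
use warnings;
use utf8;                               # this source contains literal UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Text::Unidecode is a CPAN module, not core Perl, so load it at run time
# and remember whether that worked.
my $have_unidecode = eval { require Text::Unidecode; 1 };

# Hypothetical wrapper: returns the ASCII rendering of $text, or undef
# when Text::Unidecode is not installed.
sub ascii_via_unidecode {
    my ($text) = @_;
    return unless $have_unidecode;
    return Text::Unidecode::unidecode($text);
}

my $out = ascii_via_unidecode("Les naïfs ægithales hâtifs");
print defined $out ? "$out\n" : "Text::Unidecode is not installed\n";
```

The input has to be a decoded character string (hence use utf8 for the literal here); data read from a file or socket would need decoding first.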

    blokhead

      It gets the ligature right and has a great motto :) :

      MOTTO

      The Text::Unidecode motto is:
      It's better than nothing!

      ...in both meanings: 1) seeing the output of unidecode(...) is better than just having all font-unavailable Unicode characters replaced with "?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's nothing that Text::Unidecode's algorithm is better than.

      DWIM is Perl's answer to Gödel
Re: Quick way to convert to ASCII
by GrandFather (Saint) on Jul 26, 2006 at 03:10 UTC

    At the end of the day there has to be a lookup. That can be fairly quick using the translation function:

    use warnings;
    use strict;

    my $str = <<'STR';
    Les naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs d'être déçus et de voir leurs drôles d'œufs abîmés
    STR

    my %xlateL = (
        a => 'âà', c => 'ç', e => 'èëéê', i => 'ïî', o => 'ô', u => 'ùû'
        #...
    );

    my %xlateU;
    $xlateU{uc $_} = uc ($xlateL{$_}) for keys %xlateL; # Generate the upper case versions

    eval "\$str =~ tr/$xlateL{$_}/$_/;" for keys %xlateL;
    eval "\$str =~ tr/$xlateU{$_}/$_/;" for keys %xlateU;
    print $str;

    Prints:

    Les naifs ægithales hatifs pondant a Noel ou il gele sont surs d'etre decus et de voir leurs droles d'œufs abimes

    Note that æ causes a little grief, however: tr/// maps one character to one character, so it cannot turn æ into the two letters 'ae'. Using a regex rather than the translation, with a separate set of tables, is probably the fix for that.
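
A rough sketch of that regex-plus-table idea, assuming the same sort of lowercase table as above with two ligature entries added. The names %accented_for, %ascii_for and fold_to_ascii are invented for this example, and the table is nowhere near complete.

```perl
use strict;
use warnings;
use utf8;                               # literal accented characters below
use feature 'unicode_strings';          # make uc() use Unicode rules
binmode STDOUT, ':encoding(UTF-8)';

# ASCII replacement => the characters it stands in for (illustrative only).
my %accented_for = (
    a  => 'âà', c  => 'ç', e => 'èëéê', i => 'ïî', o => 'ô', u => 'ùû',
    ae => 'æ',  oe => 'œ',              # two-letter targets, beyond tr///
);

# Invert into a per-character lookup and add the upper case versions.
my %ascii_for;
for my $ascii (keys %accented_for) {
    for my $char (split //, $accented_for{$ascii}) {
        $ascii_for{$char}    = $ascii;
        $ascii_for{uc $char} = uc $ascii;
    }
}

# Hypothetical helper: returns a copy of $text with every non-ASCII
# character found in the table replaced; unknown characters are kept.
sub fold_to_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7f])/exists $ascii_for{$1} ? $ascii_for{$1} : $1/ge;
    return $text;
}

print fold_to_ascii("Les naïfs ægithales hâtifs pondant à Noël"), "\n";
```

Because the replacement side is an arbitrary string, æ can become 'ae' and œ can become 'oe', and no string eval is needed.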

    This would make a good CPAN module when you've got it done. :)


    DWIM is Perl's answer to Gödel
Re: Quick way to convert to ASCII
by ikegami (Patriarch) on Jul 26, 2006 at 03:04 UTC

      I notice Text::StripAccents at least (I didn't find Text::Unaccent using ppm) suffers the æ problem. No great surprise that something written to handle accents doesn't handle ligatures, but somewhat disappointing.
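
The accent/ligature split is easy to see with a different technique from the modules named above: decompose with the core Unicode::Normalize module and drop the combining marks. This is only a sketch of that general approach, not how Text::StripAccents or Text::Unaccent work internally, and strip_marks is a made-up name.

```perl
use strict;
use warnings;
use utf8;                               # literal accented characters below
use Unicode::Normalize qw(NFD);
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: canonically decompose, then delete the combining
# marks (general category Mn) that the decomposition split off.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

print strip_marks("Les naïfs ægithales hâtifs, drôles d'œufs"), "\n";
```

Accented letters decompose into a base letter plus a mark, so they come out as plain ASCII. æ and œ have no canonical decomposition, so they pass through untouched and still need a lookup table.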


      DWIM is Perl's answer to Gödel
Re: Quick way to convert to ASCII
by Thelonius (Priest) on Jul 26, 2006 at 12:31 UTC
    I happened across a table for this just yesterday (with Greek and Cyrillic transliterations, too), so here's some Perl from that table:
    # in-place
    sub asciiize {
        $_[0] =~ s/([^\0-\x7f])/exists($asciiize{$1})?$asciiize{$1}:"?"/eg;
        return $_[0];
    }

    # returns new
    sub giveascii {
        asciiize(my $x = shift);
    }
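
To show how the two subs differ in use, here is a small self-contained sketch. The %asciiize table is not reproduced in the post, so the three entries below are stand-ins invented purely for illustration.

```perl
use strict;
use warnings;
use utf8;                               # literal non-ASCII characters below
binmode STDOUT, ':encoding(UTF-8)';

# Stand-in table (an assumption; the real one is much larger).
my %asciiize = ( 'é' => 'e', 'æ' => 'ae', 'Я' => 'Ya' );

# in-place: @_ aliases the caller's variable, so $_[0] edits it directly
sub asciiize {
    $_[0] =~ s/([^\0-\x7f])/exists($asciiize{$1}) ? $asciiize{$1} : "?"/eg;
    return $_[0];
}

# returns new: only the copy in $x is modified
sub giveascii {
    asciiize(my $x = shift);
}

my $original = "café Ω";
my $copy     = giveascii($original);    # $original keeps its accents
print "$copy\n";
asciiize($original);                    # now $original itself is rewritten
print "$original\n";
```

Characters missing from the table (Ω here) come out as "?", which is the fallback the substitution hard-codes.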

    Edited by planetscape - added readmore tags
