Removing Foreign Characters

existem has asked for the wisdom of the Perl Monks concerning the following question:

Re: Removing Foreign Characters by dragonchild (Archbishop) on Jan 27, 2005 at 15:27 UTC
Unicode::Map8

Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
Re: Removing Foreign Characters by mirod (Canon) on Jan 27, 2005 at 15:31 UTC
I think Text::Unidecode would do what you want (see also The Perl Advent Calendar).
Re: Removing Foreign Characters by borisz (Canon) on Jan 27, 2005 at 15:31 UTC
use Encode;
perldoc Encode
See: encode, decode, from_to

Boris
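For readers who find the Encode documentation heavy going, here is a minimal sketch of the three functions named above. The sample string and the variable names are illustrative assumptions, not anything from the original question. Note that this only changes how a character is stored; it does not turn an accented letter into a plain one.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Raw UTF-8 bytes for "cafe" with an e acute (the e acute is the two bytes C3 A9).
my $utf8_bytes = "caf\xC3\xA9";

# decode: bytes in a known encoding -> Perl character string.
my $chars = decode('UTF-8', $utf8_bytes);

# encode: Perl character string -> bytes in the target encoding.
my $latin1_bytes = encode('ISO-8859-1', $chars);

# from_to: convert a byte string in place from one encoding to another.
my $in_place = $utf8_bytes;
from_to($in_place, 'UTF-8', 'ISO-8859-1');

# Prints "4 chars, 4 latin-1 bytes"
printf "%d chars, %d latin-1 bytes\n", length($chars), length($latin1_bytes);
```

The input had five bytes; after decoding it is four characters, and in Latin-1 each of those characters is a single byte.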
Re: Removing Foreign Characters by kutsu (Priest) on Jan 27, 2005 at 15:40 UTC
If all you're looking for is to convert UTF-8 to Latin-1, then check out Converting character encodings and the Encode module. I have only used Encode to change shift-jis to UTF-8 and back... I don't know how it will work for changing UTF-8 to Latin-1 for characters such as 苦痛, or kutsuu.

"Cogito cogito ergo cogito sum - I think that I think, therefore I think that I am." Ambrose Bierce
Re: Removing Foreign Characters by g0n (Priest) on Jan 27, 2005 at 16:25 UTC
If the idea is to convert the e acute, e grave, u umlaut etc to unaccented e, unaccented e, unaccented u respectively, I've looked for this before. I ended up with a large hash table in a subroutine - I don't think there's a standard module to do it.

VGhpcyBtZXNzYWdlIGludGVudGlvbmFsbHkgcG9pbnRsZXNz
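The hash-table approach described above might look roughly like the sketch below. This is not g0n's actual code, which was never posted in the thread: the table `%plain_for` and the subroutine name `strip_accents` are hypothetical, and the table is deliberately tiny. It assumes the text has already been decoded into Perl characters.

```perl
use strict;
use warnings;

# Hypothetical, deliberately incomplete lookup table.
# Keys are written as code points so the source file encoding does not matter.
my %plain_for = (
    "\x{E0}" => 'a',   # a grave
    "\x{E1}" => 'a',   # a acute
    "\x{E7}" => 'c',   # c cedilla
    "\x{E8}" => 'e',   # e grave
    "\x{E9}" => 'e',   # e acute
    "\x{F1}" => 'n',   # n tilde
    "\x{F6}" => 'o',   # o umlaut
    "\x{FC}" => 'u',   # u umlaut
);

# Build one character class out of all the keys.
my $accented = join '', keys %plain_for;

sub strip_accents {
    my ($text) = @_;
    # Replace every character found in the table with its plain equivalent.
    $text =~ s/([$accented])/$plain_for{$1}/g;
    return $text;
}

# Prints "deja vu, uber"
print strip_accents("d\x{E9}j\x{E0} vu, \x{FC}ber"), "\n";
```

Characters missing from the table pass through unchanged, so a real table would need an entry for every accented letter expected in the data.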
Re^2: Removing Foreign Characters by existem (Sexton) on Jan 27, 2005 at 16:49 UTC
This is exactly what I'm trying to do. At the moment I just have funny characters, when all I want is English... Perhaps I will have to just translate them manually as well; all the other Encode stuff seems very confusing!
Re^3: Removing Foreign Characters by graff (Chancellor) on Jan 28, 2005 at 06:17 UTC
Here's a little script I cooked up not long ago to "deaccent" letters -- you need Perl version 5.8.0 or later to run it, and it assumes that your input text (from STDIN or file(s) named on the command line) is in utf-8:

```
#!/usr/bin/perl -CDS
use strict;
require 5.008;

# Pull the "LATIN ... LETTER" entries out of Perl's Unicode name table.
my @charnames = grep /\tLATIN \S+ LETTER/,
                split( /^/, do 'unicore/Name.pl' );

# For each base letter, collect all of its accented variants into one string.
my %accents;
for my $c ( split //, qq/AEIOUCNYaeioucny/ ) {
    my $case = ( $c eq lc $c ) ? 'SMALL' : 'CAPITAL';
    $accents{$c} = join( '',
        map  { chr hex( substr $_, 0, 4 ) }
        grep /\tLATIN $case LETTER \U$c WITH/, @charnames );
}

# now use each element of %accents as a character class:
while (<>) {
    for my $c ( keys %accents ) {
        s/[$accents{$c}]/$c/g;
    }
    print;
}
```

If your original text is not utf8, well, you have to know what the encoding really is; then you can either find a way to convert to utf8 (e.g. there's an "iconv" tool on many systems, or you can use the Encode module in perl, which isn't that tough, really), OR you can hard-code all those conversions by hand instead of using the script shown above.

Based on one of your replies, you would be happy with converting the accented characters to symbolic entity references (&aacute; and so on). I think your hard-coded hash is as good a solution as any for that, so long as the encoding you used to write the perl code matches the encoding of your text data.
Re^3: Removing Foreign Characters by g0n (Priest) on Jan 27, 2005 at 17:17 UTC
The other solutions are to solve character encoding issues: you can have different binary sequences to mean the same character. For example, e acute might be one binary sequence in latin1, and a different binary sequence in UTF8 (and is, in fact).

The problem with what you are trying to do is that it is not translating between different representations of the same character (what people immediately think of) - you want to translate one character (e acute) into a totally different one (e no acute).

I have some code to do this, but sadly not with me. I could post or mail it at the weekend.

c.

VGhpcyBtZXNzYWdlIGludGVudGlvbmFsbHkgcG9pbnRsZXNz
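To make the "different binary sequences" point concrete, this small sketch encodes the same character both ways and prints the bytes as hex. It uses only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $e_acute = "\x{E9}";   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# unpack 'H*' renders a byte string as hex digits.
my $latin1_hex = unpack 'H*', encode('ISO-8859-1', $e_acute);
my $utf8_hex   = unpack 'H*', encode('UTF-8',      $e_acute);

print "Latin-1: $latin1_hex\n";   # prints "Latin-1: e9"   (one byte)
print "UTF-8:   $utf8_hex\n";     # prints "UTF-8:   c3a9" (two bytes)
```

Both byte sequences still mean e acute. Getting a plain "e" out of either one is a separate substitution step, which is what the lookup-table approach does.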
Re^4: Removing Foreign Characters by existem (Sexton) on Jan 27, 2005 at 18:05 UTC
Re^4: Removing Foreign Characters by g0n (Priest) on Jan 28, 2005 at 10:43 UTC
Re: Removing Foreign Characters by g0n (Priest) on Apr 15, 2005 at 12:38 UTC
Never let it be said that the monastery is not responsive to the needs of the community, however slowly: Text::StripAccents

g0n, backpropagated monk

