
Unicode::Map8 (you need Unicode::String too) also does conversions, and it doesn't rely on iconv. This means it is probably more portable, but likely slower, than Text::Iconv. I usually use Text::Iconv.

You might find "converting character encodings" useful: it shows various methods to convert UTF-8 characters to latin1.

Here is a version that does not use XML::Parser (adapting it to other encodings is left as a(n easy) exercise for the reader ;--):

#!/bin/perl -w
# converts XML data from UTF-8 back into latin1
# -r uses a regexp
# -u uses Unicode::String and Unicode::Map8
# -i uses Text::Iconv (and the iconv library)

# Note: -r does not work properly with XML::Parser 2.30

use strict;

my $filter;
# avoid an "uninitialized value" warning under -w when no option is given
my $opt= defined $ARGV[0] ? $ARGV[0] : '';

if(    $opt eq '-r') { $filter= \&latin1;                   }
elsif( $opt eq '-u') { $filter= unicode_convert( 'latin1'); }
elsif( $opt eq '-i') { $filter= iconv_convert( 'latin1');   }
else { die "usage: $0 [-r|-u|-i]\n"; }

my $text= <DATA>;
chomp $text; 
print "$text => ", $filter->( $text), "\n"; 

# shamelessly lifted from XML::TyePYX
sub latin1 
  { my $text=shift;
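    # a lead byte \xc0-\xc3 followed by one continuation byte (\x80-\xbf)
    # encodes a character below 0x100: keep 2 bits of the first, 6 of the second
    $text=~s{([\xc0-\xc3])([\x80-\xbf])}{ my $hi = ord($1);
                                          my $lo = ord($2);
                                          chr((($hi & 0x03) <<6) | ($lo & 0x3F))
                                        }ge;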
    return $text;
  }

sub unicode_convert
  { my $enc= shift;
    require Unicode::Map8;
    require Unicode::String;
    import Unicode::String qw(utf8);
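    # string eval: utf8() is only imported at run time, so compile the sub late
    my $sub= eval q{
        { my $cnv;
          sub { $cnv ||= new Unicode::Map8 ($enc)
                  or die "Can't create converter";
                return $cnv->to8 (utf8($_[0])->ucs2);
              }
        } };
    die "can't build the Unicode::Map8 filter: $@" unless $sub;
    return $sub;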
  }

sub iconv_convert
  { my $enc= shift;
    require Text::Iconv;
    my $sub= eval q{
        { my $cnv;
          sub { $cnv ||= new Text::Iconv( 'utf8', $enc)
                  or die "Can't create converter";
                return $cnv->convert( $_[0]);
              }
        } };
    die "can't build the Text::Iconv filter: $@" unless $sub;
    return $sub;
  }

__DATA__
texte soupçonné d'être plein de caractères accentués
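
For comparison, here is a small sketch of a fourth route that the script above does not use: the Encode module. This assumes perl 5.8 or later, where Encode ships with perl; the variable names are only for illustration. It also redoes the bit arithmetic of the -r filter by hand on a single character, so you can see that both give the same latin1 byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);   # assumption: perl 5.8+, Encode in core

# the UTF-8 encoding of e-acute (U+00E9) is the two bytes 0xC3 0xA9
my $utf8_bytes= "\xc3\xa9";

# Encode route: UTF-8 bytes -> perl characters -> latin1 bytes
my $via_encode= encode( 'iso-8859-1', decode( 'UTF-8', $utf8_bytes));

# the arithmetic of the -r filter, by hand:
# low 2 bits of the lead byte, then low 6 bits of the continuation byte
my $via_bits= chr( ((0xc3 & 0x03) << 6) | (0xa9 & 0x3f));

printf "Encode: 0x%02x, bit arithmetic: 0x%02x\n",
       ord( $via_encode), ord( $via_bits);
```

Both values come out as 0xe9, the latin1 code for e-acute. Unlike the regexp, Encode also copes with characters that have no latin1 equivalent (by default it substitutes a '?'), so it is probably the safer choice when the input is not known to fit in latin1.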

In reply to Re: Re: Re: Unicode and locales by mirod
in thread Unicode and locales by moxliukas
