Re: Unicode and locales

Unfortunately, Encode needs Perl 5.7.3 or later (at least that's what perl -MCPAN -e 'install Encode' told me).

I will try Text::Iconv and see if that works.

Oh, and I have found the module Unicode::Map8. After reading the docs I am still not sure if it can be relevant to what I am doing. Can anyone enlighten me?

I guess I'll have to upgrade to 5.8 on the FreeBSD machine. It is probably high time to do it anyway ;)


Unicode::Map8 (you need Unicode::String too) also does conversions, and neither module relies on iconv. This means that they are probably more portable, but likely slower than Text::Iconv. I usually use Text::Iconv.

You might find "converting character encodings" useful; it shows various methods to convert UTF-8 characters to Latin-1.

Here is a version that does not use XML::Parser (adapting it to other encodings is left as a(n easy) exercise for the reader ;--):

#!/bin/perl -w
# converts XML data from UTF-8 back into latin1
# -r uses a regexp
# -u uses Unicode::String and Unicode::Map8
# -i uses Text::Iconv (and the iconv library)

# Note: -r does not work properly with XML::Parser 2.30

use strict;

my $filter;

# default to '' so that -w does not warn when no option is given
my $opt= defined $ARGV[0] ? $ARGV[0] : '';

if(    $opt eq '-r') { $filter= \&latin1;                   }
elsif( $opt eq '-u') { $filter= unicode_convert( 'latin1'); }
elsif( $opt eq '-i') { $filter= iconv_convert( 'latin1');   }
else { die "usage: $0 [-r|-u|-i]\n"; }

my $text= <DATA>;
chomp $text; 
print "$text => ", $filter->( $text), "\n"; 

# shamelessly lifted from XML::TyePYX
sub latin1 
  { my $text=shift;
    $text=~s{([\xc0-\xc3])(.)}{ my $hi = ord($1);
                                my $lo = ord($2);
                                chr((($hi & 0x03) <<6) | ($lo & 0x3F))
                              }ge;
    return $text;
  }

sub unicode_convert
  { my $enc= shift;
    require Unicode::Map8;
    require Unicode::String;
    import Unicode::String qw(utf8);
    my $sub= eval q{
            { my $cnv;
          sub { $cnv ||= new Unicode::Map8 ($enc) 
                  or die "Can't create converter";
            return  $cnv->to8 (utf8($_[0])->ucs2); 
              } 
        } };
    return $sub;
  }

sub iconv_convert
  { my $enc= shift;
    require Text::Iconv;
    my $sub= eval q{
            { my $cnv;
          sub { $cnv ||= new Text::Iconv( 'utf8', $enc) 
                  or die "Can't create converter";
            return  $cnv->convert( $_[0]); 
              } 
        } };
    return $sub;
  }

__DATA__
texte soupçonné d'être plein de caractères accentués
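For anyone wondering why the regexp in latin1() works, here is a small sketch of the arithmetic for a single character. A 2-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy; when the lead byte is between 0xC0 and 0xC3, only its low two bits carry data, so the whole code point fits in one Latin-1 byte. The helper name two_bytes_to_latin1 is made up for this illustration and is not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one 2-byte UTF-8 sequence whose lead byte
# is in the range 0xC0-0xC3 into a single Latin-1 byte.
sub two_bytes_to_latin1 {
    my ($hi, $lo) = map { ord } @_;
    # keep the low 2 bits of the lead byte and the low 6 bits of the
    # continuation byte, then glue them together
    return chr( (($hi & 0x03) << 6) | ($lo & 0x3F) );
}

# e-acute is the two bytes 0xC3 0xA9 in UTF-8
my $byte = two_bytes_to_latin1("\xC3", "\xA9");
printf "0x%02X\n", ord $byte;    # prints 0xE9
```

Sequences with a lead byte above 0xC3 encode characters outside Latin-1, and the regexp leaves them untouched.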


I think it's time for a benchmark here:

Using Perl 5.8.0 on Linux (Mandrake 9.0), on a rather fast machine (dual-processor Athlon 1.8):

#!/bin/perl -w

use strict;
use Benchmark( 'cmpthese');

use Encode;
use Text::Iconv;
use Unicode::Map8;
use Unicode::String qw(utf8);

use utf8;

my $enc= 'latin1';

my $convert_iconv   = Text::Iconv->new( 'utf8', $enc);
my $convert_unicode = Unicode::Map8->new ($enc);

my $text= <DATA>;
chomp $text; 


# let's just check the output!
print "Encode        : ", encode("iso-8859-1", $text), "\n";
print "Text::Iconv   : ", $convert_iconv->convert(   $text), "\n";
print "Unicode::Map8 : ", $convert_unicode->to8 (utf8($text)->ucs2), "\n";
print "regexp        : ", latin1(  $text), "\n";

# now benchmark
cmpthese( 500000, {
               'Encode'        => sub { encode("iso-8859-1", $text);               },
               'Text::Iconv'   => sub { $convert_iconv->convert( $text);           },
               'Unicode::Map8' => sub { $convert_unicode->to8 (utf8($text)->ucs2); },
               'regexp'        => sub { latin1( $text);                            },
           });


sub latin1 
  { my $text=shift;
    $text=~s{([\xc0-\xc3])(.)}{ my $hi = ord($1);
                                my $lo = ord($2);
                                chr((($hi & 0x03) <<6) | ($lo & 0x3F))
                              }ge;
    return $text;
  }


__DATA__
texte soupçonné d'être plein de caractères accentués

Results:

Encode        : texte soupçonné d'être plein de caractères accentués
Text::Iconv   : texte soupçonné d'être plein de caractères accentués
Unicode::Map8 : texte soupçonné d'être plein de caractères accentués
regexp        : texte soupçonné d'être plein de caractères accentués

Benchmark: timing 500000 iterations of Encode, Text::Iconv, Unicode::Map8, regexp...
Encode:         6 wallclock secs ( 4.91 usr +  0.02 sys =  4.93 CPU) @ 101419.88/s (n=500000)
Text::Iconv:    2 wallclock secs ( 2.20 usr +  0.00 sys =  2.20 CPU) @ 227272.73/s (n=500000)
Unicode::Map8:  7 wallclock secs ( 7.66 usr +  0.00 sys =  7.66 CPU) @ 65274.15/s (n=500000)
regexp:         6 wallclock secs ( 5.65 usr +  0.01 sys =  5.66 CPU) @ 88339.22/s (n=500000)

               Rate    Unicode::Map8        regexp        Encode   Text::Iconv
Unicode::Map8  65274/s            --          -26%          -36%          -71%
regexp         88339/s           35%            --          -13%          -61%
Encode        101420/s           55%           15%            --          -55%
Text::Iconv   227273/s          248%          157%          124%            --

Note: I am not an expert in using Benchmark, so please let me know if my test is flawed.
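One way to make a comparison like this a little harder to get wrong is to check that every candidate returns the same bytes before timing anything, and to pass a negative count so that Benchmark runs each sub for a minimum number of CPU seconds instead of a fixed number of iterations. The following is only a sketch under some assumptions: it uses core modules only (Encode and Benchmark), it writes the input as escaped UTF-8 bytes, and its Encode candidate decodes explicitly before encoding, which is not what the script above does.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Encode qw(decode encode);

# UTF-8 bytes for "soupçonné", written as escapes so the source stays ASCII
my $utf8_bytes = "soup\xC3\xA7onn\xC3\xA9";

my %candidates = (
    Encode => sub { encode('iso-8859-1', decode('UTF-8', $utf8_bytes)) },
    regexp => sub {
        my $text = $utf8_bytes;
        $text =~ s{([\xc0-\xc3])(.)}
                  { chr(((ord($1) & 0x03) << 6) | (ord($2) & 0x3F)) }ge;
        return $text;
    },
);

# Refuse to time anything unless every candidate returns the same bytes
my %seen = map { $_->() => 1 } values %candidates;
die "candidates disagree\n" if scalar(keys %seen) != 1;

# A negative count means "run each sub for at least that many CPU seconds"
cmpthese( -1, \%candidates );
```

The rates printed will of course differ from machine to machine; the point is only that the sanity check runs before the timing does.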
