PerlMonks  

Re: Unicode and locales

by mirod (Canon)
on Nov 11, 2002 at 12:28 UTC (id://211903)


in reply to Unicode and locales

With 5.8 you will want to use Encode, which comes with the core. I think it has been backported to 5.6, but I am not quite sure. In any case you can also use Text::Iconv if you have the iconv library (which is also available for Windows).

I see that cp1257 and of course iso8859-13 are supported so you should be OK.
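
As a rough illustration of the Encode route, here is a minimal sketch that assumes perl 5.8 with the core Encode module. The sample byte and the variable names are made up for the example; they are not from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the Lithuanian letter a-ogonek (U+0105),
# stored as the single byte 0xE0 in a cp1257 string.
my $cp1257_bytes = "\xE0";

# Step 1: bytes in a known encoding -> Perl character string.
my $chars = decode('cp1257', $cp1257_bytes);

# Step 2: character string -> bytes in the target encodings.
my $utf8_bytes    = encode('UTF-8',       $chars);
my $iso_885913    = encode('iso-8859-13', $chars);

printf "code point : U+%04X\n", ord $chars;
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
printf "8859-13    : %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $iso_885913;
```

The point of the two-step form is that Encode works on character strings: decode the input from whatever it really is first, then encode to whatever you need.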

Re: Re: Unicode and locales
by moxliukas (Curate) on Nov 12, 2002 at 09:35 UTC

    Unfortunately Encode needs perl version 5.7.3 (at least that's what perl -MCPAN -e 'install Encode' told me)

    I will try Text::Iconv and see if that works.

    Oh, and I have found the module Unicode::Map8. After reading the docs I am still not sure if it can be relevant to what I am doing. Can anyone enlighten me?

    I guess I'll have to upgrade to 5.8 on the FreeBSD machine. It is probably high time to do it anyway ;)

      Unicode::Map8 (you need Unicode::String too) also does conversions, and it doesn't rely on iconv. This means it is probably more portable, but likely slower than Text::Iconv. I usually use Text::Iconv.

      You might find "converting character encodings" useful: it shows various methods to convert UTF-8 characters to latin1.

      Here is a version that does not use XML::Parser (adapting it to other encodings is left as an easy exercise for the reader ;--):

      #!/bin/perl -w

      # converts XML data from UTF-8 back into latin1
      # -r uses a regexp
      # -u uses Unicode::Map8 and Unicode::String
      # -i uses Text::Iconv (and the iconv library)
      # Note: -r does not work properly with XML::Parser 2.30

      use strict;

      my $filter;

      if(    $ARGV[0] eq '-r') { $filter = \&latin1; }
      elsif( $ARGV[0] eq '-u') { $filter= unicode_convert( 'latin1'); }
      elsif( $ARGV[0] eq '-i') { $filter= iconv_convert( 'latin1'); }
      else                     { die "usage: $0 [-r|-u|-i]"; }

      my $text= <DATA>;
      chomp $text;
      print "$text => ", $filter->( $text), "\n";

      # shamelessly lifted from XML::TyePYX
      # folds each 2-byte UTF-8 sequence into a single latin1 byte
      sub latin1
        { my $text=shift;
          $text=~s{([\xc0-\xc3])(.)}{
              my $hi = ord($1);
              my $lo = ord($2);
              chr((($hi & 0x03) <<6) | ($lo & 0x3F))
            }ge;
          return $text;
        }

      # returns a closure that converts UTF-8 to $enc, creating the
      # Unicode::Map8 converter on first use
      sub unicode_convert
        { my $enc= shift;
          require Unicode::Map8;
          require Unicode::String;
          import Unicode::String qw(utf8);
          my $sub= eval q{
              { my $cnv;
                sub
                  { $cnv ||= new Unicode::Map8 ($enc)
                      or die "Can't create converter";
                    return $cnv->to8 (utf8($_[0])->ucs2);
                  }
              }
            };
          return $sub;
        }

      # returns a closure that converts UTF-8 to $enc, creating the
      # Text::Iconv converter on first use
      sub iconv_convert
        { my $enc= shift;
          require Text::Iconv;
          my $sub= eval q{
              { my $cnv;
                sub
                  { $cnv ||= new Text::Iconv( 'utf8', $enc)
                      or die "Can't create converter";
                    return $cnv->convert( $_[0]);
                  }
              }
            };
          return $sub;
        }

      __DATA__
      texte soupçonné d'être plein de caractères accentués
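
      To make the bit arithmetic in the regexp branch a little more concrete, here is a small worked sketch for one character. It assumes the input is "é", which UTF-8 stores as the two bytes 0xC3 0xA9; the variable names are only for illustration. Note that this trick only covers code points up to U+00FF, i.e. 2-byte sequences whose payload fits in one latin1 byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample: "é" encoded as UTF-8 is the byte pair C3 A9.
my ($hi, $lo) = (0xC3, 0xA9);

# The lead byte carries the top 2 payload bits in its low bits,
# the continuation byte carries the remaining 6 bits.
my $top    = ($hi & 0x03) << 6;   # 0x03 << 6
my $bottom =  $lo & 0x3F;         # low six bits of 0xA9
my $latin1 = $top | $bottom;

printf "top=0x%02X bottom=0x%02X latin1=0x%02X\n", $top, $bottom, $latin1;
```

      The result, 0xE9, is the latin1 code for "é", which is what the substitution in latin1() writes back for each matched pair.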

        I think it's time for a benchmark here:

        Using perl 5.8.0, on Linux (Mandrake 9.0) on a rather fast machine (Athlon dual-processor 1.8):

        #!/bin/perl -w
        use strict;

        use Benchmark( 'cmpthese');

        use Encode;
        use Text::Iconv;
        use Unicode::Map8;
        use Unicode::String qw(utf8);

        use utf8;

        my $enc= 'latin1';

        my $convert_iconv   = Text::Iconv->new( 'utf8', $enc);
        my $convert_unicode = Unicode::Map8->new ($enc);

        my $text= <DATA>;
        chomp $text;

        # lets just check the output!
        print "Encode        : ", encode("iso-8859-1", $text), "\n";
        print "Text::Iconv   : ", $convert_iconv->convert( $text), "\n";
        print "Unicode::Map8 : ", $convert_unicode->to8 (utf8($text)->ucs2), "\n";
        print "regexp        : ", latin1( $text), "\n";

        # now benchmark
        cmpthese( 500000,
                  { 'Encode'        => sub { encode("iso-8859-1", $text); },
                    'Text::Iconv'   => sub { $convert_iconv->convert( $text); },
                    'Unicode::Map8' => sub { $convert_unicode->to8 (utf8($text)->ucs2); },
                    'regexp'        => sub { latin1( $text); },
                  });

        sub latin1
          { my $text=shift;
            $text=~s{([\xc0-\xc3])(.)}{
                my $hi = ord($1);
                my $lo = ord($2);
                chr((($hi & 0x03) <<6) | ($lo & 0x3F))
              }ge;
            return $text;
          }

        __DATA__
        texte soupçonné d'être plein de caractères accentués

        Results:

        Encode        : texte soupçonné d'être plein de caractères accentués
        Text::Iconv   : texte soupçonné d'être plein de caractères accentués
        Unicode::Map8 : texte soupçonné d'être plein de caractères accentués
        regexp        : texte soupçonné d'être plein de caractères accentués
        Benchmark: timing 500000 iterations of Encode, Text::Iconv, Unicode::Map8, regexp...
               Encode:  6 wallclock secs ( 4.91 usr +  0.02 sys =  4.93 CPU) @ 101419.88/s (n=500000)
          Text::Iconv:  2 wallclock secs ( 2.20 usr +  0.00 sys =  2.20 CPU) @ 227272.73/s (n=500000)
        Unicode::Map8:  7 wallclock secs ( 7.66 usr +  0.00 sys =  7.66 CPU) @ 65274.15/s (n=500000)
               regexp:  6 wallclock secs ( 5.65 usr +  0.01 sys =  5.66 CPU) @ 88339.22/s (n=500000)

                          Rate Unicode::Map8   regexp   Encode Text::Iconv
        Unicode::Map8  65274/s            --     -26%     -36%        -71%
        regexp         88339/s           35%       --     -13%        -61%
        Encode        101420/s           55%      15%       --        -55%
        Text::Iconv   227273/s          248%     157%     124%          --
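
        For anyone reading the table for the first time, the figures can be reproduced from the CPU times: as far as I can tell the rate is just iterations divided by CPU seconds, and each percentage is the row rate relative to the column rate. A small sketch with the numbers copied from the run above (the hash names are made up for the example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $n = 500_000;    # iterations passed to cmpthese

# CPU seconds (usr + sys) as reported in the run above.
my %cpu = (
    'Encode'        => 4.93,
    'Text::Iconv'   => 2.20,
    'Unicode::Map8' => 7.66,
    'regexp'        => 5.66,
);

# Rate = iterations per CPU second.
my %rate = map { $_ => $n / $cpu{$_} } keys %cpu;

for my $name (sort { $rate{$a} <=> $rate{$b} } keys %rate) {
    printf "%-14s %10.2f/s\n", $name, $rate{$name};
}

# One cell of the comparison table: row Text::Iconv, column Unicode::Map8.
my $pct = 100 * ($rate{'Text::Iconv'} / $rate{'Unicode::Map8'} - 1);
printf "Text::Iconv vs Unicode::Map8: %.0f%%\n", $pct;
```

        So the 248% in the last row means Text::Iconv handled roughly three and a half times as many conversions per second as Unicode::Map8 in this run.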

        Note: I am not an expert in using Benchmark, so please let me know if my test is flawed.
