Unicode and locales

by moxliukas (Curate)
on Nov 11, 2002 at 11:48 UTC (#211897=perlquestion)

moxliukas has asked for the wisdom of the Perl Monks concerning the following question:

Hello, monks!

As you might already know, I am Lithuanian, and Lithuanians use a non-standard charset (ISO-8859-13 or Windows-1257). I am currently having trouble converting UTF-8 input (which I get from a certain XML-RPC server) to locally displayable text (on the console). The locale that I am using on Windows is called "cp1257", while on FreeBSD I am using "iso8859-13". This is all very well, but I do not even know where to start reading on how to convert UTF-8 data to something locale-specific. perldoc perlunicode says that Unicode charset conversion is still in development (I am using perl 5.6.1 -- I could upgrade to 5.8 on FreeBSD but not on Windows -- 5.8 is still not out there, is it?)

So... where could I start searching for information on this? Have you solved similar problems? Perhaps there are some modules that I could use? What is the general status of Unicode support in Perl?

Thank you in advance

Replies are listed 'Best First'.
Re: Unicode and locales
by mirod (Canon) on Nov 11, 2002 at 12:28 UTC

    With 5.8 you will want to use Encode, which comes with the core. I think it has been backported to 5.6, but I am not quite sure. In any case you can also use Text::Iconv if you have the iconv library (which is also available for Windows).

    I see that cp1257 and of course iso8859-13 are supported so you should be OK.
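A minimal sketch of the Encode route, assuming Perl 5.8 or later where Encode ships with the core. The helper name utf8_to_local and the sample string are made up for illustration; only decode, encode and the FB_CROAK check value come from Encode itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): turn UTF-8 octets into octets
# in a legacy single-byte charset such as iso-8859-13 or cp1257.
sub utf8_to_local {
    my ($octets, $charset) = @_;
    # octets -> Perl characters; die on malformed UTF-8 instead of guessing
    my $chars = decode('UTF-8', $octets, Encode::FB_CROAK);
    # characters -> octets; die if a character has no slot in the target charset
    return encode($charset, $chars, Encode::FB_CROAK);
}

my $utf8  = "\xC4\x85\xC5\xBE";                     # the letters a-ogonek and z-caron as UTF-8 octets
my $local = utf8_to_local($utf8, 'iso-8859-13');
print join(' ', map { sprintf '%02X', ord } split //, $local), "\n";   # prints "E0 FE"
```

Passing 'cp1257' instead of 'iso-8859-13' should cover the Windows side. Dying on unmappable characters is a design choice; dropping the third argument makes Encode substitute a replacement character instead.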

      Unfortunately Encode needs perl version 5.7.3 (at least that's what perl -MCPAN -e 'install Encode' told me)

      I will try Text::Iconv and see if that works.

      Oh, and I have found the module Unicode::Map8. After reading the docs I am still not sure if it can be relevant to what I am doing. Can anyone enlighten me?

      I guess I'll have to upgrade to 5.8 on the FreeBSD machine. It is probably high time to do it anyway ;)

        Unicode::Map8 (you need Unicode::String too) also does conversions, and neither module relies on iconv. This means that they are probably more portable, but likely slower than Text::Iconv. I usually use Text::Iconv.

        You might find converting character encodings useful, it shows you various methods to convert utf8 characters to latin1.

        Here is a version that does not use XML::Parser (adapting it to other encodings is left as a(n easy) exercise for the reader ;--):

        #!/bin/perl -w
        # converts XML data from UTF-8 back into latin1
        #   -r uses a regexp
        #   -u uses Unicode::String
        #   -i uses Text::Iconv (and the iconv library)
        # Note: -r does not work properly with XML::Parser 2.30

        use strict;

        # default to '' so a missing argument does not trigger a warning
        my $opt = defined $ARGV[0] ? $ARGV[0] : '';

        my $filter;
        if(    $opt eq '-r') { $filter = \&latin1; }
        elsif( $opt eq '-u') { $filter = unicode_convert( 'latin1'); }
        elsif( $opt eq '-i') { $filter = iconv_convert( 'latin1'); }
        else                 { die "usage: $0 [-r|-u|-i]\n"; }

        my $text = <DATA>;
        chomp $text;
        print "$text => ", $filter->( $text), "\n";

        # shamelessly lifted from XML::TyePYX
        # only handles 2-byte UTF-8 sequences, i.e. characters that fit in latin1
        sub latin1 {
            my $text = shift;
            $text =~ s{([\xc0-\xc3])(.)}{
                my $hi = ord($1);
                my $lo = ord($2);
                chr((($hi & 0x03) << 6) | ($lo & 0x3F))
            }ge;
            return $text;
        }

        # returns a closure that converts UTF-8 to $enc using Unicode::Map8
        sub unicode_convert {
            my $enc = shift;
            require Unicode::Map8;
            require Unicode::String;
            import Unicode::String qw(utf8);
            my $sub = eval q{
                { my $cnv;
                  sub {
                      $cnv ||= new Unicode::Map8 ($enc) or die "Can't create converter";
                      return $cnv->to8 (utf8($_[0])->ucs2);
                  }
                }
            };
            return $sub;
        }

        # returns a closure that converts UTF-8 to $enc using Text::Iconv
        sub iconv_convert {
            my $enc = shift;
            require Text::Iconv;
            my $sub = eval q{
                { my $cnv;
                  sub {
                      $cnv ||= new Text::Iconv( 'utf8', $enc) or die "Can't create converter";
                      return $cnv->convert( $_[0]);
                  }
                }
            };
            return $sub;
        }

        __DATA__
        texte soupçonné d'être plein de caractères accentués
Re: Unicode and locales
by ph0enix (Friar) on Nov 11, 2002 at 12:49 UTC

    In 5.8 you can use the following code for charset conversion:

    open SRC, '<:encoding(utf-8)', './text.utf' or die "src: $!";
    open DST, '>:encoding(iso-8859-13)', './text.iso' or die "dst: $!";
    @text = <SRC>;
    print DST @text;
    close SRC;
    close DST;
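The same I/O-layer idea can be pointed at the console, which is what the original question asks about. This is only a sketch, assuming Perl 5.8 or later; console_charset and print_utf8_to are hypothetical helpers, and the charset names are simply the ones given in the question. An in-memory filehandle stands in for STDOUT so the result can be inspected.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: pick the output charset by platform.
# The two names are the ones mentioned in the question, not detected from the locale.
sub console_charset {
    my ($os) = @_;
    return $os eq 'MSWin32' ? 'cp1257' : 'iso-8859-13';
}

# Hypothetical helper: decode UTF-8 octets and print them through an
# :encoding() layer, so Perl does the conversion on output.
sub print_utf8_to {
    my ($fh, $utf8_octets, $charset) = @_;
    binmode($fh, ":encoding($charset)") or die "binmode: $!";
    print {$fh} decode('UTF-8', $utf8_octets);
}

# Demonstration against an in-memory handle instead of the real console.
my $buffer = '';
open my $out, '>', \$buffer or die "open: $!";
print_utf8_to($out, "\xC5\xA1", console_charset($^O));   # s-caron as UTF-8 octets
close $out;
printf "%02X\n", ord $buffer;   # prints "F0" (s-caron is 0xF0 in both charsets)
```

For real console output, passing \*STDOUT as the handle would be the obvious substitution.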
Re: Unicode and locales
by BrowserUk (Patriarch) on Nov 11, 2002 at 13:16 UTC

    The ActiveState Beta 1 is now available. It may not be good enough for production purposes yet (you'd have to read the forums), but it would give you a chance to try it out and get your stuff ready for the stable version.


Node Type: perlquestion [id://211897]
Approved by valdez