PerlMonks  

XML converts to UTF-8 and causes problems for non-English language speakers

by htoug (Deacon)
on Jun 19, 2002 at 10:45 UTC ( [id://175617]=perlquestion )

htoug has asked for the wisdom of the Perl Monks concerning the following question:

My problem is whenever I use XML in Perl everything gets converted to UTF-8, which is OK, as that is documented everywhere.

The docs also advise that you use utf to minimize any problems with UTF-8 and Perl.

The authors must have been English-speaking!

My language (Danish) uses three letters that are not uncommon (in Danish, that is): æ, ø and å. In the character set we normally use (ISO-8859-1) each of them takes 8 bits. In UTF-8 each takes 16 bits. Any attempt to use them, e.g. in a comment, results in syntax errors from perl. What's more, any data that we read from a file, or want to output to a file, should be in ISO-8859-1 or it will be gibberish.
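The size difference described above can be checked with the core Encode module. This is only a small sketch: `byte_lengths` is a hypothetical helper name, and the three code points are the Unicode values for æ, ø and å.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many bytes one character needs in each encoding.
sub byte_lengths {
    my ($char) = @_;
    my $latin1 = length encode('ISO-8859-1', $char);
    my $utf8   = length encode('UTF-8', $char);
    return ($latin1, $utf8);
}

# U+00E6 (ae ligature), U+00F8 (o with stroke), U+00E5 (a with ring)
for my $cp (0xE6, 0xF8, 0xE5) {
    my ($latin1, $utf8) = byte_lengths(chr $cp);
    printf "U+%04X: %d byte in ISO-8859-1, %d bytes in UTF-8\n",
        $cp, $latin1, $utf8;
}
```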

The way out has been to avoid use utf; and instead convert from UTF-8 to ISO-8859-1 everywhere data is fetched from XML, preferably by subclassing the XML-modules.

This is clunky, and error-prone.

Presently I sit and wait for Perl 6 with line disciplines, and a new version of the operating system that allows us to switch to UTF-8 (just leaving us with the minor task of converting all our data and files ;-)

Is there a monk out there who has another (preferably better - but I'm not picky) way of doing it?

Replies are listed 'Best First'.
Re: XML converts to UTF-8 and causes problems for non-English language speakers
by grantm (Parson) on Jun 19, 2002 at 12:39 UTC

    If you want to use non-ASCII characters in your Perl source code (e.g. in identifier names or comments) and 'use utf8' (not use utf), then you need to configure your editor to save the file using utf8 encoding. With 'vim', this is the relevant option:

    :set encoding=utf-8

    The problem with the 8 bit codes (ISO-8859-1, CP1252, etc) is that most file formats other than XML don't allow you to specify which 8 bit code you've used. In theory, if we all switch to utf and stop using 8 bit codes, life will be easier, although expect pain during the switch.

    If you don't say 'use utf8' in Perl, then it won't complain about non-ASCII characters in your source code, but it won't let you use them everywhere you might want. If you do say 'use utf8', then it will all work, as long as your source file actually is utf8 (à la the vim option above).
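A small sketch of the difference 'use utf8' makes to string literals. It assumes the script itself is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;
use utf8;    # the rest of this file is read as UTF-8 text

# Under 'use utf8' the literal is decoded, so it holds three characters.
my $chars = length("æøå");

my $bytes;
{
    no utf8;    # inside this block the same literal is read as raw bytes
    $bytes = length("æøå");
}

print "with use utf8: $chars, without: $bytes\n";
```

With the file saved as UTF-8, the first length counts characters (3) and the second counts the underlying bytes (6).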

Re: XML converts to UTF-8 and causes problems for non-English language speakers
by mirod (Canon) on Jun 19, 2002 at 13:20 UTC

    For your data the best way is probably to do all your internal processing in utf-8, using the data you get from XML::Parser or any other module, and then to convert it, using Text::Iconv (or Encode with perl 5.8) on output.

    perl 5.8 should really help for this kind of problem: regexps and hash keys work with utf-8 and the Encode module, included in the core, handles conversion from Perl's internal format to whatever encoding you need.

    And of course you can use XML::Twig with the keep_encoding option set for the twig.
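The "process in utf-8, convert on output" approach from this reply might look roughly like the following with Encode on perl 5.8 or later. It is a sketch under assumptions: `utf8_to_latin1` is a hypothetical helper, and the sample string stands in for UTF-8 bytes handed back by an XML module. If your parser already returns decoded character strings, skip the `decode` step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take UTF-8 bytes, return ISO-8859-1 bytes.
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $text = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    # ... do all internal processing on $text here ...
    return encode('ISO-8859-1', $text);        # characters -> Latin-1 bytes
}

# Stand-in for parser output: "blåbærgrød" as UTF-8 bytes.
my $from_xml = "bl\xC3\xA5b\xC3\xA6rgr\xC3\xB8d";
my $latin1   = utf8_to_latin1($from_xml);

printf "%d UTF-8 bytes in, %d ISO-8859-1 bytes out\n",
    length($from_xml), length($latin1);
```

Doing the conversion in one place at the output boundary avoids having to subclass the XML modules or convert at every point where data is fetched.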

Re: XML converts to UTF-8 and causes problems for non-English language speakers
by BrowserUk (Patriarch) on Jun 19, 2002 at 14:02 UTC

    I was having similar problems with processing an XML document containing non-utf8 chars (£ ¥ €) throwing errors whilst parsing with XML::Simple::XMLin only yesterday.

    Thanks to one of the kind monks here (mirod), I added 'encoding="ISO-8859-1"' to the ?xml line at the top of my xml files as follows:

    <?xml version="1.0" encoding="ISO-8859-1"?>

    And this made the parsing errors "go away" - I have not progressed this far enough to know if this is a complete solution yet (I'm new to Perl and XML) but if you haven't already tried this, it would be worth a go.
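The declaration only helps if the bytes in the file really are in the declared encoding. As a sketch of one way to keep the two in step when generating such a file from Perl (5.8 or later), a PerlIO encoding layer can write the ISO-8859-1 bytes; `write_latin1_xml` is a hypothetical helper and the element name is made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write a character string as an ISO-8859-1 file
# whose XML declaration names that same encoding.
sub write_latin1_xml {
    my ($path, $body) = @_;
    open my $out, '>:encoding(ISO-8859-1)', $path
        or die "Cannot write $path: $!";
    print {$out} qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}, $body, "\n";
    close $out or die "Cannot close $path: $!";
    return $path;
}

# Temporary file for the demonstration; removed when the script exits.
my (undef, $path) = tempfile(SUFFIX => '.xml', UNLINK => 1);

# \x{A3} is the pound sign, which ISO-8859-1 stores as the single byte 0xA3.
write_latin1_xml($path, qq{<price currency="GBP">\x{A3}10</price>});
```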

      Thank you! You saved my day!
Re: XML converts to UTF-8 and causes problems for non-English language speakers
by Matts (Deacon) on Jun 19, 2002 at 14:57 UTC
    If you use XML::LibXML, you'll find it a lot more flexible with different character sets, allowing you to append nodes in ISO-8859-1, as long as you specify that's the encoding you're using.
Re: XML converts to UTF-8 and causes problems for non-English language speakers
by demerphq (Chancellor) on Jun 19, 2002 at 11:05 UTC
    Have you asked on perl5-porters? Considering the nationalities of a number of the main contributors there I imagine they may have some good suggestions.

    Yves / DeMerphq
    ---
    Writing a good benchmark isnt as easy as it might look.
