http://qs321.pair.com?node_id=868432


in reply to Chicanery Needed to Handle Unicode Text on Microsoft Windows

He Googles for help

utf16le site:perlmonks.org
UTF-16 on WinXP written by Perl shows whitespaces.
crlf mess in unicode utf-16le

Can someone explain how this sequence of PerlIO layers works?

See PerlIO

Why must so many layers be used?

Because of the defaults, see PerlIO

Can these layers be specified using the open pragma? If so, how? If not, why not?

This should work

use open qw' IO :raw:perlio:encoding(UTF-16LE):crlf ';
but apparently the open pragma is broken and doesn't accept the same layer strings as binmode/open
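For comparison, here is a minimal sketch of the same layer string passed straight to open() instead of going through the pragma. The helper names (write_utf16le_crlf, slurp_raw) and the temp file are made up for illustration, and behaviour may differ between Perl versions and platforms, so treat it as something to try rather than a guaranteed fix.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Layer string copied from the post; here it goes to open() rather than
# to the open pragma.
my $layers = ':raw:perlio:encoding(UTF-16LE):crlf';

# Hypothetical helper: write $text to $path as UTF-16LE with CRLF line ends.
sub write_utf16le_crlf {
    my ($path, $text) = @_;
    open my $out, ">$layers", $path or die "Cannot open $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: return the file's bytes with no layers applied.
sub slurp_raw {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

my (undef, $path) = tempfile(UNLINK => 1);
write_utf16le_crlf($path, "hi\n");
my $bytes = slurp_raw($path);

# Hex dump of what landed on disk, to check for stray or missing bytes.
print join(' ', map { sprintf('%02X', ord($_)) } split //, $bytes), "\n";
```

The hex dump is the quickest way to see whether the line ending was written as a proper two-byte-per-character CR LF pair or got mangled by a layer in the wrong position.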

And why has this ancient Perl bug still not been fixed in 5.12.2?

I'm not a perl5-porter so I'm not sure, but it doesn't look like a bug exactly, and nobody has come up with a better way or reported a bug (that I could find).

It seems there's no way to generate a UTF-16 file in little-endian byte order directly. To generate such a file, you have to specify the UTF-16LE CES (which is wrong) and add the byte order mark explicitly to make it UTF-16 instead of UTF-16LE.

Maybe try :encoding(UTF-16LE):via(File::BOM)
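Without File::BOM (which is a CPAN module, not core), the "add the byte order mark explicitly" approach from the quote can be sketched roughly like this. The helper name write_utf16_with_bom and the sample text are assumptions for illustration. The :crlf layer is left out here to keep the example about the BOM only.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as little-endian UTF-16 with a leading BOM.
sub write_utf16_with_bom {
    my ($path, $text) = @_;
    open my $out, '>:raw:encoding(UTF-16LE)', $path
        or die "Cannot open $path: $!";
    # U+FEFF encoded as UTF-16LE is the byte pair FF FE, i.e. the BOM.
    print {$out} "\x{FEFF}", $text;
    close $out or die "Cannot close $path: $!";
    return;
}

my (undef, $bom_path) = tempfile(UNLINK => 1);
write_utf16_with_bom($bom_path, "caf\x{E9}");

# Read the file back as plain bytes to inspect what was written.
open my $in, '<:raw', $bom_path or die "Cannot open $bom_path: $!";
my $raw = do { local $/; <$in> };
close $in;

my $has_bom = substr($raw, 0, 2) eq "\xFF\xFE";

# Plain "UTF-16" reads the BOM to pick the byte order, then drops it.
my $decoded = decode('UTF-16', $raw);

print $has_bom ? "BOM present\n" : "BOM missing\n";
```

Decoding with the plain UTF-16 name is only a check that a BOM-aware reader accepts the file; the writing side still has to name UTF-16LE, which is the awkward part the quote complains about.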

Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by brxnd (Initiate) on Sep 26, 2012 at 03:07 UTC

    This thread is refreshing to read!!! As a Windows user that is somewhat new to Perl, I spent the past few hours trying to figure out why one of my supplied 193 xml files would keep outputting as a bunch of Chinese (?) characters. Jim described exactly what I kept trying.

    I finished my script. Everything else works - it does all my replaces beautifully. I have maybe spent 8 hours total on my script and it will save me about 3 days of work.

    But, for now, I have to go to that specific XML file, open it in Notepad, and save it as 'ANSI' instead of 'Unicode' before my script will work right.

    I have tried adding the use ' $string' supplied in this thread, but I get this error:

    Unknown PerlIO layer 'raw:perlio:encoding(UTF-16LE):crlf:utf8'

    I really would like to create re-usable code out of my script, but I have yet to find the answer.

      I have tried adding the use ' $string' supplied in this thread, but I get this error:

      Which perl version do you have?

      open it in Notepad, and save it as 'ANSI' instead of 'Unicode' before my script will work right.

      You probably shouldn't do that :) Save as UTF-8 instead

      iconv -f UTF-16 -t UTF-8 < in > out

      piconv -f UTF-16LE -t UTF-8 < in > out
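      If you'd rather do the conversion inside the script than shell out, something along these lines should be roughly equivalent to the iconv command above. It is only a sketch: utf16_file_to_utf8 is a made-up helper name, the demo input is built by hand, and it assumes the file starts with a BOM (which is what Notepad's 'Unicode' writes).

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical helper, roughly "iconv -f UTF-16 -t UTF-8 < in > out".
sub utf16_file_to_utf8 {
    my ($in_path, $out_path) = @_;

    open my $in, '<:raw', $in_path or die "Cannot open $in_path: $!";
    my $octets = do { local $/; <$in> };
    close $in;

    # "UTF-16" expects a BOM and uses it to choose the byte order;
    # FB_CROAK makes malformed input die instead of being replaced.
    my $text = decode('UTF-16', $octets, Encode::FB_CROAK);

    open my $out, '>:raw', $out_path or die "Cannot open $out_path: $!";
    print {$out} encode('UTF-8', $text);
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Demo input: BOM (FF FE) followed by "h", e-acute, CR, LF in UTF-16LE.
my (undef, $src) = tempfile(UNLINK => 1);
my (undef, $dst) = tempfile(UNLINK => 1);
open my $seed, '>:raw', $src or die "Cannot open $src: $!";
print {$seed} "\xFF\xFE" . "h\x00\xE9\x00\r\x00\n\x00";
close $seed or die "Cannot close $src: $!";

utf16_file_to_utf8($src, $dst);

open my $check, '<:raw', $dst or die "Cannot open $dst: $!";
my $utf8 = do { local $/; <$check> };
close $check;

print length($utf8), " bytes of UTF-8 written\n";
```

      Both files are handled with :raw, so line endings pass through unchanged (CRLF stays CRLF) and the BOM is dropped from the output.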