Hi everyone,

I need to read/write UCS-2 Unicode files on Windows. I thought that specifying the appropriate PerlIO layer with open or binmode would suffice. However, this naive approach doesn't seem to work.

For example, trying to write a UCS-2LE file like this (which is supposed to create two lines, each containing the Unicode character U+8765):

    my $filename = "test.ucs2le";
    open my $fh, ">:encoding(ucs-2le)", $filename
        or die "Cannot open $filename for writing: $!";
    print $fh "\x{feff}\x{8765}\n\x{8765}\n";
    close $fh;

produces an incorrectly encoded file on Windows (it works fine on Unix). The output file displays as garbage in Unicode-capable editors like Notepad, and produces "UCS-2LE:Partial character ..." warnings when you try to read the file back in from Perl.

Inspecting the hex dump of the file (e.g. with "od -tx1 -An test.ucs2le" on unix/cygwin),

ff fe 65 87 0d 0a 00 65 87 0d 0a 00

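shows that each newline character \n (0a in hex) has been replaced by \r\n (0d 0a in hex). That is more or less expected, except that with the 2-byte-wide UCS-2 encoding, 000a should have been turned into 000d 000a (on disk, 0a 00 into 0d 00 0a 00). Instead, a lone 0d byte has been inserted in front of each 0a, which shifts every following 2-byte unit by one byte. IOW, the proper UCS-2LE encoding would have been: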

ff fe 65 87 0d 00 0a 00 65 87 0d 00 0a 00
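The two dumps can be reproduced without touching any file, which makes the ordering problem easy to see on any platform. This is only a sketch using the Encode module: the byte-level substitution stands in for what the crlf layer does when it sits below the encoding layer, and hex_dump is a helper name made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same text as in the example above: BOM, U+8765, newline, U+8765, newline.
my $text = "\x{feff}\x{8765}\n\x{8765}\n";

# Hypothetical helper: format a byte string like "od -tx1" does.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

# Wrong order: encode to UCS-2LE first, then translate \n -> \r\n on bytes.
my $wrong = encode('UCS-2LE', $text);
$wrong =~ s/\x0a/\x0d\x0a/g;

# Right order: translate \n -> \r\n on characters, then encode.
(my $crlf_text = $text) =~ s/\n/\r\n/g;
my $right = encode('UCS-2LE', $crlf_text);

print hex_dump($wrong), "\n";   # ff fe 65 87 0d 0a 00 65 87 0d 0a 00
print hex_dump($right), "\n";   # ff fe 65 87 0d 00 0a 00 65 87 0d 00 0a 00
```

The first line matches the broken file, the second the expected one.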

Looking at the PerlIO layer stack that is in effect when specifying :encoding(ucs-2le) reveals that the crlf layer (the Windows-specific default) is being applied after the UCS-2LE layer has turned characters into 2-byte values:

    my $filename = "test.ucs2le";
    open my $fh, ">:encoding(ucs-2le)", $filename or die;
    my @layers = PerlIO::get_layers($fh);
    print "@layers\n";

This prints:

unix crlf encoding(UCS-2LE) utf8

(Note that, when writing, layers are being applied from right-to-left, while when reading, they're being applied from left-to-right. IOW, the left hand side of the layer stack as shown corresponds to the external side (file), and the right hand side is the Perl-internal data representation.)

Trying to find a workaround, I've been fiddling with this for quite a while. Finally, I came up with the following layer stack, which seems to do the trick:

    my $filename = "test.ucs2le";
    open my $fh, ">:raw:encoding(ucs-2le):crlf:utf8", $filename
        or die "Cannot open $filename for writing: $!";
    print $fh "\x{feff}\x{8765}\n\x{8765}\n";
    close $fh;

The :raw:encoding(ucs-2le):crlf:utf8 results in the following layers:

unix encoding(UCS-2LE) utf8 crlf utf8

:raw removes the initial default crlf layer, :encoding(ucs-2le) adds the desired UCS-2 layer plus an automatically appended utf8, :crlf puts the crlf layer in its proper position (such that it is applied before the conversion to 2-byte values happens), and the final :utf8 adds another utf8 layer. The latter is required because the crlf layer apparently removes the UTF8-ness, without which Unicode data would not be handled properly.

Although the duplicated utf8 layer doesn't seem to cause any problems, I'm not entirely sure that it is always free of side effects. (I haven't found a way to get rid of the first utf8. Trying :pop to remove it is futile, as this also pops encoding(ucs-2le).)

The same layers are needed for reading UCS-2 data, of course. In this case, the crlf conversion (i.e. \r\n --> \n) has to work on single-byte values, i.e. after the data has passed through the UCS-2 layer. Otherwise, the crlf layer would not detect the \r\n sequences (on disk they are 0d 00 0a 00, not 0d 0a), and we'd be left with an extraneous \r char at the end of every line (in which case chomp, with its default $/="\n", would no longer work as intended; and all kinds of other potential problems...).
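A round trip with the same layer string in both directions might look like the sketch below. The temporary file from File::Temp and the variable names are assumptions made for the illustration; the layer string is the one from the workaround above. Note that the UCS-2LE encoding leaves the BOM in the data as an ordinary U+FEFF character, so the sketch strips it by hand.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Layer stack from the workaround above, shared by writer and reader.
my $layers = ':raw:encoding(ucs-2le):crlf:utf8';

# Scratch file for the illustration (any path would do).
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write a BOM followed by two lines.
open my $out, ">$layers", $filename
    or die "Cannot open $filename for writing: $!";
print $out "\x{feff}\x{8765}\n\x{8765}\n";
close $out or die "Cannot close $filename: $!";

# Read the lines back through the same stack.
open my $in, "<$layers", $filename
    or die "Cannot open $filename for reading: $!";
my @lines = <$in>;
close $in;

# With crlf above the encoding layer, \r\n has already become \n here,
# so a plain chomp is enough.
chomp @lines;
$lines[0] =~ s/^\x{feff}//;    # drop the BOM, which is read as a character

printf "%d lines read, stray CRs: %d\n",
    scalar @lines, scalar grep { /\r/ } @lines;
```

If the crlf layer were in the wrong position, the lines would still end in \r after chomp and the second number would be non-zero.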

OK, so far so good. OTOH, is it only me thinking this is somewhat too involved for the average programmer looking for an easy, straightforward way to handle UCS-2 data? Is there a less cumbersome way to achieve the same effect? Or is this even a bug, or a known but as yet unresolved issue?

This isn't UCS-2 specific, BTW. Any encoding with a minimal character size of more than one byte (like UTF-16, UTF-32) should pose similar problems...
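The same in-memory experiment as for UCS-2LE can illustrate this for other wide encodings. Again only a sketch: the byte-level substitution imitates a crlf layer sitting below the encoding layer, and mangled_dump is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode "A\n", then do a byte-level \n -> \r\n
# translation, and return the result as space-separated hex.
sub mangled_dump {
    my ($encoding) = @_;
    my $bytes = encode($encoding, "A\n");
    $bytes =~ s/\x0a/\x0d\x0a/g;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

print "UTF-16LE: ", mangled_dump('UTF-16LE'), "\n";   # 41 00 0d 0a 00
print "UTF-32LE: ", mangled_dump('UTF-32LE'), "\n";   # 41 00 00 00 0d 0a 00 00 00
```

In both cases the result is no longer a whole number of code units (5 bytes instead of 6, 9 instead of 12), which is the same kind of damage as in the UCS-2LE dump.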


In reply to PerlIO: crlf layer on Windows interfering with UCS-2 unicode by almut
