PerlIO: crlf layer on Windows interfering with UCS-2 unicode

almut has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone,

I need to read/write UCS-2 unicode files on Windows. I thought that specifying the appropriate PerlIO layer with open or binmode should suffice. However, this naive approach doesn't seem to work.

For example, trying to write a unicode UCS-2LE file like this (which is supposed to create two lines, each containing the unicode character codepoint U+8765)

my $filename = "test.ucs2le";
open my $fh, ">:encoding(ucs-2le)", $filename
             or die "Cannot open $filename for writing: $!";
print $fh "\x{feff}\x{8765}\n\x{8765}\n";
close $fh;

produces an incorrectly encoded file on Windows (works fine on Unix). The output file displays as garbage in unicode capable editors like notepad, and produces "UCS-2LE:Partial character ..." warnings, when you try to read the file back in from Perl.

Inspecting the hex dump of the file (e.g. with "od -tx1 -An test.ucs2le" on unix/cygwin),

 ff fe 65 87 0d 0a 00 65 87 0d 0a 00

shows that the newline characters \n (or 0a in hex) have been replaced by \r\n (0d 0a in hex). That much is expected, except that with the 2-byte wide UCS-2 encoding, the 000a should've been turned into 000d 000a. IOW, the proper UCS-2LE encoding would have been:

 ff fe 65 87 0d 00 0a 00 65 87 0d 00 0a 00
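The difference between the two dumps can be reproduced on any platform without touching the layer stack. The following is only a sketch: it uses Encode directly and a plain substitution to imitate what a crlf layer does in each position, and the helper name hex_dump is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex, similar to "od -tx1".
sub hex_dump { return join ' ', unpack '(H2)*', $_[0] }

my $text = "\x{feff}\x{8765}\n\x{8765}\n";

# Wrong order: encode first, then put a 0d byte in front of every 0a byte
# (roughly what a crlf layer sitting below the encoding layer does).
(my $wrong = encode('UCS-2LE', $text)) =~ s/\x0a/\x0d\x0a/g;

# Right order: turn "\n" into "\r\n" on characters, then encode.
(my $crlf_text = $text) =~ s/\n/\r\n/g;
my $right = encode('UCS-2LE', $crlf_text);

print hex_dump($wrong), "\n";   # ff fe 65 87 0d 0a 00 65 87 0d 0a 00
print hex_dump($right), "\n";   # ff fe 65 87 0d 00 0a 00 65 87 0d 00 0a 00
```

The first line matches the broken file from Windows, the second the encoding we actually want.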

Looking at the PerlIO layer stack which is in effect when specifying :encoding(ucs-2le), reveals that the crlf layer (windows-specific default) is being applied after the UCS-2LE layer has turned characters into 2-byte values:

my $filename = "test.ucs2le";
open my $fh, ">:encoding(ucs-2le)", $filename or die;
my @layers = PerlIO::get_layers($fh); print "@layers\n";

outputs

unix crlf encoding(UCS-2LE) utf8

(Note that, when writing, layers are being applied from right-to-left, while when reading, they're being applied from left-to-right. IOW, the left hand side of the layer stack as shown corresponds to the external side (file), and the right hand side is the Perl-internal data representation.)

Trying to find a workaround, I've been fiddling with this for quite a while. Finally, I came up with the following layer stack, which seems to do the trick:

my $filename = "test.ucs2le";
open my $fh, ">:raw:encoding(ucs-2le):crlf:utf8", $filename
             or die "Cannot open $filename for writing: $!";
print $fh "\x{feff}\x{8765}\n\x{8765}\n";
close $fh;

The :raw:encoding(ucs-2le):crlf:utf8 results in the following layers:

unix encoding(UCS-2LE) utf8 crlf utf8

:raw removes the initial default crlf layer, :encoding(ucs-2le) adds the desired UCS-2 layer plus an automatically appended utf8, :crlf puts the crlf layer in its proper position (such that it is being applied before conversion to 2-byte values happens), and the final :utf8 adds another utf8 layer. The latter is required because the crlf layer apparently is removing the UTF8-ness, without which unicode data would not be handled properly.

Although the duplicated utf8 layer doesn't seem to cause any problems, I'm not entirely sure if it'd always be completely free of side effects. (I haven't found a way to get rid of the first utf8 ... trying :pop to remove it is futile, as this also pops encoding(ucs-2le))

The same layers are needed for reading UCS-2 data, of course. In this case, the crlf conversion (i.e. \r\n --> \n) has to work on decoded characters, i.e. after the data has passed the UCS-2 filter. Otherwise, the crlf layer would not detect the \r\n sequences (in the raw bytes, 0d and 0a are separated by a 00 byte), and we'd be left with an extraneous \r char at the end of every line (in which case chomp, with its default $/="\n", would no longer work as intended; and all kinds of other potential problems...).
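The reading direction can be illustrated the same way, again only as a sketch with Encode and plain substitutions standing in for the layers (the variable names are made up for this example, and the BOM is left out to keep the data short):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes of a correctly encoded UCS-2LE file with CRLF line endings.
my $bytes = "\x65\x87\x0d\x00\x0a\x00\x65\x87\x0d\x00\x0a\x00";

# Wrong order: look for the byte pair 0d 0a before decoding. The pair
# never occurs, because a 00 byte sits between 0d and 0a.
(my $stripped = $bytes) =~ s/\x0d\x0a/\x0a/g;
my @wrong = split /\n/, decode('UCS-2LE', $stripped);

# Right order: decode first, then convert "\r\n" to "\n" on characters.
(my $text = decode('UCS-2LE', $bytes)) =~ s/\r\n/\n/g;
my @right = split /\n/, $text;

# Count the lines that still carry a trailing CR.
my $wrong_cr = grep { /\r\z/ } @wrong;
my $right_cr = grep { /\r\z/ } @right;

print "wrong order: $wrong_cr line(s) still end in CR\n";   # 2
print "right order: $right_cr line(s) still end in CR\n";   # 0
```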

OK, so far so good. OTOH, is it only me thinking this is somewhat too involved for the average programmer looking for an easy, straightforward way to handle UCS-2 data? Is there a less cumbersome way to achieve the same effect? Or is this even a bug - or a known but yet unresolved issue?

This isn't UCS-2 specific, BTW. Any encoding with a minimal character size of more than one byte (like UTF-16, UTF-32) should pose similar problems...

Thanks,
Almut


Replies are listed 'Best First'.
Re: PerlIO: crlf layer on Windows interfering with UCS-2 unicode by ikegami (Patriarch) on Jun 03, 2008 at 10:53 UTC
Surprisingly, the trick only works for `open`, not `binmode`.

use strict;
use warnings;

sub dump_layers(*) {
    my @layers = PerlIO::get_layers($_[0]);
    print STDERR "@layers\n";
}

my $file = 'temp';

{
    open(my $fh, '>:raw:encoding(ucs-2le):crlf:utf8', $file) or die;
    dump_layers($fh);
}

{
    open(my $fh, '<:raw:encoding(ucs-2le):crlf:utf8', $file) or die;
    dump_layers($fh);
}

unlink $file;

binmode STDOUT, ':raw:encoding(ucs-2le):crlf:utf8' or die;
dump_layers STDOUT;

binmode STDIN, ':raw:encoding(ucs-2le):crlf:utf8' or die;
dump_layers STDIN;

unix encoding(UCS-2LE) utf8 crlf utf8
unix encoding(UCS-2LE) utf8 crlf utf8
unix crlf encoding(UCS-2LE) utf8
unix crlf encoding(UCS-2LE) utf8

The only solution I've found follows, but I don't like it.

binmode $fh, ':raw:pop:encoding(ucs-2le):crlf:utf8';
#                  ^^^

The issue is that `:raw` disables the existing `:crlf` layer (but doesn't remove it) when using `binmode`. Then, the later `:crlf` reactivates the earlier `:crlf` layer instead of adding a new layer, messing everything up.
Re: PerlIO: crlf layer on Windows interfering with UCS-2 unicode by ikegami (Patriarch) on May 19, 2008 at 22:56 UTC
Great post. I ++ed it when you originally posted it. I would just like to clarify one statement.

"Any encoding with a minimal character size of more than one byte (like UTF-16, UTF-32) should pose similar problems."

That's not a sufficient condition. For example, the layer order should interfere with EBCDIC-based encodings, even if each character maps to a single byte. The following encodings should be negatively affected by the default layer order:

- Encodings which map LF to something other than 0A.
- Encodings which map CR to something other than 0D.
- Encodings which map any character other than LF and CR to a sequence of bytes that contains 0A or 0D.
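A rough way to probe these conditions for a given encoding, as a sketch only: the function name newline_bytes_are_ascii is made up for this example, it covers just the first two conditions, and the third one is shown for a single sample character rather than checked exhaustively.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Returns true if LF and CR keep their ASCII byte values in $encoding.
# This covers only the first two conditions listed above.
sub newline_bytes_are_ascii {
    my ($encoding) = @_;
    return encode($encoding, "\n") eq "\x0a"
        && encode($encoding, "\r") eq "\x0d";
}

for my $encoding (qw(UTF-8 iso-8859-1 UCS-2LE UTF-16LE UTF-32LE)) {
    printf "%-10s %s\n", $encoding,
        newline_bytes_are_ascii($encoding) ? 'ok' : 'affected';
}

# Third condition, for one sample character only:
# U+010A encodes to the bytes 0a 01 in UCS-2LE, so a crlf layer
# working on the encoded bytes would corrupt it.
my $sample = encode('UCS-2LE', "\x{010a}");
print "U+010A contains a 0a byte in UCS-2LE\n" if $sample =~ /\x0a/;
```

UTF-8 and iso-8859-1 pass the first two checks, while the UCS-2/UTF-16/UTF-32 variants are reported as affected.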
Re: PerlIO: crlf layer on Windows interfering with UCS-2 unicode by ikegami (Patriarch) on Feb 15, 2010 at 06:43 UTC
Note that your proposed solution disables buffering. Replace

open my $fh, ">:raw:encoding(ucs-2le):crlf:utf8", $qfn

with

open my $fh, ">:raw:perlio:encoding(ucs-2le):crlf:utf8", $qfn

to support buffering. Also, there doesn't appear to be a need to specify `:utf8`, so all you need is

open my $fh, ">:raw:perlio:encoding(ucs-2le):crlf", $qfn
Re: PerlIO: crlf layer on Windows interfering with UCS-2 unicode (5.8.9) by ikegami (Patriarch) on Dec 17, 2008 at 20:47 UTC
perl589delta claims "Using `:crlf` and `UTF-16` IO layers together will now work." I don't see any difference, though. Any ideas?

