
CR-LF on UTF-16LE files on Windows

by vitoco (Hermit)
on Nov 07, 2018 at 16:20 UTC (id://1225363)

vitoco has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I need to update data in some UTF-16LE files (with a BOM), but I've found that the output files are being corrupted with an extra 0x0D byte on every line. How can I avoid this? What am I missing?

#!perl
use strict;
use warnings;

open IN, "<:encoding(UTF-16LE)", "in.xml";
open OUT, ">:encoding(UTF-16LE)", "out.xml";

while (<IN>) {
    print OUT $_;
}

close OUT;
close IN;

As can be seen, there is no chomp, explicit decode, or anything else in the loop; it just prints what was read. In the input file, each line ends with the hex sequence 0D-00-0A-00, but in the output the sequence is 0D-00-0D-0A-00, which is one byte longer.
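One way to confirm the symptom is to scan the raw bytes for the two sequences quoted above. This is only a rough sketch: `count_line_endings` and `slurp_raw` are hypothetical helper names, and the byte patterns are matched without checking two-byte alignment, so unusual content could produce false hits.

```perl
use strict;
use warnings;

# Count intact (0D 00 0A 00) and corrupted (0D 00 0D 0A 00) line endings
# in a string of raw UTF-16LE bytes. Alignment is NOT checked.
sub count_line_endings {
    my ($bytes) = @_;
    my $bad  = () = $bytes =~ /\x0D\x00\x0D\x0A\x00/g;
    my $good = () = $bytes =~ /\x0D\x00\x0A\x00/g;
    return ($good, $bad);
}

# Read a whole file as raw bytes, with no layers that could alter them.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;
    return scalar <$fh>;
}

# Usage: perl check.pl out.xml
if (@ARGV) {
    my ($good, $bad) = count_line_endings(slurp_raw($ARGV[0]));
    print "$ARGV[0]: $good intact, $bad corrupted line endings\n";
}
```

Run against `in.xml` and `out.xml` in turn, this should report corrupted endings only for the output file.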

BTW, I don't want to parse the XML files as such; I'll just map some values based on an external dictionary (using a pre-loaded hash)...

EDIT: typo... Thanks!!!

Replies are listed 'Best First'.
Re: CR-LF on UTF-16LE files on Windows
by ikegami (Patriarch) on Nov 07, 2018 at 17:31 UTC

    A :crlf layer is automatically added on Windows. :crlf converts 0D 0A into 0A on read, and it converts 0A into 0D 0A on write.

    :crlf is unfortunately added before the explicitly-specified layers, so it's performing the conversion on the encoded strings when it should be performed on the decoded strings. For ASCII-based encodings (e.g. UTF-8), this isn't a problem. But for UTF-16LE, this does the wrong thing: a line feed is encoded as the two bytes 0A 00, so inserting 0D in front of the 0A leaves a stray byte in the middle of the stream.
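The effect of the ordering can be reproduced on any platform without touching a file. The sketch below does not use the real :crlf layer; it emulates the layer's write-side translation with a plain substitution, applied once to the encoded bytes and once to the decoded text. `hex_bytes` is a hypothetical helper used only for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes { return uc join ' ', unpack '(H2)*', $_[0] }

# Default order on Windows: the line keeps its CR LF on read (the bytes
# 0D 00 0A 00 contain no adjacent 0D 0A pair), is encoded on write, and
# only then has LF -> CR LF applied to the encoded bytes.
my $wrong = encode('UTF-16LE', "\x0D\x0A");
$wrong =~ s/\x0A/\x0D\x0A/g;

# Intended order: CR LF became LF on read, so translate the decoded
# text back to CR LF first, then encode.
my $text = "\x0A";
$text =~ s/\x0A/\x0D\x0A/g;
my $right = encode('UTF-16LE', $text);

print 'translated after encoding:  ', hex_bytes($wrong), "\n";  # 0D 00 0D 0A 00
print 'translated before encoding: ', hex_bytes($right), "\n";  # 0D 00 0A 00
```

The first result matches the corrupted sequence reported in the question, and the second matches the line ending of the input file.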

    You can address the problem by using the following:

    open my $IN, "<:raw:encoding(UTF-16LE):crlf", "in.xml";
    open my $OUT, ">:raw:encoding(UTF-16LE):crlf", "out.xml";

    :raw prevents the :crlf layer from being added in the first place, and then we add it explicitly on the right side* of the :encoding layer.
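To see where each layer ends up, the stack on a handle can be listed with the built-in `PerlIO::get_layers`. This is a minimal sketch: `layers_for` is a hypothetical helper, the scratch file name is arbitrary, and the exact names printed vary by platform and Perl build, so no output is shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory, removed automatically when the program exits.
my $dir = tempdir(CLEANUP => 1);

# Open a scratch file with the given mode string and return the names
# of the layers on the handle, bottom of the stack first.
sub layers_for {
    my ($mode) = @_;
    open my $fh, $mode, "$dir/probe.txt" or die "open failed: $!";
    my @layers = PerlIO::get_layers($fh);
    close $fh;
    return @layers;
}

print "default:   @{[ layers_for('>:encoding(UTF-16LE)') ]}\n";
print "reordered: @{[ layers_for('>:raw:encoding(UTF-16LE):crlf') ]}\n";
```

On Windows, the first line should show crlf below the encoding layer, while the second should show it above the encoding layer on any platform.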

    (Note that I also replaced the needless use of global variables with the use of lexically-scoped variables.)


    * — Pun intended.
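Putting the pieces together, the whole task from the question might look like the sketch below. It uses the layer order from this reply, adds error checking, and applies a dictionary lookup to each line. The `rewrite_utf16le` name, the `%map` contents, and the substitution strategy are assumptions made for illustration, not the original poster's code.

```perl
use strict;
use warnings;

# Hypothetical dictionary; the real one would be loaded from elsewhere.
my %map = ( 'old-value' => 'new-value' );

# Copy $in_path to $out_path, replacing every dictionary key with its value.
sub rewrite_utf16le {
    my ($in_path, $out_path, $map) = @_;

    # :raw drops the default :crlf; :crlf is then re-added above :encoding.
    open my $in, '<:raw:encoding(UTF-16LE):crlf', $in_path
        or die "Cannot open $in_path: $!";
    open my $out, '>:raw:encoding(UTF-16LE):crlf', $out_path
        or die "Cannot open $out_path: $!";

    # One alternation of all keys, longest first so longer keys win.
    my $keys = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %$map;

    while (my $line = <$in>) {
        # Inside the loop, lines end in a plain "\n".
        $line =~ s/($keys)/$map->{$1}/g if length $keys;
        print {$out} $line;
    }

    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}

# Usage: perl rewrite.pl in.xml out.xml
rewrite_utf16le(@ARGV[0, 1], \%map) if @ARGV == 2;
```

Because the UTF-16LE encoding does not treat the BOM specially, it arrives as U+FEFF at the start of the first line and is written back unchanged.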

      Ok, that worked as expected. I need to read more about layers to fully understand this.

      Thanks!

        :crlf converts 0D 0A into 0A on read, and it converts 0A into 0D 0A on write. This was being done to the encoded strings when it should have been done to the decoded strings.

        (My earlier post has been edited to integrate this.)
