Re: Chicanery Needed to Handle Unicode Text on Microsoft Windows

by kcott (Archbishop)
on Oct 30, 2010 at 09:15 UTC


in reply to Chicanery Needed to Handle Unicode Text on Microsoft Windows

As far as I can see, the only reason you need :crlf is because you've specifically added the UNIX line ending (\n) to your output. It would be better to use the platform-independent $/. The :raw layer should preserve the line endings. So that reduces the chicanery somewhat.

Except for ASCII files, binmode($file_handle) was required on MSWin32 systems. :raw performs the same function so, while perhaps appearing to add to the chicanery, it certainly reduces the amount of code.
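Here is a minimal sketch of that equivalence, not taken from the original post. The helper names slurp_with_mode and slurp_with_binmode and the sample data are made up for illustration. It reads the same CRLF-terminated file once through the :raw layer and once through a plain open followed by binmode(), then compares the bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file through the layers in $mode.
sub slurp_with_mode {
    my ($mode, $path) = @_;
    open my $fh, $mode, $path or die "Cannot open $path: $!";
    local $/;    # slurp mode: read everything in one go
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Hypothetical helper: plain open, then binmode() on the handle.
sub slurp_with_binmode {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    local $/;
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Sample data (an assumption): two CRLF-terminated lines written as raw bytes.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "Line 1\r\nLine 2\r\n";
close $tmp_fh;

my $via_layer   = slurp_with_mode('<:raw', $tmp_path);
my $via_binmode = slurp_with_binmode($tmp_path);
print $via_layer eq $via_binmode ? "same bytes\n" : "different bytes\n";
```

On platforms where the default layers do no CRLF translation, both calls are effectively no-ops, so the comparison is only interesting on MSWin32.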

I don't have sufficient knowledge of UTF-16 to address that aspect of your post. What I would suggest is that, after removing :crlf and changing \n to $/, you try your test code without :perlio. You may still need it but it wouldn't hurt to check.

I agree there's a lot of Unicode-related documentation; however, everything I've made reference to is available here: PerlIO.

I ran a series of tests; the results follow.

Starting code:

#!perl

use 5.12.0;
use warnings;

my $in_file  = $^O eq 'MSWin32' ? 'utf16_LE_prob.dos_dat' : 'utf16_LE_prob.unix_dat';
my $out_file = $^O eq 'MSWin32' ? 'utf16_LE_prob.dos_out' : 'utf16_LE_prob.unix_out';
my $in_mode  = $^O eq 'MSWin32' ? '<:raw' : '<';
my $out_mode = $^O eq 'MSWin32' ? '>:raw' : '>';

open my $in_fh, $in_mode, $in_file or die $!;
open my $out_fh, $out_mode, $out_file or die $!;

while (my $line = <$in_fh>) {
    print $out_fh $line;
}

close $out_fh;
close $in_fh;

Input files in UNIX and DOS formats:

$ cat -vet utf16_LE_prob.unix_dat utf16_LE_prob.dos_dat
Line 1$
Line 2$
$
Line 1^M$
Line 2^M$
^M$

Output after running on UNIX platform:

$ cat -vet utf16_LE_prob.unix_out
Line 1$
Line 2$
$

Output after running on DOS platform:

$ cat -vet utf16_LE_prob.dos_out
Line 1^M$
Line 2^M$
^M$

Changing the while loop to chomp input and add $/ (not \n) to output:

while (my $line = <$in_fh>) {
    chomp $line, print $out_fh $line, $/;
}

New output:

$ cat -vet utf16_LE_prob.unix_out utf16_LE_prob.dos_out
Line 1$
Line 2$
$
Line 1^M$
Line 2^M$
^M$

Adding :crlf to the MSWin32 input and output modes (now :raw:crlf) produces no change:

$ cat -vet utf16_LE_prob.unix_out utf16_LE_prob.dos_out
Line 1$
Line 2$
$
Line 1^M$
Line 2^M$
^M$

With :raw:perlio:crlf, there's no change:

$ cat -vet utf16_LE_prob.unix_out utf16_LE_prob.dos_out
Line 1$
Line 2$
$
Line 1^M$
Line 2^M$
^M$

And, for completeness, with :raw:perlio, there's no change:

$ cat -vet utf16_LE_prob.unix_out utf16_LE_prob.dos_out
Line 1$
Line 2$
$
Line 1^M$
Line 2^M$
^M$

-- Ken

Replies are listed 'Best First'.
Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by Jim (Curate) on Oct 30, 2010 at 18:36 UTC
    As far as I can see, the only reason you need :crlf is because you've specifically added the UNIX line ending (\n) to your output.

    :crlf is needed here to get the same platform-independent line-ending handling of plain text files Perl has always supported. Without it, the line-ending handling is badly broken. Half of the line-ending character pair CRLF is missed.

    D:\>cat Demo.pl
    #!perl

    use strict;
    use warnings;

    open my $input_fh, '<:raw:perlio:encoding(UTF-16LE)', 'Input.txt';

    while (my $line = <$input_fh>) {
        chomp $line;
        print "There's an unexpected/unwanted CR at the end of the line\n"
            if $line =~ m/\r$/;
    }

    D:\>file Input.txt
    Input.txt: Text file, Unicode little endian format

    D:\>cat Input.txt
    We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.

    D:\>perl Demo.pl Input.txt
    There's an unexpected/unwanted CR at the end of the line
    There's an unexpected/unwanted CR at the end of the line
    There's an unexpected/unwanted CR at the end of the line
    There's an unexpected/unwanted CR at the end of the line
    There's an unexpected/unwanted CR at the end of the line

    D:\>
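The same effect can be sketched in a self-contained way that does not depend on Input.txt. This is an illustration under assumptions, not the code from the post: count_stray_cr is a made-up helper, the sample file is generated on the fly as UTF-16LE without a byte order mark, and the layer strings omit the :perlio layer used above. It counts the lines that still end in CR after chomp, first without and then with :crlf on top of the encoding layer.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper: count chomped lines that still end in a CR
# when the file is read through the given layers.
sub count_stray_cr {
    my ($layers, $path) = @_;
    open my $fh, "<$layers", $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $count++ if $line =~ /\r\z/;
    }
    close $fh;
    return $count;
}

# Sample data (an assumption): two CRLF-terminated lines as UTF-16LE bytes.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} encode('UTF-16LE', "Line 1\r\nLine 2\r\n");
close $tmp_fh;

printf "without :crlf: %d stray CR(s)\n",
    count_stray_cr(':raw:encoding(UTF-16LE)', $tmp_path);
printf "with :crlf:    %d stray CR(s)\n",
    count_stray_cr(':raw:encoding(UTF-16LE):crlf', $tmp_path);
```

The order of the layers matters in this sketch: :crlf sits above the encoding layer, so it sees decoded characters rather than the two-byte UTF-16 code units.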

    And as Anonymous Monk has already pointed out, \n is the express mechanism in Perl intended to make line-ending handling platform-independent. It is defined to mean not the LF-only Unix line ending, but whatever character or character combination terminates lines of plain text files on the platform in use.
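One way to see how this plays out in practice is to look at the bytes that reach the disk. In the sketch below (bytes_written is a hypothetical helper, and the temporary file is only there for illustration) the program always prints the same "Hello\n" string; it is the I/O layer on the handle that decides whether that becomes LF or CR LF.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given output layers and
# return the bytes that actually landed on disk, read back with :raw.
sub bytes_written {
    my ($layers, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, ">$layers", $path or die "Cannot open $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Show the bytes as hex so the line ending is visible.
for my $layers (':raw', ':crlf') {
    my $hex = unpack('H*', bytes_written($layers, "Hello\n"));
    print "$layers => $hex\n";
}
```

With :raw the hex dump ends in 0a, and with :crlf it ends in 0d0a, even though the script itself only ever prints "\n".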

    It would be better to use the platform-independent $/.

    No it wouldn't. And even if it were better, how would someone new to Perl ever figure that out? I've been programming Perl for years and I've never once seen $/ used in place of the usual and ordinary \n. chomp()-ing and "...\n"-ing are the long-lived and ubiquitous standard idioms.

    #!perl
    print "Hello, world\n";

    Except for ASCII files, binmode($file_handle) was required on MSWin32 systems. :raw performs the same function so, while perhaps appearing to add to the chicanery, it certainly reduces the amount of code.

    But this is the whole point. The file named Input.txt is not a binary file; it's a plain text file. All the Unicode files I want to manipulate on Microsoft Windows using Perl, the text-processing scripting language, are plain text files. binmode() and :raw are lies. Chicanery.

    In my humble opinion, this should work on a Unicode UTF-16 file with a byte order mark.

    #!perl

    use strict;
    use warnings;

    open my $input_fh, '<', 'Input.txt';
    open my $output_fh, '>', 'Output.txt';

    while (my $line = <$input_fh>) {
        chomp $line;
        print $output_fh "$line\n";
    }

    It seems perfectly reasonable to me to expect the scripting language to determine the character encoding of the file all by its little lonesome — it only has to read the first two bytes of the file — and just to do the right thing.
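Perl's plain open does not do this on its own, but the "read the first two bytes" step is easy to sketch by hand. The helper below, sniff_utf16_bom, is hypothetical and deliberately simplistic: it only recognises the two UTF-16 byte order marks, and it would misreport a UTF-32LE file, whose mark begins with the same two bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a UTF-16 encoding name from the first two bytes.
# Returns 'UTF-16LE', 'UTF-16BE', or undef when no UTF-16 BOM is found.
sub sniff_utf16_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read($fh, my $bom, 2);
    close $fh;

    return undef unless defined $got && $got == 2;
    return 'UTF-16LE' if $bom eq "\xFF\xFE";
    return 'UTF-16BE' if $bom eq "\xFE\xFF";
    return undef;
}

# Usage: perl sniff.pl Input.txt   (the script name is an assumption)
if (@ARGV) {
    my $encoding = sniff_utf16_bom($ARGV[0]);
    print defined $encoding ? "$encoding\n" : "no UTF-16 BOM found\n";
}
```

The returned name could then be spliced into a layer string such as the one used in Demo.pl. Note that the byte order mark itself would still be the first character read through the -LE or -BE encodings, so the caller would have to skip it.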

      The documentation on :raw says that CRLF conversion is turned off. It appears that \n in the print statement is represented as CRLF before the arguments to print enter the output stream so \n can be used as normal.

      Changing $/ to \n in my tests (not surprisingly) produces the same results.

      -- Ken

Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by Anonymous Monk on Oct 30, 2010 at 10:47 UTC
    Um, the purpose of crlf is so that you can use \n and it will do the appropriate thing for your platform -- \n is portable.

      See response to Jim (below).

      -- Ken
