http://qs321.pair.com?node_id=849954

elef has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I have two related questions on the handling of UTF-8 characters.

1: How do you write a script that can take user input via a variable and write it to a UTF-8 text file? Here's what I have (the commented lines are my subsequent attempts to correct the issue, which failed).

    #!/usr/bin/perl
    #use utf8;
    print "Text? ";
    chomp ($note = <STDIN>);
    print "\nText: ${note}";
    #open(TEST, ">>:encoding(UTF-8)", "test.txt") or die "Can't open UTF-8 encoded file: $!";
    open(TEST, ">>", "test.txt") or die "Can't open file: $!";
    print TEST "\nDirectly from the script: áéíóö&#337;úü&#369;\n";
    print TEST "\nUser input via variable: $note\n";
    close TEST;
    <STDIN>;

Note: I'm getting HTML character codes instead of the ő and ű letters in this post in the code... Character encoding strikes again.

As you can see, it takes user input and writes it to a file, along with some accented characters that are hardcoded into the script. The script itself is saved in UTF-8 to allow the use of all accented characters. It works fine on Ubuntu but fails on XP for me. On XP, the characters are printed correctly in the command line window by the print "\nText: ${note}"; line but they are corrupted in the file. The hardcoded stuff is fine, but if I type in the same accented letters when the script runs, they are mis-encoded.

By the way, the larger script this is a part of also reads accented characters from a UTF-8 file and writes them to another file, and that works fine on both Ubuntu and XP. So, essentially, I only have trouble with non-ascii characters if they are stored in a variable and written to a file from there. Any ideas?

2: I'm trying to get Spreadsheet::WriteExcel to work on UTF-8 files, and it's not looking very good. Here's my code for writing all lines of a file into Column A of a new spreadsheet:

    #!/usr/bin/perl
    use warnings;
    use Spreadsheet::WriteExcel;

    # Create a new Excel workbook
    my $workbook = Spreadsheet::WriteExcel->new('perl.xls');

    # Add a worksheet
    $worksheet = $workbook->add_worksheet;

    # write file to column A
    open (IN, "column1.txt");
    $count = 0;
    while (<IN>) {
        $count ++;
        chomp ($_);
        $worksheet->write("A$count", $_);
    }
    close IN;
    <STDIN>;

I've been trying to read up on whether and how Spreadsheet::WriteExcel can handle UTF-8 characters, but I found no clear info. (Spreadsheet::WriteExcel: http://search.cpan.org/~jmcnamara/Spreadsheet-WriteExcel/lib/Spreadsheet/WriteExcel.pm ; its info on Unicode: http://search.cpan.org/~jmcnamara/Spreadsheet-WriteExcel/lib/Spreadsheet/WriteExcel.pm#UNICODE_IN_EXCEL - this seems to say what I'm trying should work - I have Perl 5.10)

This code does what I want on both Ubuntu and XP (the xls is created with the right content), but accented characters are corrupted on both OSes.

Thanks for any help!

Replies are listed 'Best First'.
Re: UTF-8 issues with Perl in general and with Spreadsheet::WriteExcel
by moritz (Cardinal) on Jul 16, 2010 at 09:49 UTC
    Please note that you have an asymmetry in your code: You want to encode output, but don't decode input. This is why things go all wrong, and you see Mojibake in your output.

    So, whenever you read something from STDIN, also do

        binmode STDIN, ':encoding(UTF-8)';
        # then you can do:
        while (<STDIN>) {
            # work with $_ here
        }

    This decodes the input. Then use utf8; to tell perl that your script is written in UTF-8 (note that ASCII is a valid subset of UTF-8).
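
The symmetry described above (decode on input, encode on output) can be sketched as follows. This is a minimal illustration, not the original poster's script: in-memory buffers stand in for the keyboard and for test.txt so it runs anywhere, and the sample bytes are hypothetical. It assumes the script file itself is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8

# Hypothetical stand-in for keyboard input: the UTF-8 bytes of "őű" plus newline.
my $input_bytes = "\xC5\x91\xC5\xB1\n";

# Decode on the way in.
open my $in, '<:encoding(UTF-8)', \$input_bytes
    or die "Can't open input buffer: $!";
chomp(my $note = <$in>);
close $in;

# Encode on the way out (an in-memory buffer stands in for test.txt).
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes
    or die "Can't open output buffer: $!";
print {$out} "Directly from the script: őű\n";
print {$out} "User input via variable: $note\n";
close $out;

# $note holds characters, not bytes, so its length is counted in characters.
print "characters read: ", length($note), "\n";
```

With a real terminal, the same idea would use binmode on STDIN and a real file name in the second open; whether UTF-8 is the right input encoding depends on the terminal.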

    Test that your terminal actually understands UTF-8, as described in this article, which also might be of general interest for you.

    Spreadsheet::WriteExcel works correctly if you supply it with decoded text strings.

    Update: clarified what I mean with the while-loop.

    Perl 6 - links to (nearly) everything that is Perl 6.
      Thank you, that sounds convincing.

      I tried simply adding "<:utf8" to the open command in the Spreadsheet::WriteExcel script and it seems to have fixed the problem.

      I understand how the same concept applies to input from STDIN in the first script, but I don't understand the actual code. Why is the while loop necessary and what do I put inside the loop? And what's the scope of "binmode STDIN..."? All that follows or just the next instance when STDIN is used? (i.e. do I just include it once at the start of the script or before each input from STDIN? - your post seems to suggest I need to add this line every time I expect input with fancy characters.)

        do I just include it once at the start of the script or before each input from STDIN?

        Including it once is sufficient.  It adds another PerlIO layer to the file handle (STDIN here), which remains in effect for the lifetime of the file handle (or until you change it again with another binmode).
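
A small sketch of that behaviour, using an in-memory buffer as a hypothetical stand-in for STDIN: the layer is set once and both later reads come back decoded.

```perl
use strict;
use warnings;

# Hypothetical stand-in for STDIN: two lines of UTF-8 bytes ("é" and "ő").
my $bytes = "\xC3\xA9\n\xC5\x91\n";
open my $fh, '<', \$bytes or die "Can't open input buffer: $!";

binmode $fh, ':encoding(UTF-8)';    # set the layer once

chomp(my $first  = <$fh>);          # decoded
chomp(my $second = <$fh>);          # still decoded, no second binmode needed
close $fh;

print "first: ", length($first), " character, ",
      "second: ", length($second), " character\n";
```

Each line is two bytes in the buffer but one character after decoding, which is what the lengths report.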

      Well, I only use STDIN to get user input, which is always just one line, which I store in a variable and then use it for whatever purpose later... So I don't see how a while loop would be useful. Anyway, the more I know about this stuff, the less I understand it. I tried just adding binmode STDIN, ':encoding(UTF-8)'; to the script above, now I get a different problem: error messages of this sort: utf8 "\xFB" does not map to Unicode at [script] line 8. The output file contains the character codes instead of the characters: \xFB\x{32CB8E1}\x82\xA0

      Maybe I should be using encode() and decode() but I just don't know how they relate to "use utf8", and "binmode :encoding(UTF-8)". This is a huge mess and I feel like I'm having to fight a hundred dragons just to get some damned characters to display correctly. Why everything isn't in UTF-8 in the first place is beyond me, it's 2010 for God's sake!

      Anyway, I ran the test from your link ( http://perlgeek.de/en/article/encodings-and-unicode ) as well. The results are not good: all 4 lines are mojibake. The dragons are clearly winning.
        So I don't see how a while loop would be useful

        It was an example, with the purpose of demonstrating that you need to set the IO layer only once, and not before every reading operation. Of course you are welcome to deviate from the example.

        utf8 "\xFB" does not map to Unicode at script line 8.

        That means that your input is not in UTF-8. Find out which character encoding it is, and use the name in the :encoding($encoding_name) IO layer.
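
As a sketch of what that looks like: the example below assumes the console uses code page 852. That is only a guess (the thread never confirms it), chosen because bytes such as \xFB, \x82 and \xA0 are accented letters in that code page. The variable name and the sample bytes are hypothetical, and an in-memory buffer stands in for the console.

```perl
use strict;
use warnings;

# Assumption: replace with whatever encoding the console really uses.
my $encoding_name = 'cp852';

# Hypothetical console input: bytes FB 82 A0 are "ű", "é", "á" in cp852.
my $console_bytes = "\xFB\x82\xA0\n";
open my $in, "<:encoding($encoding_name)", \$console_bytes
    or die "Can't open input buffer: $!";
chomp(my $text = <$in>);
close $in;

# Write the decoded text out as UTF-8 (buffer stands in for test.txt).
my $file_bytes = '';
open my $out, '>:encoding(UTF-8)', \$file_bytes
    or die "Can't open output buffer: $!";
print {$out} $text;
close $out;

print "read ", length($text), " characters, wrote ",
      length($file_bytes), " bytes\n";
```

Three input bytes become three characters, which become six bytes of UTF-8 on output. With the wrong encoding name the decode step either fails or silently produces the wrong characters, so the name has to match the actual console.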

        Maybe I should be using encode() and decode() but I just don't know how they relate to "use utf8", and "binmode :encoding(UTF-8)".

        use utf8; has the same effect as adding a decode_utf8 before every string literal in your program. The :encoding(UTF-8) IO layer has the same effect as wrapping input operations in decode calls and output operations in encode calls.
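
To make that equivalence concrete, here is a minimal sketch that encodes one string by hand and once through the IO layer, then compares the resulting bytes. The sample character and the in-memory buffer are illustrative choices, not taken from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# The two UTF-8 bytes of "ő", as they would arrive from an undecoded source.
my $bytes = "\xC5\x91";
my $text  = decode_utf8($bytes);    # one character, U+0151

# Explicit encode call on output ...
my $by_hand = encode_utf8("$text\n");

# ... versus letting the IO layer encode.
my $by_layer = '';
open my $out, '>:encoding(UTF-8)', \$by_layer
    or die "Can't open buffer: $!";
print {$out} "$text\n";
close $out;

print "same bytes\n" if $by_hand eq $by_layer;
```

The layer is usually the more convenient of the two, since it applies to every read or write on the handle without further calls.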

        The results are not good: all 4 lines are mojibake.

        Then your next step should be either to find out which character encoding your terminal works with, or set it up to use UTF-8.

Re: UTF-8 issues with Perl in general and with Spreadsheet::WriteExcel
by jmcnamara (Monsignor) on Jul 16, 2010 at 15:30 UTC
      The Spreadsheet::WriteExcel issue got cleared up after the first answer, everything is fine with that. Thanks for the link, though, John, it proved to be pretty handy.

      I made a couple of meek attempts at finding out what is wrong with input from console, without success. I take input from the keyboard with <STDIN>, save it in a scalar variable and write it to a utf-8 txt file, the characters get corrupted. If I binmode STDIN beforehand, I get an error message about characters not mapping to Unicode and get character codes like \x{38232E2}\x81\xA1 in the output txt. Fixing this on my own system is not the goal; this is for a script I'm distributing quite widely so it should work on any random computer (with Windows, Mac OS or Linux...)

      As it stands, I think I'll just tell the users to stick to ASCII or they may get corrupted characters. No drama, but it's a bit annoying to have to do that.

      Here's the code in case somebody wants to poke it or has an idea:
          binmode STDIN, ':encoding(utf8)';
          print "\nEnter text: ";
          chomp ($text = <STDIN>);
          open (TEST, ">:encoding(UTF-8)", "test.txt") or die "Can't open file: $!";
          print TEST $text;
Re: UTF-8 issues with Perl in general and with Spreadsheet::WriteExcel
by pysome (Scribe) on Jul 16, 2010 at 10:55 UTC
    Please try:

        use Encode;
        while (<IN>) {
            $count ++;
            chomp ($_);
            $worksheet->write("A$count", decode('utf8',$_));
        }
        close IN;
Re: UTF-8 issues with Perl in general and with Spreadsheet::WriteExcel
by pysome (Scribe) on Jul 16, 2010 at 10:36 UTC
    Can you try using the method
    add_format(%properties)
      Sorry, ignore the message.