Setting UTF-8 mode on filehandle reads?

jkahn has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to get UTF-8 encoded files to read in properly and to be parsed with character semantics after loading. It seems to me that the first two printouts should be the same, but the string read from the file while the utf8 pragma was in scope (output line 2) appears to get the wrong length.

#!perl -w
use warnings;
use strict;
{
  use utf8;
  my $string = '&#601;'; 
  # this is a schwa in UTF-8, darned handy in linguistics
  print length $string,"\t",$string, "\n";

  my $filestring = <DATA>;
  chomp $filestring;
  print length $filestring, "\t", $filestring, "\n";
  # seems like it should print "1" here... but it prints 2!
}
{
  my $string = '&#601;';
  print length $string,"\t",$string, "\n";

  my $filestring = <DATA>;
  chomp $filestring;
  print length $filestring, "\t", $filestring, "\n";
}
__DATA__
&#601;
&#601;

Note it wasn't funny ampersands in the data, but an actual utf-8 character (the upside down e, U+0259 LATIN SMALL LETTER SCHWA). (darn conversions!)

Here are the results (as pre):

1	ə
2	ə
2	ə
2	ə

It's the second line that really surprises me... shouldn't that be a '1'? The only apparent difference is that it was read off a filehandle. How can I "reset" that data to be utf8?

Here's my version of Perl (I used pre tags so that d/l code would work!):

C:\>perl -v

This is perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2001, Larry Wall

Binary build 633 provided by ActiveState Corp. http://www.ActiveState.com
Built 21:33:05 Jun 17 2002


Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'.  If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.

Anybody have any idea what's wrong here or why it gets the length wrong?



Re: Setting UTF-8 mode on filehandle reads?
by grantm (Parson) on Dec 06, 2002 at 01:14 UTC

If you were using Perl 5.8, I'd suggest pushing an encoding layer when you opened the file (or after with binmode). As you're not, I won't.
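
For readers who are on Perl 5.8 or later, here is a minimal sketch of the layer approach mentioned above. It assumes the core File::Temp module and writes a scratch file so that it is self-contained; the variable names are only illustrative, and in real code you would open your own file instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two raw UTF-8 bytes of U+0259 (schwa) to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # raw bytes, no translation
print {$out} "\xC9\x99\n";
close $out or die "close: $!";

# Option 1: name the encoding layer when opening the file.
open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $decoded = <$in>;
chomp $decoded;
close $in;

# Option 2: open normally, then push the layer afterwards with binmode.
open my $in2, '<', $path or die "open: $!";
binmode $in2, ':encoding(UTF-8)';
my $decoded2 = <$in2>;
chomp $decoded2;
close $in2;

# For contrast: no decoding layer, so length counts bytes.
open my $raw, '<', $path or die "open: $!";
binmode $raw;
my $bytes = <$raw>;
chomp $bytes;
close $raw;

printf "%d %d %d\n", length($decoded), length($decoded2), length($bytes);
# prints "1 1 2"
```

With either option the schwa comes back as a single character, while the undecoded read still reports two bytes.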

Here's a quick script that reads a file line-by-line and uses pack to set the UTF-8 flag on each string read in. After that flag is set, character semantics work as expected for wide characters that were read in from the file:

  use strict;
  use utf8;
  use CGI::Carp qw(fatalsToBrowser);

  print "Content-type: text/html; charset=utf-8\n\n";

  open(FILE, "<", "/path/to/utf8/file.txt") || die "Cannot open file: $!";

  print "<pre>\n";
  while(<FILE>) {
    chomp;
    $_ = set_utf($_);        # set the UTF-8 flag on the line just read
    my $len = length($_);    # count of chars not bytes
    print $_, ' ' x (72 - $len), "|\n";
  }
  print "</pre>\n";
  close(FILE);

  # Returns its arguments joined into one string with the UTF-8 flag set.
  sub set_utf {
    return pack "U0a*", join '', @_;
  }

I fashioned the script as a CGI script so that you can view the output in your browser, which understands UTF-8 characters (whereas your TTY might not). Given a UTF-8 text file with lines shorter than 72 characters, this should pad each line out to 72 characters with spaces and then append a '|'. If character semantics are not in force, length will count bytes rather than characters and the '|'s won't line up.
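
To make the alignment problem concrete, here is a small sketch of how the padding differs between byte and character semantics. Note that it swaps in decode from the core Encode module (Perl 5.8 or later, so an assumption that does not hold for 5.6.1) rather than the pack idiom above, and the sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the UTF-8 bytes for "sof" followed by U+0259 (schwa).
# That is 4 characters but 5 bytes.
my $bytes = "sof\xC9\x99";
my $chars = decode('UTF-8', $bytes);   # same text, character semantics

my $width = 72;                        # column width used in the script above

# Padding computed from each length, as the script's print statement does.
my $pad_bytes = ' ' x ($width - length($bytes));
my $pad_chars = ' ' x ($width - length($chars));

printf "%d %d\n", length($pad_bytes), length($pad_chars);
# prints "67 68"
```

The byte-counted line gets one space too few for every extra byte in a multi-byte character, which is why the '|' column drifts to the left on such lines.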