I've been trying to get UTF-8 encoded files to read in properly, and to be parsed with character semantics after loading. It seems to me that the first two printouts should be the same, but instead the string read from the file while the utf8 pragma was in scope (line 2 of the output) gets the wrong length, or so it appears.
#!perl -w
use warnings;
use strict;
{
    use utf8;
    my $string = 'ə';
    # this is a schwa in UTF-8, darned handy in linguistics
    print length $string,"\t",$string, "\n";
    my $filestring = <DATA>;
    chomp $filestring;
    print length $filestring, "\t", $filestring, "\n";
    # seems like it should print "1" here... but it prints 2!
}
{
    my $string = 'ə';
    print length $string,"\t",$string, "\n";
    my $filestring = <DATA>;
    chomp $filestring;
    print length $filestring, "\t", $filestring, "\n";
}
__DATA__
ə
ə
Note that the data did not contain funny ampersand entities, but an actual UTF-8 character (the upside-down e, U+0259 LATIN SMALL LETTER SCHWA). (Darn conversions!)
Here are the results:
1 ə
2 ə
2 ə
2 ə
It's the second line that really surprises me... shouldn't that be a '1'? The only apparent difference is that it was read off a filehandle. How can I "reset" that data to be utf8?
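For reference, this is roughly the behaviour I'm after. It is only a minimal sketch, and it assumes a newer perl (5.8 or later) with the Encode module, which as far as I can tell my 5.6.1 build doesn't have. `decode_utf8_line` is a made-up helper name, not something from the script above. The idea is that the schwa arrives from the filehandle as two raw bytes (0xC9 0x99), and decoding them should give a one-character string:

```perl
use strict;
use warnings;
use Encode qw(decode);   # assumption: perl 5.8+ where Encode is available

# Hypothetical helper: take one line of raw UTF-8 bytes, drop the newline,
# and return it as a character string.
sub decode_utf8_line {
    my ($bytes) = @_;
    chomp $bytes;
    return decode('UTF-8', $bytes);
}

my $raw     = "\xC9\x99\n";              # the two bytes of U+0259 plus a newline
my $decoded = decode_utf8_line($raw);

print length($raw) - 1, "\n";            # byte count without the newline
print length($decoded), "\n";            # character count after decoding
```

With that in place I'd expect the byte count to be 2 and the decoded length to be 1, which is the "1" I was hoping to see on the second line of my output.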
Here's my version of Perl (I used pre tags so that d/l code would work!):
C:\>perl -v
This is perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)
Copyright 1987-2001, Larry Wall
Binary build 633 provided by ActiveState Corp. http://www.ActiveState.com
Built 21:33:05 Jun 17 2002
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
Does anybody have any idea what's wrong here, or why it gets the length wrong?