Re: length() miscounting UTF8 characters?
by Jim (Curate) on Apr 28, 2014 at 02:50 UTC
Here's a Perl script that counts the number of bytes, code points, and graphemes in each UTF-8 encoded word. It also tallies the code points by Unicode blocks.
Here's the output of the script.
æ | Bytes: 2 | Code Points: 1 | Graphemes: 1 | Blocks: Latin-1 Supplement (1)
æð | Bytes: 4 | Code Points: 2 | Graphemes: 2 | Blocks: Latin-1 Supplement (2)
æða | Bytes: 5 | Code Points: 3 | Graphemes: 3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðaber | Bytes: 8 | Code Points: 6 | Graphemes: 6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðahnútur | Bytes: 12 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðakölkun | Bytes: 12 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðardúnn | Bytes: 11 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðarfugl | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðarkolla | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðarkóngur | Bytes: 13 | Code Points: 10 | Graphemes: 10 | Blocks: Basic Latin (7), Latin-1 Supplement (3)
æðarvarp | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æði | Bytes: 5 | Code Points: 3 | Graphemes: 3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðimargur | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðisgenginn | Bytes: 13 | Code Points: 11 | Graphemes: 11 | Blocks: Basic Latin (9), Latin-1 Supplement (2)
æðiskast | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðislegur | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðrast | Bytes: 8 | Code Points: 6 | Graphemes: 6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðri | Bytes: 6 | Code Points: 4 | Graphemes: 4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æðrulaus | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðruleysi | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðruorð | Bytes: 10 | Code Points: 7 | Graphemes: 7 | Blocks: Basic Latin (4), Latin-1 Supplement (3)
æðrutónn | Bytes: 11 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðstur | Bytes: 8 | Code Points: 6 | Graphemes: 6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æður | Bytes: 6 | Code Points: 4 | Graphemes: 4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æfa | Bytes: 4 | Code Points: 3 | Graphemes: 3 | Blocks: Basic Latin (2), Latin-1 Supplement (1)
UPDATE: If you add these three words to the end of the list in the __DATA__ block of the UTF-8 encoded Perl script…
한국말
piñón
piñón
…then the report will include these three lines…
한국말 | Bytes: 9 | Code Points: 3 | Graphemes: 3 | Blocks: Hangul Syllables (3)
piñón | Bytes: 7 | Code Points: 5 | Graphemes: 5 | Blocks: Basic Latin (3), Latin-1 Supplement (2)
piñón | Bytes: 9 | Code Points: 7 | Graphemes: 5 | Blocks: Basic Latin (5), Combining Diacritical Marks (2)
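Jim's full script is behind the download link; a minimal sketch of how the three counts could be computed (an assumption about his approach, not his actual code) might look like this, using Encode for the byte count and the \X regex, which matches one extended grapheme cluster:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8);

binmode STDOUT, ':encoding(UTF-8)';

# "æð" and a decomposed "piñón" (n + combining tilde, o + combining acute)
for my $word ("\x{E6}\x{F0}", "pin\x{303}o\x{301}n") {
    my $bytes     = length(encode_utf8($word)); # bytes in the UTF-8 encoding
    my $points    = length($word);              # code points
    my $graphemes = () = $word =~ /\X/g;        # extended grapheme clusters
    print "$word | Bytes: $bytes | Code Points: $points | Graphemes: $graphemes\n";
}
```

For the decomposed piñón this yields 9 bytes, 7 code points, and 5 graphemes, matching the last report line above.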
Wow, I don't know what to say, that script is extremely helpful and should come in very handy! Thanks a bunch, I really appreciate the effort you went to there.
I never expected this much useful feedback when I turned to PM at a friend's suggestion. So again, thanks to you and everyone else, I'm really impressed.
Re: length() miscounting UTF8 characters?
by choroba (Cardinal) on Apr 27, 2014 at 22:26 UTC
How are you invoking the script? It seems you are feeding it via STDIN, which is not affected by use open IO.
The following works for me (both in 5.16.2 and 5.10.1):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

binmode STDOUT, 'utf8';
binmode DATA, 'encoding(utf-8)';

while (<DATA>) {
    chomp;
    s/[A-Za-z]//g;
    say $_, ' ', length;
}
__DATA__
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa
Yes, I'm piping the textfile into the script, though that's more for convenience than anything else. It'd be easy enough to change.
I read up on the open pragma again and noticed that it can be fed another subpragma, :std, to affect the STD* streams:
The :std subpragma on its own has no effect, but if combined with the :utf8 or :encoding subpragmas, it converts the standard filehandles (STDIN, STDOUT, STDERR) to comply with the encoding selected for input/output handles. For example, if both input and output are chosen to be :encoding(utf8), a :std will mean that STDIN, STDOUT, and STDERR are also in :encoding(utf8).
So I tried changing that line to
use open IO => ':std', ':utf8';
but that didn't make a difference either. I'm probably still missing something fairly obvious.
Thanks for your help, by the way!
use open IO => ':utf8', ':std';
The order matters.
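A quick way to confirm the pragma actually took effect is to inspect the layers on STDIN itself (this sketch uses :encoding(UTF-8), the stricter variant, rather than the :utf8 from the post above; either should demonstrate the point):

```perl
use strict;
use warnings;
use open IO => ':encoding(UTF-8)', ':std';   # ':std' after the encoding, per the above

# PerlIO::get_layers reports the layers actually pushed onto a handle;
# with the pragma in effect, STDIN should list an encoding layer on top
# of the usual 'unix perlio'.
my @layers = PerlIO::get_layers(\*STDIN);
print join(' ', @layers), "\n";
```

If the encoding layer is missing from the output, the pragma (or its subpragma order) is not doing what you expect.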
Re: length() miscounting UTF8 characters?
by amon (Scribe) on Apr 27, 2014 at 22:12 UTC
I still haven't figured out how use open is supposed to work. Hypothesis: it doesn't actually apply the IO layer to your handle. We can test that by querying the IO layers.
use strict;
use warnings;
use feature 'say';
use open IO => ":utf8";

while (<>) {
    my @layers = PerlIO::get_layers(\*ARGV);
    say "(@layers)";
}
If it just outputs something like (unix perlio), it obviously didn't apply the layer.
The explicit way should work, I guess:
use strict;
use warnings;
use autodie;

binmode STDOUT, ":utf8";

for my $file (@ARGV) {
    my $fh;
    if ($file eq "-") {
        $fh = \*STDIN;
        binmode $fh, ":utf8";
    } else {
        open $fh, "<:utf8", $file;
    }
    while (<$fh>) {
        s/[[:ascii:]]//g;
        print length, " ", $_, "\n";
    }
}
Hey, thanks a lot, that second script actually works! That's the practical problem I was facing all solved, then.
I'm still curious why mine wasn't working. Your first script indeed just outputs "(unix perlio)", but I lack the knowledge to dig any deeper there.
Re: length() miscounting UTF8 characters?
by wjw (Priest) on Apr 27, 2014 at 21:49 UTC
Having no experience with this, I thought I would explore a bit. So I did the following:
#!/usr/bin/perl
use strict;
use warnings;
use open IO => ':utf8';

while (<DATA>) {
    chomp;
    (my $nonenglish = $_) =~ s/[A-Za-z]//g;
    my @chars = split(//, $nonenglish);
    my $chars = scalar(@chars);
    print scalar(@chars), " $nonenglish\n";
}
__DATA__
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa
__END__
Seems split sees those letters as two chars also, which makes sense now that I think of it. Guess I have some things to learn about UTF-8!
Thanks for the opportunity! Sorry this is not all that helpful. Suppose one could take the character count and just divide by two ...
$chars = $chars / 2;
print "$chars $nonenglish\n";
...
Update: Might also take a look at CPAN Test UTF8 and related...
...the majority is always wrong, and always the last to know about it...Insanity: Doing the same thing over and over again and expecting different results...
Yes, simply dividing by two would work here (and that's what I've been doing, mentally), but only because all the non-English characters encountered here are encoded as two bytes in UTF-8. As soon as there were 3- or 4-byte characters, it would no longer work.
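The breakdown is easy to demonstrate with a word outside Latin-1, such as the 한국말 from earlier in the thread, where each Hangul syllable takes three bytes in UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $word  = "\x{D55C}\x{AD6D}\x{B9D0}";        # 한국말, three code points
my $bytes = length(encode_utf8($word));        # 9 bytes in UTF-8
print $bytes / 2, " vs ", length($word), "\n"; # prints "4.5 vs 3"
```

Dividing the byte count by two gives 4.5, while the real character count is 3.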
Thanks for your help! I'll take a look at that module.
Re: length() miscounting UTF8 characters?
by farang (Chaplain) on Apr 28, 2014 at 06:59 UTC
Using the length function to count Unicode characters is a bug waiting to happen. It works with your dataset and will work with many others, but may fail on certain languages or with complex data. Much more robust is to use Unicode properties.
#!/usr/bin/env perl
use warnings;
use v5.14;

binmode STDOUT, 'utf8';
binmode DATA, 'encoding(utf-8)';

while (<DATA>) {
    chomp;
    print $_, ': ';
    s/[A-Za-z]//g;
    my $alphacount = () = /\p{Alpha}/g;
    say "non-[A-Za-z] symbols <$_> contain $alphacount alphabetic characters";
}
__DATA__
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æðruorð
My standard practice has become to use utf8::all to handle all streams and save me from specifying each stream encoding separately. There are probably some pitfalls in using it, but so far I haven't encountered any.
Thank you, that's very useful as well. In what sense is using length to count Unicode characters a bug waiting to happen, though? Now I'll admit I've just learned first hand that this is indeed dangerous territory to tread, but the perldoc entry for length (which I checked beforehand to make sure it wouldn't count bytes -- hence my confusion) says:
Like all Perl character operations, length() normally deals in logical characters, not physical bytes. For how many bytes a string encoded as UTF-8 would take up, use length(Encode::encode_utf8(EXPR)) (you'll have to use Encode first).
So if used right, it should work, shouldn't it? Do you have any specific languages or complex data in mind with which it might fail?
The problems with length are not about bytes vs. characters, but that length counts code points. Many logical characters are composed of multiple code points, and some logical characters have multiple representations in Unicode.
For example, consider “á” (U+00E1 latin small letter a with acute). The same logical character could also be composed of two code points: “á” (U+0061 latin small letter a, U+0301 combining acute accent). So while they produce the same visual output (the same grapheme), the strings containing them would have different lengths.
So when dealing with Unicode text, it's important to think about which length you need: byte count, codepoint count, count of graphemes (visual characters), or the actual display width (there are various characters that are not one column wide – the tab, unprintable characters, and double-width characters from East Asian scripts come to mind). The script in a previous reply takes these different counts into account.
The issue of multiple encodings for one logical character should also be kept in mind when comparing strings (testing for equality, matching, …). In general, you should normalize the input (usually the fully composed form for output, and the fully decomposed form for internal use) before trying to determine whether two strings match.
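Using Unicode::Normalize (a core module), the composed/decomposed mismatch and the normalization fix described above can be sketched like this:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E1}";        # U+00E1 latin small letter a with acute
my $decomposed = "a\x{301}";      # U+0061 + U+0301 combining acute accent

# Same grapheme, but different code point sequences:
print $composed eq $decomposed ? "eq" : "ne", "\n";           # prints "ne"
print length($composed), " vs ", length($decomposed), "\n";   # prints "1 vs 2"

# After normalizing both to the same form, they compare equal:
print NFC($composed) eq NFC($decomposed) ? "eq" : "ne", "\n"; # prints "eq"
```

The same comparison via NFD would also work; the key is normalizing both sides to one consistent form before comparing.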
the perldoc entry for length (which I checked beforehand to make sure it wouldn't count bytes -- hence my confusion)
It "normally deals in logical characters", but its logic doesn't cover all the intricacies of Unicode.
Do you have any specific languages or complex data in mind with which it might fail?
Yes, the Thai language is the main one I'm involved with. The modified script below shows that length counts diacriticals in Thai, which may or may not be what is wanted, and is inconsistent with the results for Latin diacriticals in your dataset, which length isn't counting separately. I'm using pre tags so that the Thai will display correctly, and shortened lines to facilitate copy/paste.
#!/usr/bin/env perl
use warnings;
use v5.14;
use Unicode::Normalize qw/NFD/;

binmode STDOUT, 'utf8';
binmode DATA, 'encoding(utf-8)';

while (<DATA>) {
    chomp;
    print $_, ': ';
    s/[A-Za-z]//g;
    my $alphacount = () = /\p{Alpha}/g;
    say "non-(A-Za-z) symbols <$_>",
        " contain $alphacount",
        " alphabetic characters and ",
        getdia($_), " diacritical chars.";
    say "length() thinks there are ",
        length, " characters\n";
}

sub getdia {
    my $normalized = NFD($_[0]);
    my $diacount = () = $normalized =~ /\p{Dia}/g;
    return $diacount;
}
__DATA__
เป็น
ผู้หญิง
เมื่อวันก่อน
æðaber
æðahnútur
æðakölkun
In what sense is using length to count Unicode characters a bug waiting to happen, though?
It's a "bug waiting to happen" when you try to make meaningful inferences about Unicode text by computing the size in bytes of the text in a specific Unicode character encoding scheme (e.g., UTF-8). This is what another monk was hinting at doing earlier in this thread when he or she suggested "dividing by two." That's a bug waiting to happen.
In general, when dealing with Unicode text, you're much more likely to need to know the numbers of code points in a string, or the numbers of graphemes in it ("extended grapheme clusters" in Unicode standardese). However, there are situations in which you might need to know the length in bytes of a Unicode string in some specific encoding. An example of this is needing to store character data in a database column with a capacity measured in bytes rather than in Unicode code points or graphemes. If you have a character data type column with a capacity of, say, 255 bytes, then the number of UTF-8 encoded Chinese characters you can insert into the column is likely going to be a lot fewer than the number of UTF-8 encoded Latin characters you can insert into the same column. In this case, knowing the size of the string in code points or graphemes won't help you answer the question "Will it fit?" You need the size in bytes.
Using the length function to count unicode characters is a bug waiting to happen.
Well, all perl builtins work at the codepoint level, including length. Depending on your definition of "character", that might or might not be what the OP wants.
I've attempted to implement "extended grapheme cluster" (that is, any base char + modifiers is considered a "character") logic in Perl6::Str. Feedback very welcome :-).
Yes, "extended grapheme clusters" are what I'm apparently interested in, and what I'd ordinarily call "characters", rather than codepoints.
I've not looked at Perl 6 yet, but being able to work with Unicode data from a high-level perspective, without caring too much about implementation details such as the various representation layers (the encoding layer that takes bytes to code points, and the next one that takes code points to "extended grapheme clusters"), would be a huge boon for many, including me.
Well, all perl builtins work at the codepoint level, including length. Depending on your definition of "character", that might or might not be what the OP wants.
Sure, I'm just saying that bugs or unexpected results can occur if care is not taken. As amon pointed out, the same visual representation of a character with a diacritical might have either one or two codepoints.
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
binmode STDOUT, 'utf8';
my $o_umlaut1 = "\x{F6}";
my $o_umlaut2 = "\x{6F}\x{308}";
my $string1 = "æð" . $o_umlaut1;
my $string2 = "æð" . $o_umlaut2;
say "length of $string1 is ", length($string1);
say "length of $string2 is ", length($string2);
__OUTPUT__
length of æðö is 3
length of æðö is 4
I'll play around with your module. Thai is somewhat unique in that the first combining character may be another alphabetic character, so counting extended graphemes does not necessarily give the correct count of alphabetic characters.
Re: length() miscounting UTF8 characters?
by AppleFritter (Vicar) on Apr 27, 2014 at 21:01 UTC
And apologies for accidentally posting this anonymously. Eventually I'll get the hang of this site.
Re: length() miscounting UTF8 characters?
by dave_the_m (Monsignor) on Apr 27, 2014 at 22:32 UTC
|
The open pragma is documented to affect open() and similar ops within its lexical scope, but you aren't using open;
you're using <> to read the already-opened magic ARGV filehandle.
Dave.
My tests show that ARGV is affected if you call the script with a file name parameter, but it's not affected if you use it to read data from the standard input.
Thank you for your reply, and good to know! This already came up a little further up; it turns out that there is a way to make the open pragma apply there as well, if you know the right magic incantation. :)
Re: length() miscounting UTF8 characters?
by RonW (Parson) on Apr 27, 2014 at 23:11 UTC
That looks quite useful, and I'll take a look. Thank you for the pointer, and thanks for replying!
Re: length() miscounting UTF8 characters?
by wjw (Priest) on Apr 27, 2014 at 21:09 UTC
Would be kind of handy to have a few of those words you are reading in to test against... :-)
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa
Which produces the following output:
2 æ
4 æð
4 æð
4 æð
6 æðú
6 æðö
6 æðú
4 æð
4 æð
6 æðó
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
6 æðð
6 æðó
4 æð
4 æð
2 æ