Re: length() miscounting UTF8 characters?
by Jim (Curate) on Apr 28, 2014 at 02:50 UTC
Here's a Perl script that counts the number of bytes, code points, and graphemes in each UTF-8 encoded word. It also tallies the code points by Unicode blocks.
Here's the output of the script.
æ | Bytes: 2 | Code Points: 1 | Graphemes: 1 | Blocks: Latin-1 Supplement (1)
æð | Bytes: 4 | Code Points: 2 | Graphemes: 2 | Blocks: Latin-1 Supplement (2)
æða | Bytes: 5 | Code Points: 3 | Graphemes: 3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðaber | Bytes: 8 | Code Points: 6 | Graphemes: 6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðahnútur | Bytes: 12 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðakölkun | Bytes: 12 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðardúnn | Bytes: 11 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðarfugl | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðarkolla | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðarkóngur | Bytes: 13 | Code Points: 10 | Graphemes: 10 | Blocks: Basic Latin (7), Latin-1 Supplement (3)
æðarvarp | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æði | Bytes: 5 | Code Points: 3 | Graphemes: 3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðimargur | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðisgenginn | Bytes: 13 | Code Points: 11 | Graphemes: 11 | Blocks: Basic Latin (9), Latin-1 Supplement (2)
æðiskast | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðislegur | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðrast | Bytes: 8 | Code Points: 6 | Graphemes: 6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðri | Bytes: 6 | Code Points: 4 | Graphemes: 4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æðrulaus | Bytes: 10 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðruleysi | Bytes: 11 | Code Points: 9 | Graphemes: 9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðruorð | Bytes: 10 | Code Points: 7 | Graphemes: 7 | Blocks: Basic Latin (4), Latin-1 Supplement (3)
æðrutónn | Bytes: 11 | Code Points: 8 | Graphemes: 8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðstur | Bytes: 8 | Code Points: 6 | Graphemes: 6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æður | Bytes: 6 | Code Points: 4 | Graphemes: 4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æfa | Bytes: 4 | Code Points: 3 | Graphemes: 3 | Blocks: Basic Latin (2), Latin-1 Supplement (1)
UPDATE: If you add these three words to the end of the list in the __DATA__ block of the UTF-8 encoded Perl script…
한국말
piñón
piñón
…then the report will include these three lines…
한국말 | Bytes: 9 | Code Points: 3 | Graphemes: 3 | Blocks: Hangul Syllables (3)
piñón | Bytes: 7 | Code Points: 5 | Graphemes: 5 | Blocks: Basic Latin (3), Latin-1 Supplement (2)
piñón | Bytes: 9 | Code Points: 7 | Graphemes: 5 | Blocks: Basic Latin (5), Combining Diacritical Marks (2)
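Jim's full script is behind the download link; a minimal sketch of how the three counts could be computed (an assumption about his approach, not his actual code) might look like this, using Encode for the byte count and the \X regex, which matches one extended grapheme cluster:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8);

binmode STDOUT, ':encoding(UTF-8)';

# "æð" and a decomposed "piñón" (n + combining tilde, o + combining acute)
for my $word ("\x{E6}\x{F0}", "pin\x{303}o\x{301}n") {
    my $bytes     = length(encode_utf8($word)); # bytes in the UTF-8 encoding
    my $points    = length($word);              # code points
    my $graphemes = () = $word =~ /\X/g;        # extended grapheme clusters
    print "$word | Bytes: $bytes | Code Points: $points | Graphemes: $graphemes\n";
}
```

For the decomposed piñón this yields 9 bytes, 7 code points, and 5 graphemes, matching the last report line above.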
Wow, I don't know what to say, that script is extremely helpful and should come in very handy! Thanks a bunch, I really appreciate the effort you went to there.
I never expected this much useful feedback when I turned to PM at a friend's suggestion. So again, thanks to you and everyone else, I'm really impressed.
Re: length() miscounting UTF8 characters?
by choroba (Cardinal) on Apr 27, 2014 at 22:26 UTC
How are you invoking the script? It seems you are feeding it via STDIN, which is not affected by use open IO.
The following works for me (both in 5.16.2 and 5.10.1):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

binmode STDOUT, 'utf8';
binmode DATA, 'encoding(utf-8)';

while (<DATA>) {
    chomp;
    s/[A-Za-z]//g;
    say $_, ' ', length;
}
__DATA__
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa
Yes, I'm piping the textfile into the script, though that's more for convenience than anything else. It'd be easy enough to change.
I read up on the open pragma again and noticed that it can be fed another subpragma, :std, to affect the STD* streams:
The :std subpragma on its own has no effect, but if combined with the :utf8 or :encoding subpragmas, it converts the standard filehandles (STDIN, STDOUT, STDERR) to comply with the encoding selected for input/output handles. For example, if both input and output are chosen to be :encoding(utf8), a :std will mean that STDIN, STDOUT, and STDERR are also in :encoding(utf8).
So I tried changing that line to
use open IO => ':std', ':utf8';
but that didn't make a difference either. I'm probably still missing something fairly obvious.
Thanks for your help, by the way!
use open IO => ':utf8', ':std';
The order matters.
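A quick way to confirm the pragma actually took effect is to inspect the layers on STDIN itself (this sketch uses :encoding(UTF-8), the stricter variant, rather than the :utf8 from the post above; either should demonstrate the point):

```perl
use strict;
use warnings;
use open IO => ':encoding(UTF-8)', ':std';   # ':std' after the encoding, per the above

# PerlIO::get_layers reports the layers actually pushed onto a handle;
# with the pragma in effect, STDIN should list an encoding layer on top
# of the usual 'unix perlio'.
my @layers = PerlIO::get_layers(\*STDIN);
print join(' ', @layers), "\n";
```

If the encoding layer is missing from the output, the pragma (or its subpragma order) is not doing what you expect.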
Re: length() miscounting UTF8 characters?
by amon (Scribe) on Apr 27, 2014 at 22:12 UTC
I still haven't figured out how use open is supposed to work. Hypothesis: it doesn't actually apply the IO layer to your handle. We can test that by querying the IO layers.
use strict;
use warnings;
use feature 'say';
use open IO => ":utf8";

while (<>) {
    my @layers = PerlIO::get_layers(\*ARGV);
    say "(@layers)";
}
If it just outputs something like (unix perlio), it obviously didn't apply the layer.
The explicit way should work, I guess:
use strict;
use warnings;
use autodie;

binmode STDOUT, ":utf8";

for my $file (@ARGV) {
    my $fh;
    if ($file eq "-") {
        $fh = \*STDIN;
        binmode $fh, ":utf8";
    } else {
        open $fh, "<:utf8", $file;
    }
    while (<$fh>) {
        s/[[:ascii:]]//g;
        print length, " ", $_, "\n";
    }
}
Hey, thanks a lot, that second script actually works! That's the practical problem I was facing all solved, then.
I'm still curious why mine wasn't working. Your first script indeed just outputs "(unix perlio)", but I lack the knowledge to dig any deeper there.
Re: length() miscounting UTF8 characters?
by wjw (Priest) on Apr 27, 2014 at 21:49 UTC
Having no experience with this, I thought I would explore a bit. So I did the following:
#!/usr/bin/perl
use strict;
use warnings;
use open IO => ':utf8';

while (<DATA>) {
    chomp;
    (my $nonenglish = $_) =~ s/[A-Za-z]//g;
    my @chars = split(//, $nonenglish);
    my $chars = scalar(@chars);
    print scalar(@chars), " $nonenglish\n";
}
__DATA__
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa
__END__
Seems split sees those letters as two chars also, which makes sense now that I think of it. Guess I have some things to learn about UTF-8!
Thanks for the opportunity! Sorry this is not all that helpful. Suppose one could take the character count and just divide by two ...
$chars = $chars / 2;
print "$chars $nonenglish\n";
...
Update: Might also take a look at CPAN Test UTF8 and related...
...the majority is always wrong, and always the last to know about it...Insanity: Doing the same thing over and over again and expecting different results...
Yes, simply dividing by two would work here (and that's what I've been doing, mentally), but only because all the non-English characters encountered here are encoded as two bytes in UTF-8. As soon as there were 3- or 4-byte characters, it would no longer work.
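The breakdown is easy to demonstrate with a word outside Latin-1, such as the 한국말 from earlier in the thread, where each Hangul syllable takes three bytes in UTF-8:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $word  = "\x{D55C}\x{AD6D}\x{B9D0}";        # 한국말, three code points
my $bytes = length(encode_utf8($word));        # 9 bytes in UTF-8
print $bytes / 2, " vs ", length($word), "\n"; # prints "4.5 vs 3"
```

Dividing the byte count by two gives 4.5, while the real character count is 3.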
Thanks for your help! I'll take a look at that module.
Re: length() miscounting UTF8 characters?
by farang (Chaplain) on Apr 28, 2014 at 06:59 UTC
Using the length function to count Unicode characters is a bug waiting to happen. It works with your dataset and will work with many others, but may fail on certain languages or with complex data. Much more robust is to use Unicode properties.
#!/usr/bin/env perl
use warnings;
use v5.14;

binmode STDOUT, 'utf8';
binmode DATA, 'encoding(utf-8)';

while (<DATA>) {
    chomp;
    print $_, ': ';
    s/[A-Za-z]//g;
    my $alphacount = () = /\p{Alpha}/g;
    say "non-[A-Za-z] symbols <$_> contain $alphacount alphabetic characters";
}
__DATA__
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æðruorð
My standard practice has become to use utf8::all to handle all streams and save me from specifying each stream encoding separately. There are probably some pitfalls in using it, but so far I haven't encountered any.
Thank you, that's very useful as well. In what sense is using length to count Unicode characters a bug waiting to happen, though? Now I'll admit I've just learned first hand that this is indeed dangerous territory to tread, but the perldoc entry for length (which I checked beforehand to make sure it wouldn't count bytes -- hence my confusion) says:
Like all Perl character operations, length() normally deals in logical characters, not physical bytes. For how many bytes a string encoded as UTF-8 would take up, use length(Encode::encode_utf8(EXPR)) (you'll have to use Encode first).
So if used right, it should work, shouldn't it? Do you have any specific languages or complex data in mind with which it might fail?
The problems with length are not about bytes vs. characters, but that length counts code points. Many logical characters are composed of multiple code points, and some logical characters have multiple representations in Unicode.
For example, consider “á” (U+00E1 latin small letter a with acute). The same logical character could also be composed of two code points: “á” (U+0061 latin small letter a, U+0301 combining acute accent). So while they produce the same visual output (the same grapheme), the strings containing them would have different lengths.
So when dealing with Unicode text, it's important to think about which length you need: byte count, codepoint count, count of graphemes (visual characters), or the actual display width (there are various characters that are not one column wide – the tab, unprintable characters, and double-width characters from East Asian scripts come to mind). The script in a previous reply takes these different counts into account.
The issue of multiple encodings for one logical character should also be kept in mind when comparing strings (testing for equality, matching, …). In general, you should normalize the input (usually the fully composed form for output, and the fully decomposed form for internal use) before trying to determine whether two strings match.
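Using Unicode::Normalize (a core module), the composed/decomposed mismatch and the normalization fix described above can be sketched like this:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E1}";        # U+00E1 latin small letter a with acute
my $decomposed = "a\x{301}";      # U+0061 + U+0301 combining acute accent

# Same grapheme, but different code point sequences:
print $composed eq $decomposed ? "eq" : "ne", "\n";           # prints "ne"
print length($composed), " vs ", length($decomposed), "\n";   # prints "1 vs 2"

# After normalizing both to the same form, they compare equal:
print NFC($composed) eq NFC($decomposed) ? "eq" : "ne", "\n"; # prints "eq"
```

The same comparison via NFD would also work; the key is normalizing both sides to one consistent form before comparing.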
the perldoc entry for length (which I checked beforehand to make sure it wouldn't count bytes -- hence my confusion)
It "normally deals in logical characters", but its logic doesn't cover all the intricacies of Unicode.
Do you have any specific languages or complex data in mind with which it might fail?
Yes, the Thai language is the main one I'm involved with. The modified script below shows that length counts diacriticals in Thai, which may or may not be what is wanted, and is inconsistent with the results for Latin diacriticals in your dataset, which length isn't counting separately. I'm using pre tags so that the Thai will display correctly, and shortened lines to facilitate copy/paste.
#!/usr/bin/env perl
use warnings;
use v5.14;
use Unicode::Normalize qw/NFD/;

binmode STDOUT, 'utf8';
binmode DATA, 'encoding(utf-8)';

while (<DATA>) {
    chomp;
    print $_, ': ';
    s/[A-Za-z]//g;
    my $alphacount = () = /\p{Alpha}/g;
    say "non-(A-Za-z) symbols <$_>",
        " contain $alphacount",
        " alphabetic characters and ",
        getdia($_), " diacritical chars.";
    say "length() thinks there are ",
        length, " characters\n";
}

sub getdia {
    my $normalized = NFD($_[0]);
    my $diacount = () = $normalized =~ /\p{Dia}/g;
    return $diacount;
}
__DATA__
เป็น
ผู้หญิง
เมื่อวันก่อน
æðaber
æðahnútur
æðakölkun
In what sense is using length to count Unicode characters a bug waiting to happen, though?
It's a "bug waiting to happen" when you try to make meaningful inferences about Unicode text by computing the size in bytes of the text in a specific Unicode character encoding scheme (e.g., UTF-8). This is what another monk was hinting at doing earlier in this thread when he or she suggested "dividing by two." That's a bug waiting to happen.
In general, when dealing with Unicode text, you're much more likely to need to know the numbers of code points in a string, or the numbers of graphemes in it ("extended grapheme clusters" in Unicode standardese). However, there are situations in which you might need to know the length in bytes of a Unicode string in some specific encoding. An example of this is needing to store character data in a database column with a capacity measured in bytes rather than in Unicode code points or graphemes. If you have a character data type column with a capacity of, say, 255 bytes, then the number of UTF-8 encoded Chinese characters you can insert into the column is likely going to be a lot fewer than the number of UTF-8 encoded Latin characters you can insert into the same column. In this case, knowing the size of the string in code points or graphemes won't help you answer the question "Will it fit?" You need the size in bytes.
Using the length function to count unicode characters is a bug waiting to happen.
Well, all perl builtins work at the codepoint level, including length. Depending on your definition of "character", that might or might not be what the OP wants.
I've attempted to implement "extended grapheme cluster" (that is, any base char + modifiers is considered a "character") logic in Perl6::Str. Feedback very welcome :-).
Yes, "extended grapheme clusters" are what I'm apparently interested in, and what I'd ordinarily call "characters", rather than codepoints.
I've not looked at Perl 6 yet, but being able to work with Unicode data from a high-level perspective, without caring too much about implementation details such as the various representation layers (the encoding layer that takes bytes to code points, and the next one that takes code points to "extended grapheme clusters"), would be a huge boon for many, including me.
Well, all perl builtins work at the codepoint level, including length. Depending on your definition of "character", that might or might not be what the OP wants.
Sure, I'm just saying that bugs or unexpected results can occur if care is not taken. As amon pointed out, the same visual representation of a character with a diacritical might have either one or two codepoints.
#!/usr/bin/env perl
use v5.14;
use warnings;
use utf8;
binmode STDOUT, 'utf8';
my $o_umlaut1 = "\x{F6}";
my $o_umlaut2 = "\x{6F}\x{308}";
my $string1 = "æð" . $o_umlaut1;
my $string2 = "æð" . $o_umlaut2;
say "length of $string1 is ", length($string1);
say "length of $string2 is ", length($string2);
__OUTPUT__
length of æðö is 3
length of æðö is 4
I'll play around with your module. Thai is somewhat unique in that the first combining character may be another alphabetic character, so counting extended graphemes does not necessarily give the correct count of alphabetic characters.
Re: length() miscounting UTF8 characters?
by AppleFritter (Vicar) on Apr 27, 2014 at 21:01 UTC
And apologies for accidentally posting this anonymously. Eventually I'll get the hang of this site.
Re: length() miscounting UTF8 characters?
by dave_the_m (Monsignor) on Apr 27, 2014 at 22:32 UTC
|
The open pragma is documented to affect open() and similar ops within its lexical scope, but you aren't using open;
you're using <> to read the already-opened magic ARGV filehandle.
Dave.
My tests show that ARGV is affected if you call the script with a file name parameter, but it's not affected if you use it to read data from the standard input.
Thank you for your reply, and good to know! This already came up a little further up; it turns out that there is a way to make the open pragma apply there as well, if you know the right magic incantation. :)
Re: length() miscounting UTF8 characters?
by RonW (Parson) on Apr 27, 2014 at 23:11 UTC
That looks quite useful, and I'll take a look. Thank you for the pointer, and thanks for replying!
Re: length() miscounting UTF8 characters?
by wjw (Priest) on Apr 27, 2014 at 21:09 UTC
Would be kind of handy to have a few of those words you are reading in to test against... :-)
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa
Which produces the following output:
2 æ
4 æð
4 æð
4 æð
6 æðú
6 æðö
6 æðú
4 æð
4 æð
6 æðó
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
4 æð
6 æðð
6 æðó
4 æð
4 æð
2 æ