http://qs321.pair.com?node_id=1084035

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh monks most exalted and wise, a humble novice seeks to benefit from your vast knowledge of all things Perl.

I have a text file (UTF8-encoded) containing words, one per line, and I'm interested in finding out, for each word, how many characters are in it that do not appear in the English alphabet. So I whipped up a script to find out:

#!/usr/bin/perl

use open IO => ':utf8';

while (<>) {
    chomp;
    ($nonenglish = $_) =~ s/[A-Za-z]//g;
    print length($nonenglish), " $nonenglish\n";
}

Alas, it isn't working; length appears to be counting bytes rather than characters! To wit:

$ perl nonenglish.pl
æ
2 æ
&#42853;
3 &#42853;
&#173782;
4 &#173782;
^D
$

I only occasionally use Perl, and I have no idea what I'm doing wrong. Enlighten me, kind monks, I beseech you!

My Perl is 5.14.2, BTW (and I cannot easily upgrade).

Replies are listed 'Best First'.
Re: length() miscounting UTF8 characters?
by Jim (Curate) on Apr 28, 2014 at 02:50 UTC

    Here's a Perl script that counts the number of bytes, code points, and graphemes in each UTF-8 encoded word. It also tallies the code points by Unicode blocks.
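    The script itself did not survive in this copy of the thread. What follows is a minimal sketch of a counter along these lines, using only core modules (Encode, Unicode::UCD); it is a reconstruction, not Jim's original code, and the words in __DATA__ are illustrative.

    ```perl
    #!/usr/bin/perl
    # Sketch: count bytes, code points, and graphemes per word,
    # and tally code points by Unicode block. Not the original script.
    use strict;
    use warnings;
    use utf8;
    use Encode qw(encode_utf8);
    use Unicode::UCD qw(charblock);

    binmode STDOUT, ':encoding(UTF-8)';
    binmode DATA,   ':encoding(UTF-8)';

    while (my $word = <DATA>) {
        chomp $word;
        my $bytes       = length encode_utf8($word);  # size of the UTF-8 encoding
        my $code_points = length $word;               # what length() counts
        my $graphemes   = () = $word =~ /\X/g;        # extended grapheme clusters

        my %blocks;
        $blocks{ charblock(ord $_) }++ for split //, $word;
        my $blocks = join ', ', map { "$_ ($blocks{$_})" } sort keys %blocks;

        printf "%-12s | Bytes: %2d | Code Points: %2d | Graphemes: %2d | Blocks: %s\n",
            $word, $bytes, $code_points, $graphemes, $blocks;
    }

    __DATA__
    æ
    æfa
    ```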

    Here's the output of the script.

    æ            | Bytes:  2 | Code Points:  1 | Graphemes:  1 | Blocks: Latin-1 Supplement (1)
    æð           | Bytes:  4 | Code Points:  2 | Graphemes:  2 | Blocks: Latin-1 Supplement (2)
    æða          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
    æðaber       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
    æðahnútur    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
    æðakölkun    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
    æðardúnn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
    æðarfugl     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
    æðarkolla    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
    æðarkóngur   | Bytes: 13 | Code Points: 10 | Graphemes: 10 | Blocks: Basic Latin (7), Latin-1 Supplement (3)
    æðarvarp     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
    æði          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
    æðimargur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
    æðisgenginn  | Bytes: 13 | Code Points: 11 | Graphemes: 11 | Blocks: Basic Latin (9), Latin-1 Supplement (2)
    æðiskast     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
    æðislegur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
    æðrast       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
    æðri         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
    æðrulaus     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
    æðruleysi    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
    æðruorð      | Bytes: 10 | Code Points:  7 | Graphemes:  7 | Blocks: Basic Latin (4), Latin-1 Supplement (3)
    æðrutónn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
    æðstur       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
    æður         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
    æfa          | Bytes:  4 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (2), Latin-1 Supplement (1)
    

    UPDATE:  If you add these three words to the end of the list in the __DATA__ block of the UTF-8 encoded Perl script…

    한국말
    piñón
    piñón
    

    …then the report will include these three lines…

    한국말          | Bytes:  9 | Code Points:  3 | Graphemes:  3 | Blocks: Hangul Syllables (3)
    piñón        | Bytes:  7 | Code Points:  5 | Graphemes:  5 | Blocks: Basic Latin (3), Latin-1 Supplement (2)
    piñón      | Bytes:  9 | Code Points:  7 | Graphemes:  5 | Blocks: Basic Latin (5), Combining Diacritical Marks (2)
    
      Wow, I don't know what to say, that script is extremely helpful and should come in very handy! Thanks a bunch, I really appreciate the effort you went to there. I never expected this much useful feedback when I turned to PM at a friend's suggestion. So again, thanks to you and everyone else, I'm really impressed.

        Bear in mind that the script is written using very didactic code. It's longer and more verbose than the same script would be if its main purpose wasn't to teach a lesson.

Re: length() miscounting UTF8 characters?
by choroba (Cardinal) on Apr 27, 2014 at 22:26 UTC
    How are you invoking the script? It seems you are feeding it via STDIN, which is not affected by use open IO. The following works for me (in both 5.16.2 and 5.10.1):
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    binmode STDOUT, 'utf8';
    binmode DATA, 'encoding(utf-8)';

    while (<DATA>) {
        chomp;
        s/[A-Za-z]//g;
        say $_, ' ', length;
    }

    __DATA__
    æ
    æð
    æða
    æðaber
    æðahnútur
    æðakölkun
    æðardúnn
    æðarfugl
    æðarkolla
    æðarkóngur
    æðarvarp
    æði
    æðimargur
    æðisgenginn
    æðiskast
    æðislegur
    æðrast
    æðri
    æðrulaus
    æðruleysi
    æðruorð
    æðrutónn
    æðstur
    æður
    æfa
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Yes, I'm piping the textfile into the script, though that's more for convenience than anything else. It'd be easy enough to change.

      I read up on the open pragma again and noticed that it can be fed another subpragma, :std, to affect the STD* streams:

      The :std subpragma on its own has no effect, but if combined with the :utf8 or :encoding subpragmas, it converts the standard filehandles (STDIN, STDOUT, STDERR) to comply with encoding selected for input/output handles. For example, if both input and out are chosen to be :encoding(utf8) , a :std will mean that STDIN, STDOUT, and STDERR are also in :encoding(utf8) .

      So I tried changing that line to

      use open IO => ':std', ':utf8';

      but that didn't make a difference either. I'm probably still missing something fairly obvious.

      Thanks for your help, by the way!

        You are almost there.
        use open IO => ':utf8', ':std';

        The order matters.

Re: length() miscounting UTF8 characters?
by amon (Scribe) on Apr 27, 2014 at 22:12 UTC

    I still haven't figured out how use open is supposed to work. Hypothesis: it doesn't actually apply the IO layer to your handle. We can test that by querying the IO layers.

    use feature qw( say );  # needed for say below; missing from the original snippet
    use open IO => ":utf8";

    while (<>) {
        my @layers = PerlIO::get_layers(\*ARGV);
        say "(@layers)";
    }

    If it just outputs something like (unix perlio), it obviously didn't apply the layer.

    The explicit way should work, I guess:

    use strict;
    use warnings;
    use autodie;

    binmode STDOUT, ":utf8";

    for my $file (@ARGV) {
        my $fh;
        if ($file eq "-") {
            $fh = \*STDIN;
            binmode $fh, ":utf8";
        }
        else {
            open $fh, "<:utf8", $file;
        }
        while (<$fh>) {
            s/[[:ascii:]]//g;
            print length, " ", $_, "\n";
        }
    }

      Hey, thanks a lot, that second script actually works! That's the practical problem I was facing all solved, then.

      I'm still curious why mine wasn't working. Your first script indeed just outputs "(unix perlio)", but I lack the knowledge to dig any deeper there.

Re: length() miscounting UTF8 characters?
by wjw (Priest) on Apr 27, 2014 at 21:49 UTC
    Having no experience with this, I thought I would explore a bit. So I did the following:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use open IO => ':utf8';

    while (<DATA>) {
        chomp;
        (my $nonenglish = $_) =~ s/[A-Za-z]//g;
        my @chars = split //, $nonenglish;
        my $chars = scalar(@chars);
        print scalar(@chars), " $nonenglish\n";
    }

    __DATA__
    æ
    æð
    æða
    æðaber
    æðahnútur
    æðakölkun
    æðardúnn
    æðarfugl
    æðarkolla
    æðarkóngur
    æðarvarp
    æði
    æðimargur
    æðisgenginn
    æðiskast
    æðislegur
    æðrast
    æðri
    æðrulaus
    æðruleysi
    æðruorð
    æðrutónn
    æðstur
    æður
    æfa
    __END__

    Seems split sees those letters as two chars also, which makes sense now that I think of it. Guess I have some things to learn about UTF8!
    Thanks for the opportunity! Sorry this is not all that helpful. Suppose one could take the character count and just divide by two ...
    $chars = $chars / 2;
    print "$chars $nonenglish\n";
    ...

    Update: Might also take a look at CPAN Test UTF8 and related...

    ...the majority is always wrong, and always the last to know about it...
    Insanity: Doing the same thing over and over again and expecting different results...

      Yes, simply dividing by two would work here (and that's what I've been doing, mentally), but that's only because all the non-English characters encountered here are encoded as two bytes in UTF8. As soon as there'd be 3- or 4-byte characters, it'd not work anymore.
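      To make that concrete, here is a small demonstration of why the shortcut breaks as soon as characters wider than two bytes appear (the Hangul character is illustrative sample data, not from the original word list):

      ```perl
      #!/usr/bin/perl
      # Why "divide the byte count by two" only works by accident.
      use strict;
      use warnings;
      use Encode qw(encode_utf8);

      my $two_byte   = "\x{E6}";    # æ:  2 bytes in UTF-8, so halving gives 1 (correct)
      my $three_byte = "\x{D55C}";  # 한: 3 bytes in UTF-8, so halving gives 1.5 (wrong)

      print length(encode_utf8($two_byte)),   "\n";  # prints 2
      print length(encode_utf8($three_byte)), "\n";  # prints 3
      ```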

      Thanks for your help! I'll take a look at that module.

Re: length() miscounting UTF8 characters?
by farang (Chaplain) on Apr 28, 2014 at 06:59 UTC

    Using the length function to count Unicode characters is a bug waiting to happen. It works with your dataset and will work with many others, but may fail on certain languages or with complex data. Much more robust is to use Unicode properties.

    #!/usr/bin/env perl
    use warnings;
    use v5.14;

    binmode STDOUT, 'utf8';
    binmode DATA, 'encoding(utf-8)';

    while (<DATA>) {
        chomp;
        print $_, ': ';
        s/[A-Za-z]//g;
        my $alphacount = () = /\p{Alpha}/g;
        say "non-[A-Za-z] symbols <$_> contain $alphacount alphabetic characters";
    }

    __DATA__
    æðaber
    æðahnútur
    æðakölkun
    æðardúnn
    æðarfugl
    æðarkolla
    æðarkóngur
    æðarvarp
    æðruorð
    My standard practice has become to use utf8::all to handle all streams and save me from specifying each stream encoding separately. There are probably some pitfalls in using it, but so far I haven't encountered any.

      Thank you, that's very useful as well. In what sense is using length to count Unicode characters a bug waiting to happen, though? Now I'll admit I've just learned first hand that this is indeed dangerous territory to tread, but the perldoc entry for length (which I checked beforehand to make sure it wouldn't count bytes -- hence my confusion) says:

      Like all Perl character operations, length() normally deals in logical characters, not physical bytes. For how many bytes a string encoded as UTF-8 would take up, use length(Encode::encode_utf8(EXPR)) (you'll have to use Encode first).

      So if used right, it should work, shouldn't it? Do you have any specific languages or complex data in mind with which it might fail?

        The problems with length are not around bytes vs. characters, but that length counts code points. Many logical characters are composed from multiple code points, and some logical characters have multiple representations in Unicode.

        For example, consider “á” (U+00E1 latin small letter a with acute). The same logical character could be composed of two codepoints: “á” (U+0061 latin small letter a, U+0301 combining acute accent). So while they produce the same visual output (the same grapheme), the strings containing these would have different lengths.

        So when dealing with Unicode text, it's important to think about which length you need: byte count, codepoint count, or count of graphemes (visual characters), or the actual width (there are various characters that are not one column wide – the tab, unprintable characters, and double-width characters e.g. from far-eastern scripts come to mind). The script in a previous reply takes these different counts into account.

        The issue of multiple encodings for one logical character should also be kept in mind when comparing strings (testing for equality, matching, …). In general, you should normalize the input (usually the fully composed form for output, and the fully decomposed form for internal use) before trying to determine whether two strings match.
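        A small sketch of that normalization step, using the core Unicode::Normalize module:

        ```perl
        #!/usr/bin/perl
        # Two spellings of the same logical character compare as different
        # strings until both are normalized to the same form (NFC here).
        use strict;
        use warnings;
        use Unicode::Normalize qw(NFC);

        my $composed   = "\x{E1}";    # á as a single code point (U+00E1)
        my $decomposed = "a\x{301}";  # a + U+0301 combining acute accent

        print $composed eq $decomposed ? "equal\n" : "not equal\n";            # prints "not equal"
        print NFC($composed) eq NFC($decomposed) ? "equal\n" : "not equal\n";  # prints "equal"
        ```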

        the perldoc entry for length (which I checked beforehand to make sure it wouldn't count bytes -- hence my confusion)
        It "normally deals in logical characters", but its logic doesn't cover all the intricacies of Unicode.

        Do you have any specific languages or complex data in mind with which it might fail?
        Yes, the Thai language is the main one I work with. The modified script below shows that length counts diacriticals in Thai, which may or may not be what is wanted, and is inconsistent with the results for Latin diacriticals in your dataset, which length isn't counting separately. I'm using pre tags so that the Thai will display correctly, and shortened lines to facilitate copy/paste.
        #!/usr/bin/env perl
        use warnings;
        use v5.14;
        use Unicode::Normalize qw/NFD/;
        binmode STDOUT, 'utf8';
        binmode DATA, 'encoding(utf-8)';
         
        while (<DATA>) {
            chomp;
            print $_, ': ';
            s/[A-Za-z]//g;
            my $alphacount = () = /\p{Alpha}/g;
            say "non-(A-Za-z) symbols <$_>", 
                " contain $alphacount", 
                " alphabetic characters and ",
                getdia($_), " diacritical chars.";
            say "length() thinks there are ", 
                length, " characters\n";
        }
        
        sub getdia {
            my $normalized = NFD($_[0]);
            my $diacount = () = 
                $normalized =~ /\p{Dia}/g;
            return $diacount;
        }
        
        __DATA__
        เป็น
        ผู้หญิง
        เมื่อวันก่อน
        æðaber
        æðahnútur
        æðakölkun
        

        In what sense is using length to count Unicode characters a bug waiting to happen, though?

        It's a "bug waiting to happen" when you try to make meaningful inferences about Unicode text by computing the size in bytes of the text in a specific Unicode character encoding scheme (e.g., UTF-8). This is what another monk was hinting at doing earlier in this thread when he or she suggested "dividing by two." That's a bug waiting to happen.

        In general, when dealing with Unicode text, you're much more likely to need to know the numbers of code points in a string, or the numbers of graphemes in it ("extended grapheme clusters" in Unicode standardese). However, there are situations in which you might need to know the length in bytes of a Unicode string in some specific encoding. An example of this is needing to store character data in a database column with a capacity measured in bytes rather than in Unicode code points or graphemes. If you have a character data type column with a capacity of, say, 255 bytes, then the number of UTF-8 encoded Chinese characters you can insert into the column is likely going to be a lot fewer than the number of UTF-8 encoded Latin characters you can insert into the same column. In this case, knowing the size of the string in code points or graphemes won't help you answer the question "Will it fit?" You need the size in bytes.
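        The code-point versus byte distinction described above can be checked directly with the core Encode module; the sample strings here are illustrative:

        ```perl
        #!/usr/bin/perl
        # Code points vs. UTF-8 bytes: the "will it fit in N bytes?" question
        # needs the encoded length, not the code point count.
        use strict;
        use warnings;
        use Encode qw(encode_utf8);

        my $latin = "pi\x{F1}\x{F3}n";   # piñón: mostly one-byte characters
        my $cjk   = "\x{4E2D}\x{6587}";  # 中文: three bytes per character in UTF-8

        print length($latin),              "\n";  # prints 5 (code points)
        print length(encode_utf8($latin)), "\n";  # prints 7 (bytes)
        print length($cjk),                "\n";  # prints 2 (code points)
        print length(encode_utf8($cjk)),   "\n";  # prints 6 (bytes)
        ```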

      Using the length function to count unicode characters is a bug waiting to happen.

      Well, all perl builtins work at the codepoint level, including length. Depending on your definition of "character", that might or might not be what the OP wants.

      I've attempted to implement "extended grapheme cluster" (that is, any base char + modifiers is considered a "character") logic in Perl6::Str. Feedback very welcome :-).

        Yes, "extended grapheme clusters" are what I'm apparently interested in, and what I'd ordinarily call "characters", rather than codepoints.

        I've not looked at Perl 6 yet, but being able to work with Unicode data from a high-level perspective, without caring too much about implementation details such as the various representation layers (the encoding layer that takes bytes to codepoints, and then the next one that takes codepoints to "extended grapheme clusters"), would be a huge boon for many, including me.

        Well, all perl builtins work at the codepoint level, including length. Depending on your definition of "character", that might or might not be what the OP wants.
        Sure, I'm just saying that bugs or unexpected results can occur if care is not taken. As amon pointed out, the same visual representation of a character with a diacritical might have either one or two codepoints.
        #!/usr/bin/env perl
        use v5.14;
        use warnings;
        use utf8;

        binmode STDOUT, 'utf8';

        my $o_umlaut1 = "\x{F6}";
        my $o_umlaut2 = "\x{6F}\x{308}";
        my $string1 = "æð" . $o_umlaut1;
        my $string2 = "æð" . $o_umlaut2;

        say "length of $string1 is ", length($string1);
        say "length of $string2 is ", length($string2);
        __OUTPUT__
        length of æðö is 3
        length of æðö is 4
        

        I'll play around with your module. Thai is somewhat unique in that the first combining character may be another alphabetic character, so counting extended graphemes does not necessarily give the correct count of alphabetic characters.

Re: length() miscounting UTF8 characters?
by AppleFritter (Vicar) on Apr 27, 2014 at 21:01 UTC
    And apologies for accidentally posting this anonymously. Eventually I'll get the hang of this site.
Re: length() miscounting UTF8 characters?
by dave_the_m (Monsignor) on Apr 27, 2014 at 22:32 UTC
    The open pragma is documented to affect open() and similar ops within its lexical scope, but you aren't using open; you're using <> to read the already-opened magic ARGV filehandle.

    Dave.

      My tests show that ARGV is affected if you call the script with a file name parameter, but it's not affected if you use it to read data from the standard input.

      Thank you for your reply, and good to know! This already came up a little further up; it turns out that there is a way to make the open pragma apply there as well, if you know the right magic incantation. :)

Re: length() miscounting UTF8 characters?
by RonW (Parson) on Apr 27, 2014 at 23:11 UTC
      That looks quite useful, and I'll take a look. Thank you for the pointer, and thanks for replying!
Re: length() miscounting UTF8 characters?
by wjw (Priest) on Apr 27, 2014 at 21:09 UTC
    Would be kind of handy to have a few of those words you are reading in to test against... :-)


      Certainly! Here's an excerpt:

      æ
      æð
      æða
      æðaber
      æðahnútur
      æðakölkun
      æðardúnn
      æðarfugl
      æðarkolla
      æðarkóngur
      æðarvarp
      æði
      æðimargur
      æðisgenginn
      æðiskast
      æðislegur
      æðrast
      æðri
      æðrulaus
      æðruleysi
      æðruorð
      æðrutónn
      æðstur
      æður
      æfa

      Which produces the following output:

      2 æ
      4 æð
      4 æð
      4 æð
      6 æðú
      6 æðö
      6 æðú
      4 æð
      4 æð
      6 æðó
      4 æð
      4 æð
      4 æð
      4 æð
      4 æð
      4 æð
      4 æð
      4 æð
      4 æð
      4 æð
      6 æðð
      6 æðó
      4 æð
      4 æð
      2 æ