http://qs321.pair.com?node_id=1084067


in reply to length() miscounting UTF8 characters?

Here's a Perl script that counts the number of bytes, code points, and graphemes in each UTF-8 encoded word. It also tallies the code points by Unicode blocks.

#!perl

use v5.14;
use strict;
use warnings;
use utf8;

use Encode qw( encode_utf8 );
use Unicode::UCD qw( charblock );

binmode STDOUT, ':encoding(UTF-8)';

while (my $word = <DATA>) {
    chomp $word;

    my $length_in_bytes       = length_in_bytes($word);
    my $length_in_code_points = length_in_code_points($word);
    my $length_in_graphemes   = length_in_graphemes($word);
    my $code_points_in_blocks = code_points_in_blocks($word);

    printf "%-12s | Bytes: %2d | Code Points: %2d | Graphemes: %2d | Blocks: %s\n",
        $word,
        $length_in_bytes,
        $length_in_code_points,
        $length_in_graphemes,
        $code_points_in_blocks;
}

exit 0;

sub length_in_bytes {
    my $word = shift;

    my $length = length encode_utf8($word);

    return $length;
}

sub length_in_code_points {
    my $word = shift;

    my $length = length $word;

    return $length;
}

sub length_in_graphemes {
    my $word = shift;

    my $length = () = $word =~ m/\X/g;

    return $length;
}

sub code_points_in_blocks {
    my $word = shift;

    my %total_code_points_by;
    my $blocks = '';

    for my $character (split m//, $word) {
        my $block = charblock(ord $character);

        $total_code_points_by{$block}++;
    }

    for my $block (sort keys %total_code_points_by) {
        my $total = $total_code_points_by{$block};

        $blocks .= sprintf "%s%s (%d)",
                   (length $blocks ? ', ' : ''), $block, $total;
    }

    return $blocks;
}

__DATA__
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa

Here's the output of the script.

æ            | Bytes:  2 | Code Points:  1 | Graphemes:  1 | Blocks: Latin-1 Supplement (1)
æð           | Bytes:  4 | Code Points:  2 | Graphemes:  2 | Blocks: Latin-1 Supplement (2)
æða          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðaber       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðahnútur    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðakölkun    | Bytes: 12 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (6), Latin-1 Supplement (3)
æðardúnn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðarfugl     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðarkolla    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðarkóngur   | Bytes: 13 | Code Points: 10 | Graphemes: 10 | Blocks: Basic Latin (7), Latin-1 Supplement (3)
æðarvarp     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æði          | Bytes:  5 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (1), Latin-1 Supplement (2)
æðimargur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðisgenginn  | Bytes: 13 | Code Points: 11 | Graphemes: 11 | Blocks: Basic Latin (9), Latin-1 Supplement (2)
æðiskast     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðislegur    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðrast       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æðri         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æðrulaus     | Bytes: 10 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (6), Latin-1 Supplement (2)
æðruleysi    | Bytes: 11 | Code Points:  9 | Graphemes:  9 | Blocks: Basic Latin (7), Latin-1 Supplement (2)
æðruorð      | Bytes: 10 | Code Points:  7 | Graphemes:  7 | Blocks: Basic Latin (4), Latin-1 Supplement (3)
æðrutónn     | Bytes: 11 | Code Points:  8 | Graphemes:  8 | Blocks: Basic Latin (5), Latin-1 Supplement (3)
æðstur       | Bytes:  8 | Code Points:  6 | Graphemes:  6 | Blocks: Basic Latin (4), Latin-1 Supplement (2)
æður         | Bytes:  6 | Code Points:  4 | Graphemes:  4 | Blocks: Basic Latin (2), Latin-1 Supplement (2)
æfa          | Bytes:  4 | Code Points:  3 | Graphemes:  3 | Blocks: Basic Latin (2), Latin-1 Supplement (1)

UPDATE:  If you add these three words to the end of the list in the __DATA__ block of the the UTF-8 encoded Perl script…

한국말
piñón
piñón

…then the report will include these three lines…

한국말          | Bytes:  9 | Code Points:  3 | Graphemes:  3 | Blocks: Hangul Syllables (3)
piñón        | Bytes:  7 | Code Points:  5 | Graphemes:  5 | Blocks: Basic Latin (3), Latin-1 Supplement (2)
piñón      | Bytes:  9 | Code Points:  7 | Graphemes:  5 | Blocks: Basic Latin (5), Combining Diacritical Marks (2)

Replies are listed 'Best First'.
Re^2: length() miscounting UTF8 characters?
by AppleFritter (Vicar) on Apr 28, 2014 at 09:37 UTC
    Wow, I don't know what to say, that script is extremely helpful and should come in very handy! Thanks a bunch, I really appreciate the effort you went to there. I never expected this much useful feedback when I turned to PM at a friend's suggestion. So again, thanks to you and everyone else, I'm really impressed.

      Bear in mind that the script is written using very didactic code. It's longer and more verbose than the same script would be if its main purpose wasn't to teach a lesson.