http://qs321.pair.com?node_id=871173


in reply to Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
in thread Unicode: Perl5 equivalent to Perl6's @string.graphemes?

I can't speak as to whether the OP will encounter characters in their decomposed forms or not, but about 40% of both Hiragana and Katakana have multi-code point decomposed forms. "ば" (U+3070, HIRAGANA LETTER BA) can be written as "は" (U+306F, HIRAGANA LETTER HA) plus combining "゛" (U+3099, COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).
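A minimal sketch of that decomposition, assuming the core Unicode::Normalize module is available (the variable names are only illustrative). It decomposes U+3070 with NFD, lists the resulting code points, and checks that NFC recomposes them:

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

# HIRAGANA LETTER BA in its precomposed form, written as an escape so the
# example does not depend on the encoding of the source file.
my $ba = "\x{3070}";

# Canonical decomposition: HA followed by the combining voiced sound mark.
my $decomposed  = NFD($ba);
my @code_points = map { sprintf 'U+%04X', ord } split //, $decomposed;

print "@code_points\n";    # prints "U+306F U+3099"

# Recomposing with NFC gives back the single precomposed code point.
my $round_trips = NFC($decomposed) eq $ba ? 1 : 0;
print "round-trips: $round_trips\n";    # prints "round-trips: 1"
```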

Re^3: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by Jim (Curate) on Nov 13, 2010 at 04:05 UTC

    Contents of a Unicode (UTF-8) text file named DriedMangos.txt:

    dried mangos
    mangues séchées
    芒果幹
    doraido mangōsu
    ドライドマンゴス
    ドライドマンゴス
    ト"ライト"マンコ"ス

    Perl script to demonstrate matching Unicode grapheme clusters using the regular expression backslash sequence \X:

    #!perl

    use strict;
    use warnings;
    use autodie;

    open my $input_fh,  '<:encoding(UTF-8)', 'DriedMangos.txt';
    open my $output_fh, '>:encoding(UTF-8)', 'Graphemes.txt';

    while (my $line = <$input_fh>) {
        chomp $line;

        while ($line =~ m/(\X)/g) {
            print $output_fh "[$1]";
        }

        print $output_fh "\n";
    }

    close $input_fh;
    close $output_fh;

    Contents of the output text file named Graphemes.txt:

    [d][r][i][e][d][ ][m][a][n][g][o][s]
    [m][a][n][g][u][e][s][ ][s][é][c][h][é][e][s]
    [芒][果][幹]
    [d][o][r][a][i][d][o][ ][m][a][n][g][ō][s][u]
    [ド][ラ][イ][ド][マ][ン][ゴ][ス]
    [ド][ラ][イ][ド][マ][ン][ゴ][ス]
    [ト]["][ラ][イ][ト]["][マ][ン][コ]["][ス]

    (See http://ameblo.jp/gucciman-ikkob/entry-10317490092.html for an explanation of the peculiar last line of the file named DriedMangos.txt.)

      (See http://ameblo.jp/gucciman-ikkob/entry-10317490092.html for an explanation of the peculiar last line of the file named DriedMangos.txt.)

      Presumably, it says that plain double-quotes were used instead of a combining mark. As such, q{ コ" } is really two graphemes even though it approximates one (q{ ゴ }).
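A small sketch of that distinction, counting grapheme clusters with \X (the helper name count_graphemes is made up for this example). It compares the precomposed letter, the letter plus the combining mark, and the letter plus a plain quotation mark:

```perl
use strict;
use warnings;

# Hypothetical helper: count the grapheme clusters matched by \X.
sub count_graphemes {
    my ($string) = @_;
    my $count = () = $string =~ m/\X/g;
    return $count;
}

my $precomposed = "\x{30B4}";            # KATAKANA LETTER GO
my $combining   = "\x{30B3}\x{3099}";    # KO + combining voiced sound mark
my $quoted      = qq(\x{30B3}");         # KO + U+0022 QUOTATION MARK

printf "%d %d %d\n",
    count_graphemes($precomposed),
    count_graphemes($combining),
    count_graphemes($quoted);            # prints "1 1 2"
```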

        Presumably, it says that plain double-quotes were used instead of a combining mark.

        I don't read Japanese, but I presume it says something along the lines of, "Ha! Look at what some goofy print pressman in the Philippines did!"

Re^3: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by Jim (Curate) on Nov 13, 2010 at 05:48 UTC

    Perl script to display the contents of the UTF-8 text file named DriedMangos.txt as a list of Unicode code points and character names:

    #!perl

    use strict;
    use warnings;
    use autodie;

    use Unicode::UCD qw( charinfo );

    open my $input_fh, '<:encoding(UTF-8)', 'DriedMangos.txt';

    while (my $line = <$input_fh>) {
        chomp $line;

        while ($line =~ m/(.)/g) {
            my $character = $1;
            my $codepoint = ord $character;
            my $charinfo  = charinfo($codepoint);
            my $code      = "U+$charinfo->{'code'}";
            my $name      = $charinfo->{'name'};

            print "$code $name\n";
        }

        print "\n";
    }

    close $input_fh;

    The output of the script:

    U+0064 LATIN SMALL LETTER D
    U+0072 LATIN SMALL LETTER R
    U+0069 LATIN SMALL LETTER I
    U+0065 LATIN SMALL LETTER E
    U+0064 LATIN SMALL LETTER D
    U+0020 SPACE
    U+006D LATIN SMALL LETTER M
    U+0061 LATIN SMALL LETTER A
    U+006E LATIN SMALL LETTER N
    U+0067 LATIN SMALL LETTER G
    U+006F LATIN SMALL LETTER O
    U+0073 LATIN SMALL LETTER S

    U+006D LATIN SMALL LETTER M
    U+0061 LATIN SMALL LETTER A
    U+006E LATIN SMALL LETTER N
    U+0067 LATIN SMALL LETTER G
    U+0075 LATIN SMALL LETTER U
    U+0065 LATIN SMALL LETTER E
    U+0073 LATIN SMALL LETTER S
    U+0020 SPACE
    U+0073 LATIN SMALL LETTER S
    U+0065 LATIN SMALL LETTER E
    U+0301 COMBINING ACUTE ACCENT
    U+0063 LATIN SMALL LETTER C
    U+0068 LATIN SMALL LETTER H
    U+0065 LATIN SMALL LETTER E
    U+0301 COMBINING ACUTE ACCENT
    U+0065 LATIN SMALL LETTER E
    U+0073 LATIN SMALL LETTER S

    U+8292 CJK UNIFIED IDEOGRAPH-8292
    U+679C CJK UNIFIED IDEOGRAPH-679C
    U+5E79 CJK UNIFIED IDEOGRAPH-5E79

    U+0064 LATIN SMALL LETTER D
    U+006F LATIN SMALL LETTER O
    U+0072 LATIN SMALL LETTER R
    U+0061 LATIN SMALL LETTER A
    U+0069 LATIN SMALL LETTER I
    U+0064 LATIN SMALL LETTER D
    U+006F LATIN SMALL LETTER O
    U+0020 SPACE
    U+006D LATIN SMALL LETTER M
    U+0061 LATIN SMALL LETTER A
    U+006E LATIN SMALL LETTER N
    U+0067 LATIN SMALL LETTER G
    U+006F LATIN SMALL LETTER O
    U+0304 COMBINING MACRON
    U+0073 LATIN SMALL LETTER S
    U+0075 LATIN SMALL LETTER U

    U+30C9 KATAKANA LETTER DO
    U+30E9 KATAKANA LETTER RA
    U+30A4 KATAKANA LETTER I
    U+30C9 KATAKANA LETTER DO
    U+30DE KATAKANA LETTER MA
    U+30F3 KATAKANA LETTER N
    U+30B4 KATAKANA LETTER GO
    U+30B9 KATAKANA LETTER SU

    U+30C8 KATAKANA LETTER TO
    U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
    U+30E9 KATAKANA LETTER RA
    U+30A4 KATAKANA LETTER I
    U+30C8 KATAKANA LETTER TO
    U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
    U+30DE KATAKANA LETTER MA
    U+30F3 KATAKANA LETTER N
    U+30B3 KATAKANA LETTER KO
    U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
    U+30B9 KATAKANA LETTER SU

    U+30C8 KATAKANA LETTER TO
    U+0022 QUOTATION MARK
    U+30E9 KATAKANA LETTER RA
    U+30A4 KATAKANA LETTER I
    U+30C8 KATAKANA LETTER TO
    U+0022 QUOTATION MARK
    U+30DE KATAKANA LETTER MA
    U+30F3 KATAKANA LETTER N
    U+30B3 KATAKANA LETTER KO
    U+0022 QUOTATION MARK
    U+30B9 KATAKANA LETTER SU

    The Latin characters with diacritics are in Unicode Normalization Form D (NFD). The katakana characters on the fifth line are in Unicode Normalization Form C (NFC). The same katakana characters on the sixth line are in NFD.
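One practical consequence, sketched here under the assumption that Unicode::Normalize is installed: the NFC and NFD katakana lines look identical but do not compare equal with eq until both are normalized to the same form. The line with plain quotation marks never matches, because U+0022 is not a combining mark. The variable names below are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# The three katakana lines of DriedMangos.txt, written as escapes so the
# example does not depend on the encoding of the source file.
my $line5 = "\x{30C9}\x{30E9}\x{30A4}\x{30C9}\x{30DE}\x{30F3}\x{30B4}\x{30B9}";    # NFC
my $line6 = "\x{30C8}\x{3099}\x{30E9}\x{30A4}\x{30C8}\x{3099}"
          . "\x{30DE}\x{30F3}\x{30B3}\x{3099}\x{30B9}";                            # NFD
my $line7 = qq(\x{30C8}"\x{30E9}\x{30A4}\x{30C8}"\x{30DE}\x{30F3}\x{30B3}"\x{30B9});

my $raw_equal    = $line5 eq $line6           ? 1 : 0;
my $nfc_equal    = NFC($line5) eq NFC($line6) ? 1 : 0;
my $quoted_equal = NFC($line5) eq NFC($line7) ? 1 : 0;

# prints "raw: 0, normalized: 1, quotation marks: 0"
print "raw: $raw_equal, normalized: $nfc_equal, quotation marks: $quoted_equal\n";
```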