Re^2: Unicode: Perl5 equivalent to Perl6's @string.graphemes?

in reply to Re: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
in thread Unicode: Perl5 equivalent to Perl6's @string.graphemes?

I can't speak to whether the OP will encounter characters in their decomposed forms, but about 40% of both Hiragana and Katakana have multi-code point decomposed forms. For example, "ば" (U+3070, HIRAGANA LETTER BA) can be written as "は" (U+306F, HIRAGANA LETTER HA) plus combining "゛" (U+3099, COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).
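
For anyone who wants to check that equivalence, here is a minimal sketch using the core Unicode::Normalize module. The helper name code_points is my own invention for illustration; the strings are written as \x{...} escapes so the script does not depend on the source file's encoding.

```perl
#!perl

use strict;
use warnings;

use Unicode::Normalize qw( NFC NFD );

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord $_ } split //, $string;
}

# Precomposed HIRAGANA LETTER BA.
my $composed = "\x{3070}";

# HIRAGANA LETTER HA followed by COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK.
my $decomposed = "\x{306F}\x{3099}";

my $nfd = NFD($composed);      # canonical decomposition
my $nfc = NFC($decomposed);    # canonical composition

print 'NFD of U+3070:        ', code_points($nfd), "\n";   # U+306F U+3099
print 'NFC of U+306F U+3099: ', code_points($nfc), "\n";   # U+3070
```

The two forms are canonically equivalent, so normalizing both sides to the same form before comparing strings is usually the safe move.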

Re^3: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by Jim (Curate) on Nov 13, 2010 at 04:05 UTC

Contents of a Unicode (UTF-8) text file named DriedMangos.txt:

	dried mangos
	mangues séchées
	芒果幹
	doraido mangōsu
	ドライドマンゴス
	ドライドマンゴス
	ト"ライト"マンコ"ス

Perl script to demonstrate matching Unicode grapheme clusters using the regular expression backslash sequence \X:

#!perl

use strict;
use warnings;
use autodie;

open my $input_fh,  '<:encoding(UTF-8)', 'DriedMangos.txt';
open my $output_fh, '>:encoding(UTF-8)', 'Graphemes.txt';

while (my $line = <$input_fh>) {
    chomp $line;

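    # \X matches one extended grapheme cluster: a base character plus any
    # combining marks that follow it.
    while ($line =~ m/(\X)/g) {
        print $output_fh "[$1]";
    }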

    print $output_fh "\n";
}

close $input_fh;
close $output_fh;

Contents of the output text file named Graphemes.txt:

	[d][r][i][e][d][ ][m][a][n][g][o][s]
	[m][a][n][g][u][e][s][ ][s][é][c][h][é][e][s]
	[芒][果][幹]
	[d][o][r][a][i][d][o][ ][m][a][n][g][ō][s][u]
	[ド][ラ][イ][ド][マ][ン][ゴ][ス]
	[ド][ラ][イ][ド][マ][ン][ゴ][ス]
	[ト]["][ラ][イ][ト]["][マ][ン][コ]["][ス]

(See http://ameblo.jp/gucciman-ikkob/entry-10317490092.html for an explanation of the peculiar last line of the file named DriedMangos.txt.)

Re^4: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by ikegami (Patriarch) on Nov 13, 2010 at 05:44 UTC

(See http://ameblo.jp/gucciman-ikkob/entry-10317490092.html for an explanation of the peculiar last line of the file named DriedMangos.txt.)

Presumably, it says that plain double-quotes were used instead of a combining mark. As such, q{ コ" } is really two graphemes even though it approximates one (q{ ゴ }).
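
A quick way to confirm this is to count \X matches for each spelling. This is only a sketch; grapheme_count is a hypothetical helper name, not something from the scripts in this thread.

```perl
#!perl

use strict;
use warnings;

# Hypothetical helper: count the grapheme clusters in a string by
# collecting every \X match in list context.
sub grapheme_count {
    my ($string) = @_;
    my $count = () = $string =~ m/\X/g;
    return $count;
}

my $precomposed = "\x{30B4}";           # KATAKANA LETTER GO
my $with_mark   = "\x{30B3}\x{3099}";   # KO + COMBINING VOICED SOUND MARK
my $with_quote  = "\x{30B3}\x{0022}";   # KO + QUOTATION MARK

printf "%d %d %d\n",
    grapheme_count($precomposed),
    grapheme_count($with_mark),
    grapheme_count($with_quote);        # prints "1 1 2"
```

The quotation mark is not a combining character, so it never attaches to the preceding kana, no matter how similar the two look in print.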

Re^5: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by Jim (Curate) on Nov 13, 2010 at 06:02 UTC

Presumably, it says that plain double-quotes were used instead of a combining mark.

I don't read Japanese, but I presume it says something along the lines of, "Ha! Look at what some goofy print pressman in the Philippines did!"

Re^3: Unicode: Perl5 equivalent to Perl6's @string.graphemes?
by Jim (Curate) on Nov 13, 2010 at 05:48 UTC

Perl script to display the contents of the UTF-8 text file named DriedMangos.txt as a list of Unicode code points and character names:

#!perl

use strict;
use warnings;
use autodie;

use Unicode::UCD qw( charinfo );

open my $input_fh, '<:encoding(UTF-8)', 'DriedMangos.txt';

while (my $line = <$input_fh>) {
    chomp $line;

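    # Unlike \X, a bare . matches a single code point, so each combining
    # mark is listed on its own line.
    while ($line =~ m/(.)/g) {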
        my $character = $1;
        my $codepoint = ord $character;
        my $charinfo  = charinfo($codepoint);

        my $code = "U+$charinfo->{'code'}";
        my $name = $charinfo->{'name'};

        print "$code $name\n";
    }

    print "\n";
}

close $input_fh;

The output of the script:

U+0064 LATIN SMALL LETTER D
U+0072 LATIN SMALL LETTER R
U+0069 LATIN SMALL LETTER I
U+0065 LATIN SMALL LETTER E
U+0064 LATIN SMALL LETTER D
U+0020 SPACE
U+006D LATIN SMALL LETTER M
U+0061 LATIN SMALL LETTER A
U+006E LATIN SMALL LETTER N
U+0067 LATIN SMALL LETTER G
U+006F LATIN SMALL LETTER O
U+0073 LATIN SMALL LETTER S

U+006D LATIN SMALL LETTER M
U+0061 LATIN SMALL LETTER A
U+006E LATIN SMALL LETTER N
U+0067 LATIN SMALL LETTER G
U+0075 LATIN SMALL LETTER U
U+0065 LATIN SMALL LETTER E
U+0073 LATIN SMALL LETTER S
U+0020 SPACE
U+0073 LATIN SMALL LETTER S
U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT
U+0063 LATIN SMALL LETTER C
U+0068 LATIN SMALL LETTER H
U+0065 LATIN SMALL LETTER E
U+0301 COMBINING ACUTE ACCENT
U+0065 LATIN SMALL LETTER E
U+0073 LATIN SMALL LETTER S

U+8292 CJK UNIFIED IDEOGRAPH-8292
U+679C CJK UNIFIED IDEOGRAPH-679C
U+5E79 CJK UNIFIED IDEOGRAPH-5E79

U+0064 LATIN SMALL LETTER D
U+006F LATIN SMALL LETTER O
U+0072 LATIN SMALL LETTER R
U+0061 LATIN SMALL LETTER A
U+0069 LATIN SMALL LETTER I
U+0064 LATIN SMALL LETTER D
U+006F LATIN SMALL LETTER O
U+0020 SPACE
U+006D LATIN SMALL LETTER M
U+0061 LATIN SMALL LETTER A
U+006E LATIN SMALL LETTER N
U+0067 LATIN SMALL LETTER G
U+006F LATIN SMALL LETTER O
U+0304 COMBINING MACRON
U+0073 LATIN SMALL LETTER S
U+0075 LATIN SMALL LETTER U

U+30C9 KATAKANA LETTER DO
U+30E9 KATAKANA LETTER RA
U+30A4 KATAKANA LETTER I
U+30C9 KATAKANA LETTER DO
U+30DE KATAKANA LETTER MA
U+30F3 KATAKANA LETTER N
U+30B4 KATAKANA LETTER GO
U+30B9 KATAKANA LETTER SU

U+30C8 KATAKANA LETTER TO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+30E9 KATAKANA LETTER RA
U+30A4 KATAKANA LETTER I
U+30C8 KATAKANA LETTER TO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+30DE KATAKANA LETTER MA
U+30F3 KATAKANA LETTER N
U+30B3 KATAKANA LETTER KO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+30B9 KATAKANA LETTER SU

U+30C8 KATAKANA LETTER TO
U+0022 QUOTATION MARK
U+30E9 KATAKANA LETTER RA
U+30A4 KATAKANA LETTER I
U+30C8 KATAKANA LETTER TO
U+0022 QUOTATION MARK
U+30DE KATAKANA LETTER MA
U+30F3 KATAKANA LETTER N
U+30B3 KATAKANA LETTER KO
U+0022 QUOTATION MARK
U+30B9 KATAKANA LETTER SU

The Latin characters with diacritics are in Unicode Normalization Form D (NFD). The katakana characters on the fifth line are in Unicode Normalization Form C (NFC). The same katakana characters on the sixth line are in NFD.
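
As a rough check of those claims, the katakana lines can be rebuilt from the code points listed above and compared against their normalized forms with Unicode::Normalize. The variable names here are mine, chosen for illustration.

```perl
#!perl

use strict;
use warnings;

use Unicode::Normalize qw( NFC NFD );

# Fifth line: precomposed katakana (DO RA I DO MA N GO SU).
my $line_five = "\x{30C9}\x{30E9}\x{30A4}\x{30C9}\x{30DE}\x{30F3}\x{30B4}\x{30B9}";

# Sixth line: TO, TO and KO each followed by U+3099.
my $line_six = "\x{30C8}\x{3099}\x{30E9}\x{30A4}\x{30C8}\x{3099}"
             . "\x{30DE}\x{30F3}\x{30B3}\x{3099}\x{30B9}";

# Seventh line: the same base letters followed by U+0022 QUOTATION MARK.
my $line_seven = "\x{30C8}\x{0022}\x{30E9}\x{30A4}\x{30C8}\x{0022}"
               . "\x{30DE}\x{30F3}\x{30B3}\x{0022}\x{30B9}";

# A string is in a normalization form if normalizing it changes nothing.
my $five_is_nfc     = $line_five eq NFC($line_five);
my $six_is_nfd      = $line_six  eq NFD($line_six);
my $six_equals_five = NFC($line_six) eq $line_five;
my $seven_unchanged = NFC($line_seven) eq $line_seven;

print 'Line 5 is NFC:              ', ($five_is_nfc     ? 'yes' : 'no'), "\n";   # yes
print 'Line 6 is NFD:              ', ($six_is_nfd      ? 'yes' : 'no'), "\n";   # yes
print 'NFC(line 6) equals line 5:  ', ($six_equals_five ? 'yes' : 'no'), "\n";   # yes
print 'Line 7 unchanged by NFC:    ', ($seven_unchanged ? 'yes' : 'no'), "\n";   # yes
```

Normalization can reconcile the fifth and sixth lines, but it leaves the seventh alone, because a quotation mark has no canonical relationship to the voiced sound mark.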
