I'm trying to get my head around some Unicode issues:
I want to compare text strings at the grapheme level, and I thought the way to do it was to use one of the normalization forms provided by Unicode::Normalize.
I tried all four normalization forms, plus reorder, but two representations of the same grapheme (an "a" with the same two combining marks in different order) are never converted to the same form.
#!/usr/bin/perl
use strict;
use warnings;
use charnames qw(:full);
use Unicode::Normalize qw(NFKD NFD NFKC NFC reorder);
my $str1 = "\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}\N{COMBINING GRAVE ACCENT}";
my $str2 = "\N{LATIN SMALL LETTER A}\N{COMBINING GRAVE ACCENT}\N{COMBINING ACUTE ACCENT}";
binmode STDOUT, ':utf8';
print "success\n" if $str1 eq $str2;
print "NFKD:\n";
dump_charnames(NFKD($str1));
dump_charnames(NFKD($str2));
print "success\n" if NFKD($str1) eq NFKD($str2);
print "NFD:\n";
dump_charnames(NFD($str1));
dump_charnames(NFD($str2));
print "success\n" if NFD($str1) eq NFD($str2);
print "NFC:\n";
dump_charnames(NFC($str1));
dump_charnames(NFC($str2));
print "success\n" if NFC($str1) eq NFC($str2);
print "NFKC:\n";
dump_charnames(NFKC($str1));
dump_charnames(NFKC($str2));
print "success\n" if NFKC($str1) eq NFKC($str2);
print "reorder:\n";
dump_charnames(reorder($str1));
dump_charnames(reorder($str2));
print "success\n" if reorder($str1) eq reorder($str2);
sub dump_charnames {
    my $str = $_[0];
    print map { '\N{' . charnames::viacode(ord $_) . '}' }
        split m//, $str;
    print "\n";
}
__END__
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}\N{COMBINING GRAVE ACCENT}
\N{LATIN SMALL LETTER A}\N{COMBINING GRAVE ACCENT}\N{COMBINING ACUTE ACCENT}
NFD:
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}\N{COMBINING GRAVE ACCENT}
\N{LATIN SMALL LETTER A}\N{COMBINING GRAVE ACCENT}\N{COMBINING ACUTE ACCENT}
NFC:
\N{LATIN SMALL LETTER A WITH ACUTE}\N{COMBINING GRAVE ACCENT}
\N{LATIN SMALL LETTER A WITH GRAVE}\N{COMBINING ACUTE ACCENT}
NFKC:
\N{LATIN SMALL LETTER A WITH ACUTE}\N{COMBINING GRAVE ACCENT}
\N{LATIN SMALL LETTER A WITH GRAVE}\N{COMBINING ACUTE ACCENT}
reorder:
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}\N{COMBINING GRAVE ACCENT}
\N{LATIN SMALL LETTER A}\N{COMBINING GRAVE ACCENT}\N{COMBINING ACUTE ACCENT}
What am I doing wrong, and how can I transform the strings to a canonical form that is fit for comparison?
Or is this just a weird grapheme that doesn't occur "in the wild" and therefore causes problems?
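One check that might narrow this down: canonical reordering only sorts adjacent combining marks whose canonical combining classes differ, so if both accents report the same class, none of the normalization forms would be expected to swap them. The following is a minimal sketch of that check, not a fix; it assumes the getCombinClass function that Unicode::Normalize exports on request, and uses the code points U+0301 (acute) and U+0300 (grave) for the two marks from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames qw(:full);
use Unicode::Normalize qw(getCombinClass);

binmode STDOUT, ':utf8';

# Code points of the two combining marks used in $str1 and $str2.
my @marks = (0x0301, 0x0300);

# Print the canonical combining class (ccc) of each mark.
# Marks with the same ccc keep their relative order under normalization.
for my $code (@marks) {
    printf "U+%04X %s ccc=%d\n",
        $code, charnames::viacode($code), getCombinClass($code);
}
```

If the two classes come out equal, the two strings are presumably not canonically equivalent as far as Unicode is concerned, and any comparison that treats them as equal would need an extra, application-specific step on top of normalization.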