Also keep in mind that a Perl "string" does not need to use a single encoding for all of its content.
Think of XML and CSV, where some parts can be real binary and other parts can be encoded text.
Upgrading or downgrading the complete string before processing (either in pure Perl or in XS) will cause data corruption.
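To make that kind of corruption concrete, here is a minimal sketch. The record layout (two length-prefixed fields built with pack) and all variable names are made up for illustration; only decode and encode from Encode are real API. Note that it uses decode rather than utf8::upgrade or utf8::downgrade, so it illustrates a related failure (one operation applied to the complete string when only part of it is text), not necessarily the exact case meant above.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical record layout (an assumption for this sketch):
# two length-prefixed fields, UTF-8 text followed by raw binary.
my $text_bytes = "\xc3\xaf";        # U+00EF encoded as UTF-8
my $blob       = "\xff\xfe\x80";    # binary, not valid UTF-8
my $record     = pack "n/a* n/a*", $text_bytes, $blob;

# Wrong: decode the complete record in one go. The malformed
# sequences in the binary part are replaced, so the blob is lost.
my $copy     = $record;
my $whole    = decode ("UTF-8", $copy);
my $whole_ok = encode ("UTF-8", $whole) eq $record ? 1 : 0;

# Better: split first, then decode only the field known to be text.
my ($txt, $bin) = unpack "n/a* n/a*", $record;
my $text        = decode ("UTF-8", $txt);
my $blob_ok     = $bin eq $blob ? 1 : 0;

printf "whole-record round trip intact: %s\n", $whole_ok ? "yes" : "no";
printf "text field: U+%04X, blob intact: %s\n",
    ord ($text), $blob_ok ? "yes" : "no";
```

The point of the design is simply that the structure is parsed on bytes first, and encoding is handled per field afterwards.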
One more thing to keep in mind with codepoints is that Unicode allows a lot.
E.g. U+1E2F (LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE) can be written in UTF-8 as e1 b8 af, c3 af cc 81, c3 ad cc 88, 69 cc 81 cc 88, or 69 cc 88 cc 81, all of which look like they should represent the same glyph. At the time of writing, perl does not alter any of that, but once Unicode normalization rules are applied on a semantic level, your world view changes:
(update: I have since learned that the order of the diacriticals can be meaningful, which is why two of the examples below do not normalize to U+1E2F)
#!/usr/bin/perl

use 5.18.2;
use warnings;

use Data::Peek;
use Unicode::Normalize qw( normalize );
use Encode             qw( encode decode );
use charnames          qw(:full);

# Show the internal representation of $dta and, for strings with the
# UTF8 flag on, the Unicode name of every codepoint in it.
sub dp {
    my ($tag, $dta) = @_;
    my $dp = DPeek ($dta);
    # Pad the first token of the DPeek output so the columns line up
    printf "%-6s: %-52s", $tag, $dp =~ s{^(\S+)\K}{" " x (26 - length $1)}er;
    utf8::is_utf8 ($dta) and
        print join " + " => map { charnames::viacode (ord) } split // => $dta;
    say "";
} # dp

$| = 1;

# Five UTF-8 byte sequences combining i, diaeresis, and acute in different ways
foreach my $bytes (
        "\xe1\xb8\xaf",
        "\xc3\xaf\xcc\x81",
        "\xc3\xad\xcc\x88",
        "\x69\xcc\x81\xcc\x88",
        "\x69\xcc\x88\xcc\x81",
        ) {
    my $u = decode ("utf-8", $bytes);
    dp ("Bytes", $bytes);
    dp ("UTF-8", $u);
    # Show all four normalization forms of the decoded string
    dp ("NF$_", normalize ($_, $u)) for qw( D C KD KC );
    say "";
}
->
Bytes : PV("\341\270\257"\0)
UTF-8 : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\257\314\201"\0)
UTF-8 : PV("\303\257\314\201"\0)   [UTF8 "\x{ef}\x{301}"]   LATIN SMALL LETTER I WITH DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\255\314\210"\0)
UTF-8 : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\201\314\210"\0)
UTF-8 : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\210\314\201"\0)
UTF-8 : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
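
One practical consequence of the output above, as a small sketch: plain eq compares codepoint by codepoint, so strings have to be brought to the same normalization form before they are compared. The candidate list and the variable names are only for illustration; NFC is exported by the core Unicode::Normalize module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

my $target = "\x{1e2f}";    # LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

# Illustrative candidates, taken from the sequences shown above
my @candidates = (
    "\x{ef}\x{301}",        # i-diaeresis + acute
    "i\x{308}\x{301}",      # i + diaeresis + acute
    "\x{ed}\x{308}",        # i-acute + diaeresis
    "i\x{301}\x{308}",      # i + acute + diaeresis
);

my %nfc_equal;
for my $s (@candidates) {
    # Render the string as a list of codepoints so the output stays ASCII
    my $cps = join " ", map { sprintf "U+%04X", ord } split //, $s;
    my $raw = $s eq $target ? "eq" : "ne";
    $nfc_equal{$cps} = NFC ($s) eq NFC ($target) ? 1 : 0;
    printf "%-22s raw: %s  NFC: %s\n", $cps, $raw,
        $nfc_equal{$cps} ? "eq" : "ne";
}
```

None of the candidates is eq to the target as-is. After NFC, only the two with the diaeresis before the acute match, which agrees with the NFC lines in the output above.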
Enjoy, Have FUN! H.Merijn