
Also keep in mind that a perl "string" does not need to use a single encoding for all of its content.

Think of XML and CSV, where some parts can be raw binary and other parts can be encoded text.

Upgrading/downgrading the complete string before processing (either in pure perl or in XS) will cause data corruption.
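
To make the point about mixed content a bit more concrete, here is a minimal sketch. The record layout (two raw bytes, a comma, then UTF-8 encoded text) is made up for illustration, and bytes::length is only used to peek at the internal buffer, which is roughly what XS code gets to see when it ignores the UTF8 flag. Only the field that is known to be text is decoded; a copy of the complete record is upgraded for comparison.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Encode qw( decode );
use bytes  ();    # load bytes::length without enabling byte semantics

# Hypothetical record: two raw binary bytes, a separator, UTF-8 encoded text
my $binary = "\xff\xfe";
my $text   = "\xc3\xaf";    # UTF-8 encoding of U+00EF
my $record = join "," => $binary, $text;

# Decode only the part that is known to be UTF-8
my ($bin_field, $txt_field) = split m/,/ => $record, 2;
my $decoded = decode ("utf-8", $txt_field);

# Upgrade a copy of the complete record
my $upgraded = $record;
utf8::upgrade ($upgraded);

printf "record   : %d chars, %d bytes\n", length ($record),   bytes::length ($record);
printf "upgraded : %d chars, %d bytes\n", length ($upgraded), bytes::length ($upgraded);
printf "text     : %d char,  U+%04X\n",   length ($decoded),  ord ($decoded);
```

On the perl side the two records still compare equal with eq; the 5 versus 9 bytes is what anything reading the raw buffer gets to see, and decoding the complete record instead of just the text field would run into the two binary bytes.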

One more thing to keep in mind with codepoints is that Unicode allows more than one codepoint sequence for what looks like the same character.

e.g. U+1E2F (LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE) can be written in UTF-8 as e1 b8 af, c3 af cc 81, c3 ad cc 88, 69 cc 81 cc 88, or 69 cc 88 cc 81, all of which appear to represent the same glyph. At the time of writing, perl does not alter any of these, but once Unicode Normalization rules are applied on a semantic level, your world view changes:

(update: I have since learned that the order of the diacritics can be meaningful, which is why two of the examples below, the ones with the acute before the diaeresis, do not normalize to U+1E2F)

#!/usr/bin/perl

use 5.18.2;
use warnings;

use Data::Peek;
use Unicode::Normalize qw( normalize );
use Encode             qw( encode decode );
use charnames          qw(:full);

sub dp {
    my ($tag, $dta) = @_;
    my $dp = DPeek ($dta);
    printf "%-6s: %-52s", $tag, $dp =~ s{^(\S+)\K}{" " x (26 - length $1)}er;
    utf8::is_utf8 ($dta) and
        print join " + " => map { charnames::viacode (ord) } split // => $dta;
    say "";
    } # dp

$| = 1;
foreach my $bytes (
        "\xe1\xb8\xaf",
        "\xc3\xaf\xcc\x81",
        "\xc3\xad\xcc\x88",
        "\x69\xcc\x81\xcc\x88",
        "\x69\xcc\x88\xcc\x81",
        ) {
    my $u = decode ("utf-8", $bytes);
    dp ("Bytes", $bytes);
    dp ("UTF-8", $u);
    dp ("NF$_", normalize ($_, $u)) for qw( D C KD KC );
    say "";
    }

Bytes : PV("\341\270\257"\0)
UTF-8 : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\257\314\201"\0)
UTF-8 : PV("\303\257\314\201"\0)   [UTF8 "\x{ef}\x{301}"]   LATIN SMALL LETTER I WITH DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE

Bytes : PV("\303\255\314\210"\0)
UTF-8 : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\201\314\210"\0)
UTF-8 : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFD   : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFC   : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS
NFKD  : PV("i\314\201\314\210"\0)  [UTF8 "i\x{301}\x{308}"] LATIN SMALL LETTER I + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
NFKC  : PV("\303\255\314\210"\0)   [UTF8 "\x{ed}\x{308}"]   LATIN SMALL LETTER I WITH ACUTE + COMBINING DIAERESIS

Bytes : PV("i\314\210\314\201"\0)
UTF-8 : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFD   : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFC   : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
NFKD  : PV("i\314\210\314\201"\0)  [UTF8 "i\x{308}\x{301}"] LATIN SMALL LETTER I + COMBINING DIAERESIS + COMBINING ACUTE ACCENT
NFKC  : PV("\341\270\257"\0)       [UTF8 "\x{1e2f}"]        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
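
The results above can be condensed into a small check. This is only a sketch: canon_eq is a name made up here, and it treats two character strings as the same text when their NFC forms are identical. getCombinClass from Unicode::Normalize is used to look at the canonical combining class of the two marks. Both marks report the same class, and as far as I understand the rules, marks of the same class are not reordered, which fits with the acute-first sequences not ending up as U+1E2F.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use Unicode::Normalize qw( NFC getCombinClass );

# Hypothetical helper: same text when the NFC forms are identical
sub canon_eq {
    my ($x, $y) = @_;
    return NFC ($x) eq NFC ($y) ? 1 : 0;
    } # canon_eq

my $precomposed = "\x{1e2f}";           # I WITH DIAERESIS AND ACUTE
my $dia_acute   = "i\x{308}\x{301}";    # i + diaeresis + acute
my $acute_dia   = "i\x{301}\x{308}";    # i + acute + diaeresis

printf "%-32s %s\n", "U+1E2F vs i + 0308 + 0301",
    canon_eq ($precomposed, $dia_acute) ? "same" : "different";
printf "%-32s %s\n", "U+1E2F vs i + 0301 + 0308",
    canon_eq ($precomposed, $acute_dia) ? "same" : "different";
printf "%-32s %s\n", "U+00ED U+0308 vs i + 0301 + 0308",
    canon_eq ("\x{ed}\x{308}", $acute_dia) ? "same" : "different";

# Canonical combining class of the two marks
printf "U+%04X has combining class %d\n", $_, getCombinClass ($_)
    for 0x308, 0x301;
```

Comparing on a normalized form like this only makes sense after decoding; the byte strings themselves stay different.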

Enjoy, Have FUN! H.Merijn

In reply to Re: What does utf8::upgrade actually do. by Tux
in thread What does utf8::upgrade actually do. by syphilis
