Re^5: Handling malformed UTF-16 data with PerlIO layer

(e.g. what are private-use high-surrogates, really? ...and who knows what else there might be).

There is no such thing as "private-use high-surrogates". There is a region of the unicode space reserved for "private use" (from E000 thru F8FF), and there is the region set aside for "surrogates" (from D800 thru DFFF). There's also a "supplementary private use" area running from F0000 - 10FFFF, which is not relevant here (note the extra digits).

There is no "supplemental surrogates" area -- the surrogate region is "special" and unique, reserved specifically so that UTF-16 encodings have a way of representing code points above FFFF (in much the same way that byte-oriented utf8 uses multi-byte sequences for code points above 7F).

In effect, UTF-16 is a "variable-width" encoding in the case where code points above FFFF are being used -- such "higher-plane" code points must be expressed via two UTF-16 values. Since the very highest Unicode code point is 10FFFF (21 bits), and since the high 5 bits are only used for 16 distinct "upper planes" (01....-10...., hence 4 bits' worth once 10000 is subtracted from the code point), the surrogate region provides for the 20 "significant" bits to be split over two 16-bit words, where the high 6 bits of each word are rigidly fixed: the first word of a surrogate pair must start with 110110 (D800-DBFF, carrying the "High" 10 bits), and the second word must start with 110111 (DC00-DFFF, carrying the "Low" 10 bits).
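
To make the bit-splitting concrete, here is a minimal sketch of the arithmetic. The sub names to_surrogates and from_surrogates are made up for illustration (they are not part of Encode), and the code only shows the calculation, not how Encode implements it internally.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point above FFFF into a surrogate pair.
sub to_surrogates {
    my ($cp) = @_;
    die sprintf( "U+%04X is not in the range 10000-10FFFF\n", $cp )
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v  = $cp - 0x10000;             # leaves the 20 significant bits
    my $hi = 0xD800 | ( $v >> 10 );     # 110110 + upper 10 bits
    my $lo = 0xDC00 | ( $v & 0x3FF );   # 110111 + lower 10 bits
    return ( $hi, $lo );
}

# Hypothetical helper: recombine a High/Low pair into the code point.
sub from_surrogates {
    my ( $hi, $lo ) = @_;
    die "first word is not a High surrogate\n"
        unless $hi >= 0xD800 && $hi <= 0xDBFF;
    die "second word is not a Low surrogate\n"
        unless $lo >= 0xDC00 && $lo <= 0xDFFF;
    return 0x10000 + ( ( $hi - 0xD800 ) << 10 ) + ( $lo - 0xDC00 );
}

printf "U+%04X -> %04X %04X\n", 0x1D11E, to_surrogates(0x1D11E);
```

For U+1D11E this prints the pair D834 DD1E, and from_surrogates(0xD834, 0xDD1E) gives back 0x1D11E.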

This serves to explain why you cannot convert a 16-bit value in the surrogate range into a utf8 character -- no characters (no code points) can be defined within that range of 16-bit values. But when a code point above FFFF is correctly encoded into UTF-16, you get surrogates (a pair of 16-bit values, one each in the "High" and "Low" regions of the surrogate range).

Regarding ikegami's observation about FFFE and FFFF, I noticed that this is a difference between 5.8.8 and 5.10.0 -- Encode handles these code points in 5.8 but it spits out the error in 5.10. It's certainly true that Unicode explicitly reserves these values as "non-characters." I'm not sure whether 5.8 or 5.10 has the better approach, and I sort of expect that it might depend on the circumstances. I looked for something about this in perldelta, but didn't see anything explicit.

In addition to those two "non-character" code points, the same result applies to the range FDD0 - FDEF. According to the Unicode reference page, "These codes are intended for process-internal uses, but are not permitted for interchange." I don't really know what "process-internal uses" means (but "not permitted for interchange" seems pretty clear).
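
As a rough summary of the ranges mentioned so far, the following sketch sorts a 16-bit value into one of four groups. The sub name classify_code_unit is hypothetical, and the classification is based purely on the ranges named above; it does not predict what a particular version of Encode will accept or reject.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a single 16-bit value by the ranges above.
sub classify_code_unit {
    my ($n) = @_;
    return 'high surrogate' if $n >= 0xD800 && $n <= 0xDBFF;
    return 'low surrogate'  if $n >= 0xDC00 && $n <= 0xDFFF;
    return 'non-character'
        if ( $n >= 0xFDD0 && $n <= 0xFDEF ) || $n == 0xFFFE || $n == 0xFFFF;
    return 'ok';
}

printf "%04X: %s\n", $_, classify_code_unit($_)
    for 0x0041, 0xD800, 0xDC00, 0xE000, 0xFDD0, 0xFFFE;
```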

In any case, here's a test script for identifying all the unsavory (error-inducing) 16-bit values -- you can run this in both 5.8.8 and 5.10.0 to see how the two versions differ in their behavior.

I think the "eval" technique here might be a decent approach for what you need to do with your data -- I'm afraid you'll need to ditch the idea of using the PerlIO::encoding layer, and should probably go with reading into a fixed-sized buffer. Check out the description of FB_WARN in the Encode man page, because it handles the case where you are doing fixed-size buffer reads and get a partial character at the end of a given buffer.

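use strict;
use warnings;
use Encode;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";

# Try every 16-bit value as a single little-endian UTF-16 code unit.
for my $n (0x0..0xffff) {
    my $c = pack( "v", $n );
    my $u;
    eval { $u = decode( "UTF-16LE", $c, Encode::FB_WARN ) };
    if ( $@ ) {
        warn $@;             # decode croaked: report the offending value
        print "\x{FEFF}\n";  # placeholder line in the output
    }
    else {
        $u = '\\n' if ( $u eq "\n" );  # just so LF doesn't show up as two lines
        print "$u\n";
    }
}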
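
For the fixed-size buffer idea, here is one possible sketch, not a tested solution for your data. The sub name decode_utf16le_lenient and the 4096-byte default are assumptions, as is the choice to substitute U+FFFD for anything decode() rejects. Instead of relying on the CHECK argument to deal with partial characters, it holds back a trailing odd byte or a trailing High surrogate until the next read, and decodes one code unit (or one surrogate pair) at a time inside eval.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: read UTF-16LE from $fh in fixed-size chunks and
# replace every undecodable unit with U+FFFD (an arbitrary choice).
sub decode_utf16le_lenient {
    my ( $fh, $chunk_size ) = @_;
    $chunk_size ||= 4096;
    my ( $pending, $out, $eof ) = ( '', '', 0 );
    until ($eof) {
        my $got = read( $fh, my $buf, $chunk_size );
        die "read failed: $!\n" unless defined $got;
        $eof = 1 if $got == 0;
        $pending .= $buf;
        my @units = unpack( 'v*', $pending );    # whole 16-bit units only
        # Carry an odd trailing byte over to the next read.
        my $tail = length($pending) % 2 ? substr( $pending, -1 ) : '';
        # Carry a trailing High surrogate over too; its partner may follow.
        if ( !$eof && @units && $units[-1] >= 0xD800 && $units[-1] <= 0xDBFF ) {
            $tail = pack( 'v', pop @units ) . $tail;
        }
        $pending = $tail;
        while (@units) {
            my $n     = shift @units;
            my $bytes = pack( 'v', $n );
            # Keep a well-formed High + Low pair together.
            if (   $n >= 0xD800 && $n <= 0xDBFF
                && @units && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF ) {
                $bytes .= pack( 'v', shift @units );
            }
            my $u = eval { decode( 'UTF-16LE', $bytes, Encode::FB_CROAK ) };
            $out .= ( defined $u && length $u ) ? $u : "\x{FFFD}";
        }
    }
    $out .= "\x{FFFD}" if length $pending;    # stray odd byte at end of input
    return $out;
}
```

Decoding unit by unit is slow, but it means a single bad value costs only one replacement character rather than the rest of the buffer. It can be tried on an in-memory filehandle, e.g. open my $fh, '<', \$bytes.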

Re^6: Handling malformed UTF-16 data with PerlIO layer by almut (Canon) on Oct 28, 2008 at 20:12 UTC
There is no such thing as "private-use high-surrogates".

Well, I was referring to (quote from p. 548, section 16.6, Unicode Standard v5.0 -- which I linked to in the original post):

"Private-Use High-Surrogates. The high-surrogate code points from U+DB80..U+DBFF are private-use high-surrogate code points (a total of 128 code points). Characters represented by means of a surrogate pair, where the high-surrogate code point is a private-use high-surrogate, are private-use characters from the supplementary private use areas. For more information on private-use characters, see Section 16.5, Private-Use Characters."

though I wasn't just referring to those 128 code points, but rather to the wider context of the respective surrogate pairs, and how they would be used in practice.

Anyhow, things like you (an expert) denying the existence of private-use high-surrogates, kinda confirms what I'm saying :) Encodings like UTF-16 are non-trivial enough for me to not necessarily want to get into every detail of it if there's some way around (though, as it looks, there doesn't seem to be...). Rather, I'd like to rely on the good work already done within Perl by people like our honorable Juerd. After all, what's the point of having support for unicode and other encodings in Perl, if you then write your own parsers from scratch?

eval { $u = decode( "UTF-16LE", $c, Encode::FB_WARN ) };

Interestingly, the current Encode docs note:

"Handling Malformed Data

The optional CHECK argument tells Encode what to do when it encounters malformed data. (...)

NOTE: Not all encodings support this feature. Some encodings ignore CHECK argument. For example, Encode::Unicode ignores CHECK and it always croaks on error."

(Encode::Unicode implements unicode encodings like UTF-16)

AFAICT, this is partially true. That is, the CHECK argument appears to honor the value FB_DEFAULT, but croaks with anything else, which would explain why - with UTF-16 - FB_QUIET and FB_WARN do not quite produce the behavior you'd expect from reading the description of those constants...

Re^7: Handling malformed UTF-16 data with PerlIO layer by graff (Chancellor) on Oct 28, 2008 at 22:24 UTC
Private-Use High-Surrogates. The high-surrogate code points from U+DB80..U+DBFF are private-use high-surrogate code points (a total of 128 code points).

Oh yeah, that's true -- I was just taking the viewpoint that the "surrogate range" as a block (as it relates to potential encoding errors) does not really need to be broken into the parts that map to the "supplemental private-use area", because this area is just part of the "higher planes" in the unicode space, and is addressed by surrogates in the same way as all the other planes above FFFF.

(Encode::Unicode implements unicode encodings like UTF-16)

Thanks for clarifying that -- this thread has been very educational for me.

Re^6: Handling malformed UTF-16 data with PerlIO layer by ikegami (Patriarch) on Oct 28, 2008 at 10:13 UTC
Regarding ikegami's observation about FFFE and FFFF, I noticed that this is a difference between 5.8.8 and 5.10.0 -- Encode handles these code points in 5.8 but it spits out the error in 5.10.

The error messages I got were from 5.8.8. I don't see any difference between 5.8 and 5.10.

>c:\progs\perl580\bin\perl -e"print qq{\xFE\xFF}" | perl -e"binmode STDIN, ':encoding(UTF-16le)'; <>"
UTF-16LE:Unicode character fffe is illegal at -e line 1.

>c:\progs\perl588\bin\perl -e"print qq{\xFE\xFF}" | perl -e"binmode STDIN, ':encoding(UTF-16le)'; <>"
UTF-16LE:Unicode character fffe is illegal at -e line 1.

>c:\progs\perl5100\bin\perl -e"print qq{\xFE\xFF}" | perl -e"binmode STDIN, ':encoding(UTF-16le)'; <>"
UTF-16LE:Unicode character fffe is illegal at -e line 1.

