http://qs321.pair.com?node_id=584165


in reply to Intra-Unicode Conversions

What the rather convoluted regex in your post does is take a character (in the narrow definition thereof: unsigned char in C speak) in the range 0x80 - 0xFF (your basic 'code page playground' of yore) and convert it to the UTF-8 representation of the code point with the same number. It does this in a very naive way, BTW, that is only valid if the input text is in the code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF, which is to say not many of them. ISO 8859-1 (Latin-1) is the one that does.

UTF-8 says that any Unicode codepoint in the range U+0080 to U+07FF is encoded in two bytes, with the first three bits (highest order bits) of the first (highest order) byte being 110 and the first two bits of the second byte being 10. The remaining 11 bits are used to store the actual codepoint value. E.g., the character U+00A4 (the currency symbol ¤) is stored as follows:

Codepoint U+00A4 --> hex 0xA4 --> binary 10100100

We need to store 10100100 in the UTF-8 bytes:
    110..... 10......

We pad 10100100 to 11 bits (00010100100) and distribute it over the 'points' in the two bytes:
    110 00010   10 100100

So U+00A4 in UTF-8 becomes 11000010 10100100 or 0xc2 0xa4.
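To check the arithmetic above against Perl's own encoder, here is a minimal sketch. It assumes an ASCII platform and a Perl recent enough to have the built-in utf8::encode(); bits_of_utf8() is a helper name made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes of one code point
# as a list of 8-bit binary strings.
sub bits_of_utf8 {
    my ($codepoint) = @_;
    my $bytes = chr($codepoint);   # a one-character string
    utf8::encode($bytes);          # now a byte string holding the UTF-8 encoding
    return map { sprintf "%08b", $_ } unpack "C*", $bytes;
}

printf "U+%04X => %s\n", 0xA4, join " ", bits_of_utf8(0xA4);
# prints: U+00A4 => 11000010 10100100
```

The leading 110 and 10 markers are visible in the output, followed by the 11 payload bits.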

Note that if the original text was in ISO 8859-15, 0xA4 is the euro symbol € which would be translated to ¤ by the regex.
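If the input really is ISO 8859-15, the conversion has to go through that code page's mapping instead of reusing the byte value as the code point. A rough sketch with the core Encode module follows; latin9_to_utf8() is a name invented for this example, and the input encoding is an assumption you would have to know up front.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert ISO 8859-15 bytes to UTF-8 bytes.
sub latin9_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-15', $bytes);   # bytes -> Unicode characters
    return encode('UTF-8', $chars);              # characters -> UTF-8 bytes
}

my $utf8 = latin9_to_utf8("\xA4");
print join(" ", map { sprintf "%02x", $_ } unpack "C*", $utf8), "\n";
# prints: e2 82 ac   (U+20AC, the euro sign, which needs three bytes)
```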

Anyway, the bit twiddling in the sprintf does the UTF-8 conversion (I'm using jonadab's representation here):

    sprintf("%c%c",
        # Build first byte by OR'ing 0xc0 (binary 11000000) with
        # the two highest order bits of the character
        (0xc0 | ($o >> 6)),
        # Build the second byte by OR'ing 0x80 (binary 10000000)
        # with the lower 6 bits of the character (obtained by
        # AND'ing with 0x3f, 00111111)
        (0x80 | ($o & 0x3f)));

Please excuse my gratuitous invention of new English verbs.

CU
Robartes-

Re^2: Intra-Unicode Conversions (naive)
by tye (Sage) on Nov 15, 2006 at 16:37 UTC
    It does this in a very naive way, BTW, that is only valid if the input text is in the code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF, which is to say not many of them.

    And Perl is exactly this naive as well. You can get this exact same result without writing any bit-twiddling Perl code by instead convincing Perl to promote the string to UTF-8, and then storing the resulting bytes into a Perl byte string (or by just turning off the "is UTF-8" bit on that Perl scalar). For example:

    #!/usr/bin/perl -w
    use strict;
    require utf8;

    my $s= pack "C*", 1..255;   # Byte string to convert
    my $u= pack "U*", 1..255;   # UTF-8 string
    my $e= substr($u,0,0);      # Empty UTF-8 string

    my $r= $s;                  # Convert using regex
    $r =~ s{
        ([^\0-\x7F])
    }{
        my $o= ord($1);
        sprintf "%c%c",
            0xc0 | ( $o >> 6 ),
            0x80 | ( $o & 0x3f );
    }gex;

    my $i= $s.$e;               # Convert by implicit upgrade to UTF-8

    my $f= $s;                  # Upgrade via utf8.pm function
    utf8::upgrade( $f );

    my $b= $s;                  # Upgrade then mark as bytes
    utf8::encode( $b );

    if(  $r eq $b  ) {
        print "The regex and utf8::encode() match.\n";
    }
    if(  $u eq $i  &&  $i eq $f  ) {
        print "The 3 Unicode strings match.\n";
    }
    if(  join(" ",unpack"C*",$r) eq join(" ",unpack"C*",$i)  ) {
        print "The byte- and unicode-strings have the same bytes.\n";
    }
    if(  $r ne $i  ) {
        print "The byte- and unicode-strings are not equal.\n";
    }
    print '$s contains ', length($s), " bytes.\n";
    print '$i contains ', length($i), " characters.\n";
    print '$r contains ', length($r), " bytes.\n";

    Which produces:

    The regex and utf8::encode() match.
    The 3 Unicode strings match.
    The byte- and unicode-strings have the same bytes.
    The byte- and unicode-strings are not equal.
    $s contains 255 bytes.
    $i contains 255 characters.
    $r contains 383 bytes.
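The other route mentioned earlier, turning off the "is UTF-8" bit, is not exercised by the script. A minimal sketch of it is below, assuming an ASCII platform; Encode::_utf8_off() is documented by Encode as an internal function, so treat this as an illustration rather than something to rely on, and note that upgrade_then_flag_off() is a name invented here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: upgrade a byte string to Perl's internal UTF-8
# form, then drop the UTF-8 flag so the raw bytes show through.
sub upgrade_then_flag_off {
    my ($bytes) = @_;
    my $copy = $bytes;
    utf8::upgrade($copy);        # same characters, UTF-8 internally
    Encode::_utf8_off($copy);    # reinterpret the internal buffer as bytes
    return $copy;
}

my $bytes   = pack "C*", 0x80 .. 0xFF;
my $encoded = $bytes;
utf8::encode($encoded);          # the supported way to get the same bytes

print upgrade_then_flag_off($bytes) eq $encoded
    ? "same bytes\n"
    : "different bytes\n";
```

utf8::encode() is the safer choice in real code, since it does not depend on how the string happens to be stored internally.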

    The regex is different in that it doesn't molest null bytes. If you change "1..255" to "0..255" in the above code, you'll see that when Perl (v5.8.7 on Win32, anyway) converts a byte string to Unicode, it just unceremoniously stops at any byte of value 0.
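Since the null-byte observation is tied to one Perl version and platform, a small probe like the following can report what a given perl does. It is only a sketch, length_after_upgrade() is a made-up helper name, and no particular result is assumed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: upgrade a copy of the byte string and report
# how many characters the upgraded string contains.
sub length_after_upgrade {
    my ($bytes) = @_;
    my $copy = $bytes;
    utf8::upgrade($copy);
    return length $copy;
}

my $with_nul = pack "C*", 0 .. 255;   # includes a byte of value 0
printf "Perl %vd: %d bytes in, %d characters out\n",
    $^V, length($with_nul), length_after_upgrade($with_nul);
```

If the two counts differ, the upgrade dropped something on that build.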

    - tye