http://qs321.pair.com?node_id=584165


in reply to Intra-Unicode Conversions

What the rather convoluted regex in your post does is take a character (in the narrow sense: an unsigned char in C speak) in the range 0x80 - 0xFF (your basic 'code page playground' of yore) and convert it to its UTF-8 representation. It does this in a very naive way, BTW: the result is only valid if the input text is in a code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF, which is to say not many of them (ISO 8859-1 being the obvious one).

UTF-8 says that any Unicode codepoint in the range U+0080 to U+07FF is encoded in two bytes, with the first three bits (highest order bits) of the first (highest order) byte being 110 and the first two bits of the second byte being 10. The remaining 11 bits are used to store the actual codepoint value. E.g., the character U+00A4 (the currency symbol ¤) is stored as follows:

Codepoint U+00A4 --> hex 0xA4 --> binary 10100100

We need to store 10100100 in the UTF-8 bytes:
    110..... 10......

We distribute 10100100 over the 'points' in the two bytes:
    110 00010   10 100100

So U+00A4 in UTF-8 becomes 11000010 10100100 or 0xc2 0xa4.
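If you want to check the worked example by machine, here is a minimal sketch. The helper name two_byte_utf8 is my own invention for illustration, and it deliberately handles only the two-byte range U+0080 to U+07FF.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): encode one code point
# in the range U+0080..U+07FF as its two UTF-8 bytes.
sub two_byte_utf8 {
    my ($cp) = @_;
    die "code point outside the two-byte range\n"
        if $cp < 0x80 || $cp > 0x7FF;

    my $lead  = 0xC0 | ($cp >> 6);     # 110xxxxx: upper 5 of the 11 payload bits
    my $trail = 0x80 | ($cp & 0x3F);   # 10xxxxxx: lower 6 payload bits

    return pack 'C2', $lead, $trail;   # two raw bytes
}

my $bytes = two_byte_utf8(0xA4);
print join(' ', map { sprintf '0x%02x', $_ } unpack 'C*', $bytes), "\n";
# prints: 0xc2 0xa4
```

For anything outside a quick sanity check you would let the Encode module do this rather than twiddling the bits yourself.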

Note that if the original text was in ISO 8859-15, 0xA4 is the euro symbol € which would be translated to ¤ by the regex.
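To see the difference the source code page makes, here is a small sketch using the core Encode module. The helper names to_utf8_from and hex_bytes are made up for this example; the charset names are the ones Encode ships with.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode a byte string under the named code page,
# then re-encode the resulting characters as UTF-8.
sub to_utf8_from {
    my ($charset, $bytes) = @_;
    return encode('UTF-8', decode($charset, $bytes));
}

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02x', $_ } unpack 'C*', $_[0] }

print hex_bytes(to_utf8_from('iso-8859-1',  "\xA4")), "\n";  # c2 a4    (currency sign)
print hex_bytes(to_utf8_from('iso-8859-15', "\xA4")), "\n";  # e2 82 ac (euro sign)
```

The same input byte gives two different UTF-8 sequences, which is why the conversion has to know which code page the text really came from.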

Anyway, the bit twiddling in the sprintf does the UTF-8 conversion (I'm using jonadab's representation here):

sprintf("%c%c",
    # Build the first byte by OR'ing 0xc0 (binary 11000000) with
    # the two highest order bits of the character
    (0xc0 | ($o >> 6)),
    # Build the second byte by OR'ing 0x80 (binary 10000000)
    # with the lower 6 bits of the character (obtained by
    # AND'ing with 0x3f, binary 00111111)
    (0x80 | ($o & 0x3f)))
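For completeness, here is how that sprintf might sit inside a substitution over a whole string. This is a sketch only: the wrapper name latin1_bytes_to_utf8 and the exact shape of the regex are my assumptions, not a copy of the code in the original post, and it assumes the input is a byte string in ISO 8859-1.

```perl
use strict;
use warnings;

# Hypothetical wrapper: assumes $bytes is an ISO 8859-1 byte string.
# Bytes below 0x80 are plain ASCII and pass through unchanged.
sub latin1_bytes_to_utf8 {
    my ($bytes) = @_;
    $bytes =~ s{([\x80-\xFF])}{
        my $o = ord $1;
        sprintf("%c%c", 0xc0 | ($o >> 6), 0x80 | ($o & 0x3f));
    }ge;
    return $bytes;
}

my $utf8 = latin1_bytes_to_utf8("caf\xE9");   # "cafe" with e-acute, in Latin-1
print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $utf8), "\n";
# prints: 63 61 66 c3 a9
```

Run it twice over the same text and you get double-encoded garbage, so only apply it to input you know has not been converted yet.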

Please excuse my gratuitous invention of new English verbs.

CU
Robartes-