PerlMonks
Re: Intra-Unicode Conversions
by robartes (Priest) on Nov 15, 2006 at 14:51 UTC ( [id://584165]=note )
What the rather convoluted regex in your post does is take a character (in the narrow sense: an unsigned char in C speak) in the range 0x80 - 0xFF (your basic 'code page playground' of yore) and convert it to its UTF-8 representation. It does this in a very naive way, BTW, that is only valid if the input text is in a code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF, which is to say not many of them (ISO 8859-1 is the obvious one).

UTF-8 says that any Unicode code point in the range U+0080 to U+07FF is encoded in two bytes, with the three highest-order bits of the first byte being 110 and the two highest-order bits of the second byte being 10. The remaining 11 bits (5 in the first byte, 6 in the second) store the actual code point value. E.g., the character U+00A4 (the currency symbol ¤) is stored as the two bytes 0xC2 0xA4, i.e. 110 00010 followed by 10 100100.
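To check the two-byte layout by hand, here is a minimal sketch that splits a code point into its lead and trail bytes. The helper name `two_byte_utf8` is made up for illustration and only covers the U+0080 to U+07FF range discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a code point in U+0080..U+07FF as two UTF-8 bytes.
sub two_byte_utf8 {
    my ($cp) = @_;
    die "code point outside the two-byte range\n" if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # 110xxxxx: upper 5 of the 11 payload bits
    my $trail = 0x80 | ($cp & 0x3F);    # 10xxxxxx: lower 6 payload bits
    return ($lead, $trail);
}

my @bytes = two_byte_utf8(0xA4);
printf "U+%04X -> %s (%s)\n", 0xA4,
    join(' ', map { sprintf '0x%02X', $_ } @bytes),
    join(' ', map { sprintf '%08b',   $_ } @bytes);
# prints: U+00A4 -> 0xC2 0xA4 (11000010 10100100)
```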
Note that if the original text was in ISO 8859-15, 0xA4 is the euro symbol €, which the regex would translate to ¤. Anyway, the bit twiddling in the sprintf does the UTF-8 conversion (I'm using jonadab's representation here):
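The bit twiddling can be sketched roughly like this. It is not the regex from the parent post, nor jonadab's notation, just a minimal version that makes the same assumption (byte value equals code point). The name `naive_high_bytes_to_utf8` is made up here; `encode` and `decode` come from the core Encode module and are only used to compare results.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: replace every byte 0x80-0xFF with the two-byte UTF-8
# sequence for the code point of the same number (Latin-1 assumption).
sub naive_high_bytes_to_utf8 {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/sprintf('%c%c', 0xC0 | (ord($1) >> 6), 0x80 | (ord($1) & 0x3F))/ge;
    return $bytes;
}

my $input = "price: \xA4 5";
my $naive = naive_high_bytes_to_utf8($input);

# If the input really is ISO 8859-1, the naive result matches Encode.
my $as_latin1 = encode('UTF-8', decode('iso-8859-1', $input));

# If the input is ISO 8859-15, 0xA4 is the euro sign (U+20AC), which needs
# three UTF-8 bytes, so the naive result silently produces the wrong character.
my $as_latin9 = encode('UTF-8', decode('iso-8859-15', $input));

print 'matches ISO 8859-1 reading:  ', ($naive eq $as_latin1 ? 'yes' : 'no'), "\n";
print 'matches ISO 8859-15 reading: ', ($naive eq $as_latin9 ? 'yes' : 'no'), "\n";
```

For real conversions, decoding with the correct source encoding and re-encoding via Encode avoids the guesswork entirely.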
Please excuse my gratuitous invention of new English verbs. CU