PerlMonks
Re: utf8 characters in tr/// or s///
by graff (Chancellor) on Oct 01, 2008 at 03:32 UTC ( [id://714701]=note )
When you fetch utf8 text from mysql, you should always run it through Encode::decode("utf8",...) -- update: or equivalent, as shown by ikegami -- so that perl has a valid utf8 string with the "utf8" flag turned on. Then you can do lots of useful things using normal perl string operations.
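As a minimal sketch of that decoding step: the byte string below is only a stand-in for a value fetched from mysql, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for raw bytes fetched from the database: "café" encoded as UTF-8.
my $bytes = "caf\xC3\xA9";

# Decode the UTF-8 bytes into a perl character string.
my $text = decode("utf8", $bytes);

# Before decoding perl sees 5 bytes; afterwards it sees 4 characters.
printf "%d bytes, %d characters\n", length($bytes), length($text);
# prints: 5 bytes, 4 characters
```

Once the string is decoded, length, regex matches and tr/// all operate on characters rather than on the individual bytes of each multi-byte sequence.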
For example, here's a neat and easy way to eliminate all diacritic marks that come attached to Latin letters: apply "NFD" normalization (provided by Unicode::Normalize) so that each accented letter is decomposed into a base letter plus combining marks, then delete the combining marks.

Alas, that form of normalization does not convert "ø" to "o", or "Æ" to "AE", or "ß" to "ss", etc. That is, there may still be non-ascii characters in the final result, depending on what you have in your database. For stuff like that, you'll just have to face the task of defining what sort of behavior you really want (e.g. just strip them out, or define an explicit list of replacements, or...)

In case it might help, it's easy to get an inventory of the characters you have in the database, so that you can see which ones, if any, need special attention beyond just stripping diacritic marks. I posted a little tool here that shows one way to do that: unichist -- count/summarize characters in data.

One other caveat about that normalization process: for a number of languages (e.g. those that use Arabic, Hebrew, Devanagari, or other non-Latin scripts with diacritic marks), you may want/need to apply "NFC" normalization (also provided by Unicode::Normalize) after doing "NFD" and Latin diacritic removal, so that you "recompose" the non-Latin characters and diacritics into their "canonical" combined-character forms.

(update: having just seen ikegami's point about the "utf8::" functions, I agree -- that's a fine alternative to "use Encode".)
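Here is a rough sketch of how the pieces discussed above could fit together. It is one possible reading, not the original snippet. The function names, the %replace table and the choice to restrict mark removal to Latin-script base letters are all assumptions made for illustration, and the inventory helper is a much simpler stand-in for unichist, not that tool.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical replacement list for letters that NFD does not decompose.
my %replace = (
    "\x{F8}" => "o",     # ø
    "\x{C6}" => "AE",    # Æ
    "\x{DF}" => "ss",    # ß
);

# Expects an already-decoded character string.
sub strip_latin_diacritics {
    my ($text) = @_;
    my $s = NFD($text);                              # base letters + combining marks
    $s =~ s/(\p{Script=Latin})\p{M}+/$1/g;           # drop marks that follow a Latin letter
    $s =~ s/([\x{F8}\x{C6}\x{DF}])/$replace{$1}/g;   # apply the explicit replacements
    return NFC($s);                                  # recompose everything that was kept
}

# Count each non-ascii character, to see what still needs special attention.
sub non_ascii_inventory {
    my ($text) = @_;
    my %count;
    $count{$1}++ while $text =~ /([^\x00-\x7F])/g;
    return \%count;
}

binmode(STDOUT, ":encoding(UTF-8)");
print strip_latin_diacritics("Cr\x{E8}me br\x{FB}l\x{E9}e, Stra\x{DF}e, S\x{F8}ren"), "\n";
# prints: Creme brulee, Strasse, Soren
```

The final NFC call is what keeps non-Latin text intact: an Arabic "alef with madda above" (U+0622), for instance, is split into two characters by NFD and put back together by NFC, since its mark does not follow a Latin letter and so is never removed.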