PerlMonks
Can't tell if UTF-8... or just binary...
by Kirsle (Pilgrim)
on Aug 23, 2011 at 18:35 UTC ( [id://921962]=perlquestion )
Kirsle has asked for the wisdom of the Perl Monks concerning the following question:
Hello monks, I have an interesting dilemma on my hands. I'm trying to find a way to determine whether some arbitrary blob of data is text or just binary. I used to have an old "is_binary()" method, which just looked for bytes outside the 7-bit ASCII range (0-127), but that doesn't work when the string contains Unicode characters, because UTF-8 encodes every non-ASCII character using bytes that fall outside the ASCII range.
Here's a script I'm using for testing, to try to figure out a way to detect whether data is UTF-8 or just random binary:
(The utf8::decode function can be found on another one of my perlmonks posts, JSON, UTF-8 and Filehandles.) The only reliable method I've found so far is to rely on the is_utf8 flag, on the assumption that most valid strings throughout the code have been properly decoded and therefore have the UTF-8 flag set. Is there a better way?
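For comparison, here is a minimal sketch of one alternative that does not depend on the flag: try a strict UTF-8 decode of the raw bytes and treat a failure as "binary". This is not the test script mentioned above. The helper name looks_like_utf8_text is hypothetical, and the extra check for control characters is an assumption about what should count as "text". Random binary data can occasionally happen to be well-formed UTF-8, especially when it is short, so this is a heuristic rather than a proof.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not from the original post): returns 1 if $bytes
# decodes as strict UTF-8 and contains no unusual control characters,
# otherwise returns 0.
sub looks_like_utf8_text {
    my ($bytes) = @_;

    # 'UTF-8' (with the hyphen) selects Encode's strict decoder.
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return 0 unless defined $text;

    # Assumption: real text should not contain C0 control characters
    # other than tab, line feed and carriage return.
    return 0 if $text =~ /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/;

    return 1;
}

# Usage: pass raw, undecoded bytes (for example, read from a :raw filehandle).
print looks_like_utf8_text("caf\xC3\xA9") ? "text\n" : "binary\n";
print looks_like_utf8_text("\xFF\xFE\x00\x01") ? "text\n" : "binary\n";
```

The design choice here is to inspect the bytes themselves rather than Perl's internal flag, since the flag describes how a string is stored and not whether the data it came from was text.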