Check UTF8

jai_dgl has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Check UTF8 by ikegami (Patriarch) on Oct 22, 2008 at 16:40 UTC
Beware of what you ask for. The following script removes every line that only contains valid UTF-8.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( decode );
    while (<>) {
        print if !eval { decode('UTF-8', $_, Encode::FB_CROAK); 1 };
    }

Usage:

From a file: `remove_utf8_lines.pl infile > outfile`
From STDIN: `remove_utf8_lines.pl < infile > outfile`
In-place: `perl -i.bak remove_utf8_lines.pl file`

A better solution might be to convert the lines to another encoding.

    #!/usr/bin/perl
    use strict;
    use warnings;
    binmode(STDIN, ':encoding(UTF-8)');
    binmode(STDOUT, ':encoding(iso-latin-1)');
    print while <>;

Same usage as the original program.
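If the validity test is needed in more than one place, it could be wrapped in a small function. The following is only a sketch: the name `is_valid_utf8` is hypothetical and is not part of the post above or of Encode. It relies on the same `decode` call with `Encode::FB_CROAK` as the script, and works on a copy of the input because a CHECK value without `LEAVE_SRC` may modify the string it is given.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (name is an assumption): returns 1 if $bytes is
# strictly valid UTF-8, 0 otherwise. $bytes is a copy of the argument,
# so the caller's string is never modified by decode().
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Illustrative inputs: an e-acute encoded as UTF-8, then as Latin-1.
for my $sample ("caf\xC3\xA9\n", "caf\xE9\n") {
    printf "%s: %s", (is_valid_utf8($sample) ? 'valid' : 'invalid'), $sample;
}
```

Pure ASCII input also counts as valid here, since ASCII is a subset of UTF-8.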
Re: Check UTF8 by halley (Prior) on Oct 22, 2008 at 16:32 UTC
What have you tried so far? What did you expect to happen? I can only guess at your problem, as you gave very little detail. Do you want to keep lines that only use ASCII? Do you want to keep lines that are not UTF-8 but are valid Latin-1 or ISO-2022-JP or some other encoding? If it really is a matter of ASCII or non-ASCII UTF-8, just reject a line if it includes any character above chr(127). Other encodings will present a bit more challenge. -- `[ e d @ h a l l e y . c c ]`
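The chr(127) test suggested above could be sketched as follows, assuming the lines are read as raw bytes with no decoding layer on the input handle. The names `is_ascii_only` and `ascii_lines` are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the line has no character above
# chr(127), i.e. it is pure ASCII, and 0 otherwise.
sub is_ascii_only {
    my ($line) = @_;
    return $line !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical filter: keeps only the pure-ASCII lines from a list.
sub ascii_lines {
    return grep { is_ascii_only($_) } @_;
}

# Illustrative input: the second line holds a UTF-8 encoded e-acute.
print ascii_lines("plain text\n", "caf\xC3\xA9\n", "more text\n");
```

This only separates ASCII from everything else; it says nothing about which encoding the rejected lines use.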
Re: Check UTF8 by JavaFan (Canon) on Oct 22, 2008 at 19:59 UTC
There isn't enough information to write a program that does so. Files are just streams of bytes. And while many bytestreams can be determined to not be valid UTF-8, the reverse isn't true. For instance, if a line in the file has the bytes E2 A1 B9, is that a line with the three characters LATIN SMALL LETTER A WITH CIRCUMFLEX, INVERTED EXCLAMATION MARK, SUPERSCRIPT ONE (`â¡¹` in Latin-1), or the single character BRAILLE PATTERN DOTS-14567 (`⡹` in UTF-8)? And it may be something different again in one of the hundreds of other encodings that are out there. So, while you sometimes can determine that a line isn't UTF-8 (because not every byte sequence is valid UTF-8), you can never be sure a byte sequence is UTF-8 without additional information.
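The E2 A1 B9 ambiguity can be reproduced with a short sketch that decodes the same three bytes under both encodings and prints the resulting code points. The helper name `code_points` is hypothetical; the only library call used is Encode's `decode`.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The three bytes from the example above.
my $bytes = "\xE2\xA1\xB9";

# Hypothetical helper: formats each character of a decoded string
# as a "U+XXXX" label, separated by spaces.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $as_latin1 = decode('ISO-8859-1', $bytes);   # three characters
my $as_utf8   = decode('UTF-8',      $bytes);   # one character

print "Latin-1: ", code_points($as_latin1), "\n";   # U+00E2 U+00A1 U+00B9
print "UTF-8:   ", code_points($as_utf8),   "\n";   # U+2879
```

Both decodes succeed without error, which is the point: nothing in the bytes themselves says which reading was intended.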
Re^2: Check UTF8 by Anonymous Monk on Apr 26, 2011 at 22:00 UTC
True. So tell me: why on earth does the Unicode standard recommend against putting a BOM at the start of a UTF-8 file? Those guys must really like ambiguous data and the quandary it creates for software developers.

