jfrm has asked for the wisdom of the Perl Monks concerning the following question:
I spent most of last week wading/stumbling through a conversion of a large part of my system to UTF-8. After much reading and inexplicableness, I'm virtually there. Still have one problem though; one routine expects UTF-8 data and can crash if it gets some from a non-UTF-8 file. So I need to test a file for UTF-8-edness. I have read much documentation/Perlmonks/Stackoverflow and apparently the following should work:
open (ORDERFILE, '<:encoding(UTF-8)', $emailfile) or return (@err, "Could not open order email file: $emailfile");
my(@LINES) = <ORDERFILE>;
my $filedata = <ORDERFILE>;
close(ORDERFILE);
use Encode;
eval { my $utf8 = decode("utf8", $filedata, Encode::FB_CROAK ) };
return(@err, "File was not encoded in UTF-8") if ($@);
But I have ANSI files for which this doesn't return but just outputs lots of warnings such as: utf8 "\xA3" does not map to Unicode. If I remove the '<:encoding(UTF-8)' argument from open, it still works, but there are no warnings. A salient insight would be a welcome relief, if there are any ideas?
Re: By the shine on my bald pate, I dislike this encoding stuff
by haukex (Archbishop) on Mar 04, 2018 at 11:52 UTC
In addition to the issue poj pointed out with reading from
<ORDERFILE> twice (my(@LINES) = <ORDERFILE> reads all lines from
the file, so $filedata would normally be empty), I just wanted to point
out that the pattern eval {...}; if ($@) {...} has issues
and that the pattern eval {...; 1} or do {...} or a module like
Try::Tiny is better. Also, nowadays lexical filehandles (open my $fh, ...)
are generally preferred over bareword filehandles (open ORDERFILE, ...).
(Update: The AM also made a good point that you appear to be decoding the data twice.)
Really, the best way to go is to know in advance what encoding your files
are in, and then opening them with the appropriate encoding in
open my $fh, '<:encoding(...)', $filename or die $!;
You may want to have a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
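The three suggestions above (a lexical filehandle, the `eval {...; 1} or ...` pattern, and reading with a known encoding) can be combined into one short check. This is only a sketch based on the advice in this reply; the sub name decodes_as_utf8 is mine, not from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sketch only: a lexical filehandle, a raw slurp, and the
# eval { ...; 1 } or ... pattern instead of checking $@.
sub decodes_as_utf8 {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or die "Could not open $filename: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the whole file as octets
    close $fh;
    # FB_CROAK makes decode() die on malformed input instead of warning
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}
```

Because the file is opened `:raw`, decode() sees the actual octets and there is no double decoding.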
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 04, 2018 at 14:24 UTC
Strictly speaking, a file containing "\xA3" is not ASCII, since ASCII only consists of the characters from "\x00" to "\x7F". Maybe it's ISO Latin-1?
Also, your logic double-decodes the file. Assuming it is UTF-8, opening it '<:encoding(UTF-8)' decodes it, and then your decode() decodes it again.
My knee-jerk would be to apply Encode::Guess to the problem, since that way somebody else has worked out this mess for you, and since if you are going to convert the file to UTF-8 you need to know what its encoding currently is. If I just wanted to know if the file decoded as UTF-8 I might be lazy and do something like
open my $orderfile, '<:raw', $emailfile
or return( @err, "Could not open $emailfile: $!" );
local $/ = undef;
my $filedata = <$orderfile>;
close $orderfile;
use Encode;
eval {
decode( "utf-8", $filedata, Encode::FB_CROAK );
1;
} or return( @err, "File was not encoded in UTF-8" );
One possible source of confusion in this horrible mess is that the ASCII encoding is a subset of the UTF-8 encoding, so technically there is no way to distinguish between a file encoded in ASCII and a file encoded in UTF-8.
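The subset point is easy to verify: a pure-ASCII byte string decodes identically, and without error, under both encodings. A minimal demonstration (the sample string is mine):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Every valid ASCII file is also a valid UTF-8 file: the byte sequences
# are identical for code points 0x00-0x7F, so both decodes succeed and agree.
my $bytes    = "plain ASCII text\n";
my $as_ascii = decode('ascii', $bytes, Encode::FB_CROAK);
my $as_utf8  = decode('UTF-8', $bytes, Encode::FB_CROAK);
print $as_ascii eq $as_utf8 ? "indistinguishable\n" : "differ\n";
```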
Yep. Betcha the real problem is that the files which contain "non-ASCII characters" didn't use Unicode (UTF-8, UTF-16) to encode those characters, but instead used old-style code pages. But the program's logic assumes that it's Unicode without checking the entire file. I didn't see the OP ever describing what the nature of the "crash" actually is.
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 04, 2018 at 08:00 UTC
use Encode::Guess;
my $decoder = Encode::Guess->guess($filedata); # default detectable encodings include utf8.
return(@err, "Can't guess encoding: $decoder") unless ref($decoder);
This is still no good as it fails for all files, ANSI and UTF-8, with the error: Can't guess encoding: Empty string, empty guess.
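The "Empty string, empty guess" error is a strong hint that $filedata really was empty here (the double read from <ORDERFILE> noted elsewhere in the thread) rather than Encode::Guess being at fault. A sketch of how guess() behaves when handed actual bytes; the sample string is mine:

```perl
use strict;
use warnings;
use Encode::Guess;   # default suspects include ascii and utf8

# guess() returns a decoder object on success and a diagnostic
# string (like "Empty string, empty guess") on failure or ambiguity.
my $utf8_bytes = "caf\xC3\xA9\n";            # valid UTF-8, not valid ASCII
my $decoder    = Encode::Guess->guess($utf8_bytes);
if (ref $decoder) {
    printf "guessed: %s\n", $decoder->name;
} else {
    print "could not guess: $decoder\n";
}
```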
my(@LINES) = <ORDERFILE>;
my $filedata = <ORDERFILE>;
Maybe try
my @LINES = <ORDERFILE>;
my $filedata = join '',@LINES;
or just
open ORDERFILE, '<', $emailfile or die "$emailfile: $!";
my $filedata = do { local $/; <ORDERFILE> };
close ORDERFILE;
poj
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 06, 2018 at 11:09 UTC
open (ORDERFILE, '<:encoding(UTF-8)', $emailfile) or return (@err, "Could not open order email file: $emailfile");
#...
my $filedata = <ORDERFILE>;
#...
eval { my $utf8 = decode("utf8", $filedata, Encode::FB_CROAK ) };
First you're applying an IO layer to the filehandle to obtain characters decoded from UTF-8, then you additionally decode those Unicode characters as if they were UTF-8 bytes. If this works, it is by chance (i.e. when reading ASCII-only files).
You should either open with :encoding(UTF-8) (but then you'll get warnings on non-UTF-8 text) or open without the IO layer and do the decoding manually with the FB_CROAK option.
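A tiny demonstration of why the double decode only appears to work for ASCII; the \xA3 (pound sign) comes from the OP's warning message, and everything else here is an illustrative sketch:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# First decode: correct, turns UTF-8 octets into a character string.
my $bytes = encode('UTF-8', "\x{A3}");    # pound sign, two octets
my $chars = decode('UTF-8', $bytes);

# Second decode: treats the already-decoded characters as octets again.
# For pure ASCII the two representations coincide, so it happens to work;
# for "\x{A3}" the lone 0xA3 is malformed UTF-8 and FB_CROAK dies.
my $ascii_chars = decode('UTF-8', 'abc');
my $ascii_ok = eval { decode('UTF-8', $ascii_chars, Encode::FB_CROAK); 1 };
my $pound_ok = eval { decode('UTF-8', $chars,       Encode::FB_CROAK); 1 };
print $ascii_ok && !$pound_ok ? "double decode is unsafe\n" : "unexpected\n";
```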