jfrm has asked for the wisdom of the Perl Monks concerning the following question:
I spent most of last week wading/stumbling through a conversion of a large part of my system to UTF-8. After much reading and inexplicableness, I'm virtually there. Still have one problem though; one routine expects UTF-8 data and can crash if it gets some from a non-UTF-8 file. So I need to test a file for UTF-8-edness. I have read much documentation/Perlmonks/Stackoverflow and apparently the following should work:
open (ORDERFILE, '<:encoding(UTF-8)', $emailfile) or return (@err, "Could not open order email file: $emailfile");
my(@LINES) = <ORDERFILE>;
my $filedata = <ORDERFILE>;
close(ORDERFILE);
use Encode;
eval { my $utf8 = decode("utf8", $filedata, Encode::FB_CROAK ) };
return(@err, "File was not encoded in UTF-8") if ($@);
But I have ANSI files for which this doesn't return but just outputs lots of warnings such as: utf8 "\xA3" does not map to Unicode. If I remove the '<:encoding(UTF-8)' argument from open, it still works, but there are no warnings. A salient insight would be a welcome relief, if there are any ideas?
Re: By the shine on my bald pate, I dislike this encoding stuff
by haukex (Archbishop) on Mar 04, 2018 at 11:52 UTC
In addition to the issue poj pointed out with reading from
<ORDERFILE> twice (my(@LINES) = <ORDERFILE> reads all lines from
the file, so $filedata would normally be empty), I just wanted to point
out that the pattern eval {...}; if ($@) {...} has issues
and that the pattern eval {...; 1} or do {...} or a module like
Try::Tiny is better. Also, nowadays lexical filehandles (open my $fh, ...)
are generally preferred over bareword filehandles (open ORDERFILE, ...).
(Update: The AM also made a good point that you appear to be decoding the data twice.)
Really, the best way to go is to know in advance what encoding your files
are in, and then opening them with the appropriate encoding in
open my $fh, '<:encoding(...)', $filename or die $!;
You may want to have a look at The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
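The three suggestions above (a lexical filehandle, the `eval {...; 1} or ...` pattern, and reading with a known encoding) can be combined into one short check. This is only a sketch based on the advice in this reply; the sub name decodes_as_utf8 is mine, not from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sketch only: a lexical filehandle, a raw slurp, and the
# eval { ...; 1 } or ... pattern instead of checking $@.
sub decodes_as_utf8 {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or die "Could not open $filename: $!";
    my $bytes = do { local $/; <$fh> };   # slurp the whole file as octets
    close $fh;
    # FB_CROAK makes decode() die on malformed input instead of warning
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}
```

Because the file is opened `:raw`, decode() sees the actual octets and there is no double decoding.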
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 04, 2018 at 14:24 UTC
Strictly speaking, a file containing "\xA3" is not ASCII, since ASCII only consists of the characters from "\x00" to "\x7F". Maybe it's ISO Latin-1?
Also, your logic double-decodes the file. Assuming it is UTF-8, opening it '<:encoding(UTF-8)' decodes it, and then your decode() decodes it again.
My knee-jerk would be to apply Encode::Guess to the problem, since that way somebody else has worked out this mess for you, and since if you are going to convert the file to UTF-8 you need to know what its encoding currently is. If I just wanted to know if the file decoded as UTF-8 I might be lazy and do something like
open my $orderfile, '<:raw', $emailfile
or return( @err, "Could not open $emailfile: $!" );
local $/ = undef;
my $filedata = <$orderfile>;
close $orderfile;
use Encode;
eval {
decode( "utf-8", $filedata, Encode::FB_CROAK );
1;
} or return( @err, "File was not encoded in UTF-8" );
One possible source of confusion in this horrible mess is that the ASCII encoding is a subset of the UTF-8 encoding, so technically there is no way to distinguish between a file encoded in ASCII and a file encoded in UTF-8.
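The subset point is easy to verify: a pure-ASCII byte string decodes identically, and without error, under both encodings. A minimal demonstration (the sample string is mine):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Every valid ASCII file is also a valid UTF-8 file: the byte sequences
# are identical for code points 0x00-0x7F, so both decodes succeed and agree.
my $bytes    = "plain ASCII text\n";
my $as_ascii = decode('ascii', $bytes, Encode::FB_CROAK);
my $as_utf8  = decode('UTF-8', $bytes, Encode::FB_CROAK);
print $as_ascii eq $as_utf8 ? "indistinguishable\n" : "differ\n";
```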
Yep. Betcha the real problem is that the files which contain "non-ASCII characters" didn't use Unicode (UTF-8, UTF-16) to encode those characters, but instead used old-style code pages. But the program's logic assumes that it's Unicode without checking the entire file. I didn't see the OP ever describing what the nature of the "crash" actually is.
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 04, 2018 at 08:00 UTC
use Encode::Guess;
my $decoder = Encode::Guess->guess($filedata); # default detectable encodings include utf8.
return(@err, "Can't guess encoding: $decoder") unless ref($decoder);
This is still no good as it fails for all files, ANSI and UTF-8, with the error: Can't guess encoding: Empty string, empty guess.
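The "Empty string, empty guess" error is a strong hint that $filedata really was empty here (the double read from <ORDERFILE> noted elsewhere in the thread) rather than Encode::Guess being at fault. A sketch of how guess() behaves when handed actual bytes; the sample string is mine:

```perl
use strict;
use warnings;
use Encode::Guess;   # default suspects include ascii and utf8

# guess() returns a decoder object on success and a diagnostic
# string (like "Empty string, empty guess") on failure or ambiguity.
my $utf8_bytes = "caf\xC3\xA9\n";            # valid UTF-8, not valid ASCII
my $decoder    = Encode::Guess->guess($utf8_bytes);
if (ref $decoder) {
    printf "guessed: %s\n", $decoder->name;
} else {
    print "could not guess: $decoder\n";
}
```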
my(@LINES) = <ORDERFILE>;
my $filedata = <ORDERFILE>;
Maybe try
my @LINES = <ORDERFILE>;
my $filedata = join '',@LINES;
or just
open ORDERFILE, '<', $emailfile or die "$emailfile: $!";
my $filedata = do { local $/; <ORDERFILE> };
close ORDERFILE;
poj
Re: By the shine on my bald pate, I dislike this encoding stuff
by Anonymous Monk on Mar 06, 2018 at 11:09 UTC
open (ORDERFILE, '<:encoding(UTF-8)', $emailfile) or return (@err, "Could not open order email file: $emailfile");
#...
my $filedata = <ORDERFILE>;
#...
eval { my $utf8 = decode("utf8", $filedata, Encode::FB_CROAK ) };
First you're applying an IO layer to the filehandle to obtain characters decoded from UTF-8, then you additionally decode those Unicode characters as if they were UTF-8 bytes. If this works, it is by chance (i.e. when reading ASCII-only files).
You should either open with :encoding(UTF-8) (but then you'll get warnings on non-UTF-8 text) or open without the IO layer and do the decoding manually with the FB_CROAK option.
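A tiny demonstration of why the double decode only appears to work for ASCII; the \xA3 (pound sign) comes from the OP's warning message, and everything else here is an illustrative sketch:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# First decode: correct, turns UTF-8 octets into a character string.
my $bytes = encode('UTF-8', "\x{A3}");    # pound sign, two octets
my $chars = decode('UTF-8', $bytes);

# Second decode: treats the already-decoded characters as octets again.
# For pure ASCII the two representations coincide, so it happens to work;
# for "\x{A3}" the lone 0xA3 is malformed UTF-8 and FB_CROAK dies.
my $ascii_chars = decode('UTF-8', 'abc');
my $ascii_ok = eval { decode('UTF-8', $ascii_chars, Encode::FB_CROAK); 1 };
my $pound_ok = eval { decode('UTF-8', $chars,       Encode::FB_CROAK); 1 };
print $ascii_ok && !$pound_ok ? "double decode is unsafe\n" : "unexpected\n";
```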