 
PerlMonks  

Perl detect utf8, iso-8859-1 encoding

by swiftlet (Acolyte)
on Jul 24, 2020 at 09:05 UTC ( [id://11119739] )

swiftlet has asked for the wisdom of the Perl Monks concerning the following question:

Currently I am using Encode::Detect::Detector to detect the encoding of accented characters, and then Text::Unaccent to normalize them.

Encode::Detect::Detector normally works well:

author=Schöttl, encoding: UTF-8, author Schottl
author=Sch%F6ttl, encoding: windows-1252, author Schottl
Our Detector may crash if a user enters something like Sch%F6%F6ttl = Schööttl (2 "ö"):

author=Sch%F6%F6ttl, encoding: , author

Is there a better solution or module to detect the encoding when users input something like "öö"?

My form has accept-charset="utf8", but I can't stop search engines from using old iso-8859-1 links.

Replies are listed 'Best First'.
Re: Perl detect utf8, iso-8859-1 encoding
by haukex (Archbishop) on Jul 24, 2020 at 13:54 UTC

    If you're just trying to tell the difference between those two encodings, then note that a lot of text encoded with Latin1 is not valid UTF-8, so simply attempting to decode it as UTF-8 will already give you a very good hint. I demonstrated this with some code (plus heuristics) in this node.
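
    For illustration, a minimal sketch of that check (the helper name is mine, not from the linked node; the flags are the strict-decoding flags from Encode):

    use strict;
    use warnings;
    use Encode qw(decode);

    # Returns true if the byte string decodes cleanly as strict UTF-8.
    sub looks_like_utf8 {
        my ($octets) = @_;
        return defined eval { decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    }

    print looks_like_utf8("Sch\xC3\xB6ttl") ? "UTF-8\n" : "not UTF-8\n";   # UTF-8 bytes
    print looks_like_utf8("Sch\xF6\xF6ttl") ? "UTF-8\n" : "not UTF-8\n";   # Latin-1 bytes (two "ö")

    Anything that fails this check can then be treated as Latin-1 (or whatever your legacy encoding is).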

Re: Perl detect utf8, iso-8859-1 encoding
by Your Mother (Archbishop) on Jul 24, 2020 at 19:24 UTC

    Since no one has said it: detecting for encoding is always broken and should only ever be used as a last resort. If you have any way to either be sure of the encoding or enforce it up front, use it. If at all possible, look further up the chain for a way to do it correctly. Detection is incorrect, even if useful when there are no other options.

      I have charset="UTF-8" in the "Content-Type" header and in the "<meta>" tag, plus accept-charset="UTF-8" on the form. Is there anything else I can do to enforce it?

      Since our old iso-8859-1 links may have been saved by users or indexed by search engines, I can't find another way to keep the search backward-compatible without detection. Any suggestions?

        That sounds like a good first link in the chain. If you are receiving all your input through forms from UTF-8 declared pages, then you are only receiving UTF-8 data and you can treat it that way, ignoring all other encodings and definitely have no need to guess. If that’s all true and you’re having problems, probably you are not decoding correctly from the first step to the next processing steps. We’d need more info about the full processing chain to guide you there. This is overwhelming—and overkill because most basic processing chains don’t need to consider most of it—but it is the Rosetta Stone for the issues: Why does modern Perl avoid UTF-8 by default?
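
        As a sketch of what "decoding correctly from the first step" could mean in a plain CGI script (the parameter name is taken from the examples above; your actual chain may of course look different):

        use strict;
        use warnings;
        use CGI;
        use Encode qw(decode);

        my $q = CGI->new;

        # The form was served with accept-charset="UTF-8", so the raw bytes
        # arriving here are UTF-8; decode them once, at the boundary.
        my $author = decode('UTF-8', scalar $q->param('author'));

        # From here on, work only with decoded character strings, and
        # encode back to UTF-8 only when producing output.

        Everything downstream then deals with characters rather than bytes, and no guessing is needed.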

Re: Perl detect utf8, iso-8859-1 encoding
by Corion (Patriarch) on Jul 24, 2020 at 09:10 UTC

    Can you show us some short, self-contained example Perl code using Encode::Detect::Detector that replicates the behaviour? This helps us see how the Perl code crashes and maybe we find a good way around this.

      Actually I shouldn't call it "crashed"; it just can't detect the encoding if there are two "ö". Here is the code to duplicate the problem: one "ö" is fine, two "ö" separated by a space ("ö ö") is fine, and "öñ" (%F6%F1) is fine, but not "öö".
      use utf8;
      use Text::Unaccent;
      use Encode::Detect::Detector;

      ## my $author = "Sch%F6ttl";
      my $author = "Sch%F6%F6ttl";
      $author =~ s/%([a-zA-Z0-9][a-zA-Z0-9])/pack('C',hex($1))/eg;
      my $encoding = Encode::Detect::Detector::detect($author);
      print "encoding: $encoding: $author <br>\n";
      if ($encoding) {
          $author = unac_string($encoding, $author);
          print "after unac: $author<br>\n";
      }

        I'm guessing the bug is in Text::Unaccent, but it's directly using the iconv C library, so I can't easily say for sure.

        However, maybe this can work:

        use strict;
        use feature qw(unicode_strings say);
        use Unicode::Normalize 'NFD';

        my $author = "Sch\x{f6}\x{f6}ttl";
        $author = NFD $author;
        $author =~ s/\p{Combining_Diacritical_Marks}//g;
        say $author;

        This doesn't include any decode() or encode() of the incoming/outgoing strings. Also, I think this can break in cases where there are multiple combining characters.

        Well, that would be a bug ... please go to CPAN and report it.
Re: Perl detect utf8, iso-8859-1 encoding
by choroba (Cardinal) on Jul 24, 2020 at 20:13 UTC
    I have never needed to guess the encoding, but I've noticed Encode::Guess mentioned several times here. Have you tried it? Does it have the same problems as the Encode::Detect::Detector?
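
    For reference, a minimal sketch of how Encode::Guess could be pointed at the string from the original post (the suspect list here is only latin1; I have not checked whether it copes with the "öö" case any better):

    use strict;
    use warnings;
    use Encode::Guess;

    my $octets = "Sch\xF6\xF6ttl";    # the problem string, as Latin-1 bytes

    # ascii, utf8 and BOM-ed UTF-16/32 are checked by default; latin1 is our
    # extra suspect.  guess_encoding() returns an encoding object on success
    # and an error *string* when the data are ambiguous.
    my $enc = guess_encoding($octets, qw/latin1/);

    if (ref $enc) {
        printf "guessed %s\n", $enc->name;
        my $chars = $enc->decode($octets);
    }
    else {
        print "could not guess: $enc\n";
    }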

Re: Perl detect utf8, iso-8859-1 encoding
by bliako (Monsignor) on Jul 25, 2020 at 08:21 UTC

    Can statistical analysis of your input at the N-byte level, specific to your texts, help you?

    This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform language detection. This process is not foolproof because it depends on statistical data.

    Additionally, as others have said, you can take advantage of this:

    One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding. For example, it was common that web sites in UTF-8 containing the name of the German city München were shown as MÃ¼nchen, due to the code deciding it was an ISO-8859 encoding before even testing to see if it was UTF-8.

    Both quotations are from https://en.wikipedia.org/wiki/Charset_detection

    1st Edit: n-dimensional statistical analysis of DNA sequences (or text, or ...) can help you, using n-dimensional *sparse* histograms; otherwise CPAN may be of help. Math::Histogram is not sparse, but for N=4 it will be OK.
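
    A plain Perl hash is already a sparse histogram for byte n-grams, so a profile of your own texts can be built with something like this (N and the sample string are only placeholders):

    use strict;
    use warnings;

    my $N    = 3;                                  # trigraphs
    my $text = "Sch\xF6ttl M\xFCnchen Stra\xDFe";  # sample Latin-1 bytes

    # Count every N-byte window; only n-grams that actually occur use memory.
    my %count;
    $count{ substr($text, $_, $N) }++ for 0 .. length($text) - $N;

    # Print the profile, most frequent first, with the bytes shown as hex.
    for my $gram (sort { $count{$b} <=> $count{$a} } keys %count) {
        printf "%s => %d\n", unpack("H*", $gram), $count{$gram};
    }

    Comparing such a profile against profiles built from known UTF-8 and known iso-8859-1 samples of your own data is the statistical detection idea in a nutshell.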

    bw, bliako

Re: Perl detect utf8, iso-8859-1 encoding
by jcb (Parson) on Jul 25, 2020 at 00:02 UTC

    Fundamentally, you cannot reliably detect encodings. You can guess UTF-8 if the input is valid UTF-8, but that is still a guess at best.

    The problem is that pre-Unicode encodings actually made full use of the available 256 codepoints in an octet. UTF-8 must use those same 256 codepoints (and the lower 128 are ASCII), so all valid UTF-8 is also valid in other encodings. There is no general solution to this problem, although you might be able to make some headway with either a dictionary of valid names, or some rules for recognizing "plausible" names — that is, names that use only characters used in names from one language, since mixed-language names are highly unlikely.

    For the special case of deciding whether the input is UTF-8 as requested or ISO-Latin-1 due to following an outdated link, you can probably make good progress by simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not. This is not exactly correct, but is probably a fair heuristic.

      simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not

      Thanks! This is a good idea, but how can I find out whether the input is valid UTF-8 or not? Neither utf8::valid nor utf8::is_utf8 is working well in my examples.

        To check whether data are valid UTF-8 is rather straightforward. Here's an example, slightly modified from the synopsis of Encode:
        use Encode qw(decode encode);
        $characters = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

        This code will die if there are invalid data, so you would wrap it into the exception handler of your choice; plain eval and Try::Tiny seem to be popular.
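
        For example, with Try::Tiny, and using iso-8859-1 as the fallback for data that fail the UTF-8 check (the fallback choice is just the heuristic discussed above, not the only option):

        use strict;
        use warnings;
        use Try::Tiny;
        use Encode qw(decode);

        my $octets = "Sch\xF6\xF6ttl";    # Latin-1 bytes, not valid UTF-8

        my $characters = try {
            decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        } catch {
            # not valid UTF-8: decode as Latin-1, which accepts any byte
            decode('ISO-8859-1', $octets);
        };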

        BTW: as jcb already indicated, chances are excellent that if data pass as UTF-8, they actually are UTF-8. All bytes of multibyte characters in valid UTF-8 strings are in the range \x80 to \xFF, and in particular the bytes 2-4 are in the range \x80-\xBF. You just can't build readable text from characters in that range in any of the ISO-8859-* encodings, and about half of that range are "unprintable" control characters from ISO/IEC 6429.

Re: Perl detect utf8, iso-8859-1 encoding
by ikegami (Patriarch) on Jul 25, 2020 at 08:25 UTC
Re: Perl detect utf8, iso-8859-1 encoding
by swiftlet (Acolyte) on Jul 25, 2020 at 09:35 UTC

    I am afraid I do not have the luxury to discard all non-utf8 input, but I can simplify the code:

    If the input is not detected as utf8, just treat it as iso-8859-1:

    use Text::Unaccent;
    use Encode::Detect::Detector;

    # my $author = "Sch%F6%E5ttl";
    # my $author = "Sch%C3%A9ttl";
    # my $author = "Sch%C3%B6ttl";
    # my $author = "Sch%F6%F6ttl";
    # my $author = "Sch%F6 %F4ttl";
    my $author = "teoria elasticit%E0";
    $author =~ s/%([a-zA-Z0-9][a-zA-Z0-9])/pack('C',hex($1))/eg;
    my $encoding = Encode::Detect::Detector::detect($author);
    if ($encoding !~ m#utf-8#i) {
        $encoding = "iso-8859-1";
    }
    if ($encoding) {
        $author = unac_string($encoding, $author);
        print "after unac: $author<br>\n";
    }

    It seems to be working better. Are there any potential problems?
