Re: Perl detect utf8, iso-8859-1 encoding

by bliako (Prior)
on Jul 25, 2020 at 08:21 UTC (#11119789)


in reply to Perl detect utf8, iso-8859-1 encoding

Can statistical analysis of your input at the N-byte level help you, tuned specifically to your own texts?

This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform language detection. This process is not foolproof because it depends on statistical data.

Additionally, as others have said, you can take advantage of this:

One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of invalid byte sequences in UTF-8, so that text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. However, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 is some other encoding. For example, it was common that web sites in UTF-8 containing the name of the German city München were shown as MÃ¼nchen, due to the code deciding it was an ISO-8859 encoding before even testing to see if it was UTF-8.
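A minimal sketch of the "run the UTF-8 test first" idea from the quotation above, using the core Encode module. The function name guess_utf8_or_latin1 is hypothetical, and the ISO-8859-1 branch is only a fallback assumption: every byte string is valid ISO-8859-1, so that branch proves nothing about the real encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Guess between UTF-8 and ISO-8859-1 by running the strict UTF-8 test first.
# Returns a list: (encoding name, decoded string).
# NOTE: the function name and the 'ascii' label are illustrative choices.
sub guess_utf8_or_latin1 {
    my ($bytes) = @_;

    # Pure ASCII is identical in both encodings, so there is nothing to decide.
    return ('ascii', $bytes) unless $bytes =~ /[\x80-\xFF]/;

    # Strict decode of a copy: with FB_CROAK, decode() dies on malformed
    # input and may modify its argument in place, hence the copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ('UTF-8', $text) if defined $text;

    # Fallback (an assumption, not a detection): any byte string
    # decodes as ISO-8859-1 without error.
    return ('ISO-8859-1', decode('ISO-8859-1', $bytes));
}
```

Called with the bytes "M\xC3\xBCnchen" this should report UTF-8, and with "M\xFCnchen" it should fall through to ISO-8859-1. If your data may also contain other 8-bit encodings (e.g. Windows-1252), the fallback branch is where statistical analysis would have to take over.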

Both quotations are from https://en.wikipedia.org/wiki/Charset_detection

1' Edit: "n-dimensional statistical analysis of DNA sequences (or text, or ...)" can help you with n-dimensional *sparse* histograms; otherwise CPAN may be of help. Math::Histogram is not sparse, but for N=4 it will be OK.
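To illustrate what "sparse" buys you here, a rough sketch using a plain Perl hash keyed by the N-byte sequence: only n-grams that actually occur take up space, instead of 256**N bins. This is not the code from the node mentioned above nor the Math::Histogram API; the names ngram_histogram and normalise are hypothetical.

```perl
use strict;
use warnings;

# Count overlapping N-byte sequences in a byte string.
# The hash stores only n-grams that occur, so it behaves as a
# sparse histogram (at most length-N+1 keys).
sub ngram_histogram {
    my ($bytes, $n) = @_;
    my %count;
    # The range is empty when the input is shorter than $n.
    for my $i (0 .. length($bytes) - $n) {
        $count{ substr($bytes, $i, $n) }++;
    }
    return \%count;
}

# Turn raw counts into relative frequencies so that inputs of
# different lengths can be compared.
sub normalise {
    my ($count) = @_;
    my $total = 0;
    $total += $_ for values %$count;
    return {} unless $total;
    return { map { $_ => $count->{$_} / $total } keys %$count };
}
```

You would build one such frequency table per candidate encoding from texts you already know the encoding of, then compare the table of an unknown input against each. How to measure the distance between tables is left open here.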

bw, bliako
