Re^2: Perl detect utf8, iso-8859-1 encoding



by swiftlet (Acolyte)

on Jul 25, 2020 at 00:50 UTC ( [id://11119783]=note )

in reply to Re: Perl detect utf8, iso-8859-1 encoding
in thread Perl detect utf8, iso-8859-1 encoding

"simply checking if the input is valid UTF-8 and assuming ISO-Latin-1 if not"

Thanks! This is a good idea, but how can I find out whether the input is valid UTF-8 or not? Neither utf8::valid nor utf8::is_utf8 works well in my examples.

Re^3: Perl detect utf8, iso-8859-1 encoding
by haj (Vicar) on Jul 25, 2020 at 08:50 UTC

The Encode module can do this:

use Encode qw(decode encode);
# FB_CROAK makes decode() die on invalid UTF-8;
# LEAVE_SRC keeps $octets unmodified.
$characters = decode('UTF-8', $octets,
                     Encode::FB_CROAK | Encode::LEAVE_SRC);

This code will die if the data are invalid, so you would wrap it in the exception handler of your choice; plain eval and Try::Tiny seem to be popular.
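A minimal sketch of the plain-eval variant, combined with the "assume ISO-Latin-1 if not" fallback from earlier in the thread. The helper name decode_utf8_or_latin1 and the sample strings are made up for illustration; only decode and the two Encode constants come from the snippet above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): try strict UTF-8 first,
# fall back to ISO-8859-1 if decoding dies.
sub decode_utf8_or_latin1 {
    my ($octets) = @_;

    # eval catches the exception thrown by FB_CROAK on invalid input.
    my $characters = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ($characters, 'UTF-8') if defined $characters;

    # Every byte maps to a character in ISO-8859-1, so this cannot fail.
    return (decode('ISO-8859-1', $octets), 'ISO-8859-1');
}

# Sample inputs (assumptions for the demo): "café" encoded both ways.
for my $octets ("caf\xC3\xA9", "caf\xE9 au lait") {
    my ($text, $encoding) = decode_utf8_or_latin1($octets);
    printf "%d characters, treated as %s\n", length($text), $encoding;
}
```

The helper returns the decoded text together with the encoding it settled on, so the caller can log or report the guess.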

BTW: as jcb already indicated, chances are excellent that if data pass as UTF-8, they actually are UTF-8. All bytes of multibyte characters in valid UTF-8 strings are in the range \x80 to \xFF, and in particular bytes 2-4 of each character are in the range \x80-\xBF. You just can't build readable text from characters in that range in any of the ISO-8859-* encodings, and the lower half of it (\x80-\x9F) consists of "unprintable" control characters from ISO/IEC 6429.
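To illustrate the point, here is a small sketch that runs a few byte strings through the same strict decode. The helper is_valid_utf8 and the samples are hypothetical and chosen for the demo. Typical Latin-1 text has an accented letter followed by a plain ASCII byte, which is not a valid UTF-8 sequence; the one sample that passes would read as "Ã©tÃ©" in Latin-1, which is hardly text anyone meant to write.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the octets decode as strict UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my @samples = (
    [ "caf\xE9 au lait",  'Latin-1: lead byte \xE9 followed by a space' ],
    [ "na\xEFve",         'Latin-1: lead byte \xEF followed by "v"' ],
    [ "\xA9 2020",        'Latin-1: \xA9 is a continuation byte with no lead byte' ],
    [ "\xC3\xA9t\xC3\xA9", 'UTF-8 for "ete" with accents; odd-looking as Latin-1' ],
);

for my $sample (@samples) {
    my ($octets, $note) = @$sample;
    printf "%-7s %s\n", is_valid_utf8($octets) ? 'valid' : 'invalid', $note;
}
```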
