PerlMonks  

utf8 char or binary string detection

by igoryonya (Pilgrim)
on Nov 07, 2015 at 08:04 UTC ( [id://1147176]=perlquestion )

igoryonya has asked for the wisdom of the Perl Monks concerning the following question:

I have a variable that sometimes contains a valid utf8 string, which displays correctly on stdout, and sometimes contains a string of bytes, which displays as gibberish on stdout.
If I use $val = decode('utf8', $val); on the invalid string (the one with bytes), it starts displaying correctly on stdout.
If I test $val with utf8::is_utf8($val) before decoding it, the result is true for both the bad and the good string.
How can I automatically determine whether the string is bytes, and run the decode command only then?

Replies are listed 'Best First'.
Re: utf8 char or binary string detection
by Laurent_R (Canon) on Nov 07, 2015 at 09:39 UTC
    Even if the string is only a stream of random bytes, and especially if the string is short, it may sometimes happen that those bytes turn out to be a valid utf8 string (probably meaningless in your context, but still technically valid). It is presumably too late to know at this point.

    Maybe you should say more about the general process, to figure out if something can be done upstream.
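    To illustrate the ambiguity: the two bytes below decode cleanly under both readings, so validity alone cannot tell which one was intended. This is a minimal sketch; the sample bytes are chosen purely for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Two bytes that are valid both as Latin-1 text and as UTF-8 text.
my $bytes = "\xC3\xA9";

# Strict UTF-8 decoding succeeds and yields a single character, U+00E9.
my $as_utf8 = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Latin-1 decoding also succeeds and yields two characters, U+00C3 and U+00A9.
my $as_latin1 = Encode::decode('ISO-8859-1', $bytes);

printf "UTF-8 reading:   %d character(s)\n", length($as_utf8);
printf "Latin-1 reading: %d character(s)\n", length($as_latin1);
```

    Neither decode fails, which is why a "does it decode?" test can only ever be a heuristic.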

Re: utf8 char or binary string detection
by Anonymous Monk on Nov 08, 2015 at 17:01 UTC
    I have a variable that sometimes contains a valid utf8 string, which displays correctly on stdout, and sometimes contains a string of bytes, which displays as gibberish on stdout.
    The thing is, there is only one type of string in Perl.
    How can I automatically determine whether the string is bytes, and run the decode command only then?
    You can't (in the general case). Which may or may not be an XY problem.

    So, what are you actually trying to do?

    I have a subroutine that I wrote for outputting stuff to stdout. I do not use print directly, because my subroutine handles everything automatically, so that I can use one program in CGI, terminal STDOUT and GUI without rewriting. I need a way in that subroutine to detect whether the variable it received is utf8 or a byte string.
    Why do you need to detect that? Is that because "conversion to utf8 can break some filenames"? Perhaps that's not really a problem: just let decode blow up and catch the error (with eval). Something like this:
        use strict;
        use warnings;
        use Encode;

        # Croak on malformed input and leave the source string untouched.
        my $enc_flags = Encode::FB_CROAK | Encode::LEAVE_SRC;
        binmode STDOUT, ':encoding(utf-8)';

        while ( my $line = <> ) {
            chomp $line;
            # Try strict UTF-8 first; fall back if decode dies.
            my $decoded = eval { Encode::decode( 'utf-8', $line, $enc_flags ); }
                || bad_string( $line );
            print $decoded, "\n";
        }

        sub bad_string {
            # "upgrade" the string
            Encode::decode( 'latin-1', shift );
        }

Re: utf8 char or binary string detection
by Anonymous Monk on Nov 07, 2015 at 08:37 UTC

    How can I automatically determine whether the string is bytes, and run the decode command only then?

    What you're supposed to do is fix the code that puts stuff into $val, at the point where it puts stuff into $val, instead of trying to work around it later on. Fix it at the source.

      Can't fix that. It's not an error in the program.
      When I get filenames, I have to use them in their byte representation instead of utf8, because conversion to utf8 can break some filenames.
      Mostly, everything else needs to be in utf8.
      I use file system path modules to manipulate the dir and file names, etc.
      When I print those path names to the screen without decoding them, non-Latin letters show up as garbage. In certain cases, after manipulating the path names, those modules keep the strings in byte representation but turn the variable's utf8 flag on, so the standard utf8 check reports the byte contents as already being utf8.
      I have a subroutine that I wrote for outputting stuff to stdout. I do not use print directly, because my subroutine handles everything automatically, so that I can use one program in CGI, terminal STDOUT and GUI without rewriting.
      I need a way in that subroutine to detect whether the variable it received is utf8 or a byte string.
      I used to use: use encoding 'utf8', STDOUT => 'utf8';, and it worked automatically, but since perl version 5.20 or so the encoding pragma is finally deprecated, so I have to think of another way to solve this encoding issue.
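      For the filename case described above, one option is to keep the raw bytes for every filesystem call and decode only a copy for display. The sketch below assumes the argument really is a byte string and that names are usually UTF-8; display_name is a hypothetical helper, not part of any path module mentioned in this thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns a printable character string for a
# filename held as raw bytes. With Encode's default CHECK behaviour,
# malformed UTF-8 is replaced with U+FFFD instead of raising an error.
sub display_name {
    my ($bytes) = @_;    # work on a copy; the caller's bytes stay as they are
    return Encode::decode('UTF-8', $bytes);
}

my $raw   = "a\xFFb";              # not valid UTF-8
my $shown = display_name($raw);    # suitable for printing through a UTF-8 layer

# $raw is still the original byte string and can be passed to open() and friends.
printf "raw is %d bytes, display copy is %d characters\n", length($raw), length($shown);
```

      The trade-off is that the display copy may be lossy, so it should never be fed back into filesystem calls.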
Re: utf8 char or binary string detection
by andal (Hermit) on Nov 09, 2015 at 08:57 UTC

    Disclaimer: my answer is based on how I understood your question, if it is wrong, then please try to rephrase your question to make it easier to understand :)

    First of all, some clarification. When your program obtains something from the system, it is always just a sequence of bytes. After that, the program may decide to treat this sequence as something more than just bytes. When your program gives something back to the system, it must again be just a sequence of bytes, no matter how the program was viewing it before.

    The text (including file names) may contain multi-byte characters, which follow certain rules (an encoding). Still, to the system the text appears to be just a sequence of bytes. When a perl program receives this sequence, it may decide to view it as a chain of characters. The function Encode::decode is provided for that. In fact, to add more convenience and confusion, this functionality can be attached to an input stream, but at the base level one just converts bytes to characters using Encode::decode. At this point, the piece of data receives the "is_utf8" marker. This does not mean that the text is really in utf-8; it just means that perl tries to work with it as characters.

    When you want to give that data back to the system, for example when printing to the screen or writing to a file, you must convert it back to bytes using Encode::encode. This strips the "is_utf8" flag from the data. Again, to add confusion, this conversion may be attached to the output stream.
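    The decode/encode round trip described above can be sketched as follows. The sample octets are an assumption made for illustration (a single "é" encoded as UTF-8).

```perl
use strict;
use warnings;
use Encode ();

# Octets as they might arrive from the system: "é" encoded in UTF-8.
my $octets = "\xC3\xA9";

# Bytes -> characters: the result is one character, U+00E9.
my $chars = Encode::decode('UTF-8', $octets);

# Characters -> bytes, ready to be handed back to the system.
my $out = Encode::encode('UTF-8', $chars);

for my $item ( [ octets => $octets ], [ chars => $chars ], [ out => $out ] ) {
    my ( $name, $value ) = @$item;
    printf "%-6s length %d, is_utf8 %s\n",
        $name, length($value), ( utf8::is_utf8($value) ? 'on' : 'off' );
}
```

    The length changes from 2 to 1 and back to 2, which is usually a more useful sanity check than the flag itself.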

    As a "side effect", both functions may convert from one character encoding to another, but that can create problems if the input does not contain text in the expected character encoding.

    The function utf8::is_utf8 just reports whether perl sees the piece of data as a "chain of characters" instead of a "chain of bytes". Printing such data can produce a "Wide character" warning (when it contains characters above code point 255), since for output one must give only a "chain of bytes". Again, you can manipulate the output stream to perform the conversion automatically and avoid the warning.
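    A small sketch of why the flag says nothing about the content: utf8::upgrade changes only the internal storage, so two strings that compare equal can report different flags. The sample string is an assumption made for illustration.

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";    # four characters, each stored as one byte
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same four characters, different internal storage

printf "plain:    flag %s, length %d\n",
    ( utf8::is_utf8($plain) ? 'on' : 'off' ), length($plain);
printf "upgraded: flag %s, length %d\n",
    ( utf8::is_utf8($upgraded) ? 'on' : 'off' ), length($upgraded);
printf "equal as strings: %s\n", ( $plain eq $upgraded ? 'yes' : 'no' );
```

    Both strings have length 4 and compare equal; only the flag differs, so it cannot be used to tell decoded text from undecoded bytes.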

    Now, to the problem with file names and utf-8. Quite often a "double conversion" happens. A program gives the system a string containing, for example, bytes representing Russian characters in UTF-8 encoding. The file system receives this string, but it has an option indicating that all input to it is in Latin1 encoding and must be converted to UTF-8. So the file system converts the data one more time, and as a result the user sees junk, even though this junk is validly encoded UTF-8. That is why, when mounting external disks, I usually pass the "utf8" option to the mount command.

    Obviously, if your program gets junk encoded as UTF-8, there's no way for your program to fix things unless you know how the junk was created in the first place. For example, in the above case, where legal UTF-8 was treated as Latin1 and converted one more time to UTF-8, one can try the reverse conversion from UTF-8 to Latin1. Something like Encode::from_to($bytes, 'UTF-8', 'Latin1'). Again, this works only if you know why you got the junk.
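    The double conversion and its reversal can be reproduced in a few lines. This is a sketch under the assumption stated above, namely that the junk came from UTF-8 octets being misread as Latin1 and encoded again; the sample text is three Cyrillic letters chosen for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Correct UTF-8 octets for three Cyrillic letters (6 bytes).
my $good = Encode::encode('UTF-8', "\x{43F}\x{440}\x{438}");

# Simulate the faulty layer: read every octet as a Latin1 character,
# then encode the result as UTF-8 a second time.
my $junk = Encode::encode('UTF-8', Encode::decode('ISO-8859-1', $good));

# Reverse it: from_to converts the octets in place, UTF-8 -> Latin1.
my $repaired = $junk;
Encode::from_to($repaired, 'UTF-8', 'ISO-8859-1');

printf "good: %d bytes, junk: %d bytes, repaired: %d bytes\n",
    length($good), length($junk), length($repaired);
print "repair ", ( $repaired eq $good ? "succeeded" : "failed" ), "\n";
```

    Applying the same repair to data that was never double-encoded would damage it, which is why knowing the cause matters.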

    In general, to avoid problems, one should just follow the simple rule "when communicating with the system, get and give only bytes (octets)". To achieve this, one can use either the Encode module or various pragmas. When working with modules, one has to learn carefully what those modules expect and produce. If it is not documented, then one has to experiment. Here "utf8::is_utf8" or "Encode::is_utf8" can be used to check whether multi-byte data is treated as a sequence of bytes or as a sequence of characters.
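    As a sketch of attaching the conversion to the output stream: the handle below carries an :encoding(UTF-8) layer, so character strings printed to it come out as octets. An in-memory handle is used here only so the result can be inspected; the same layer can be set on STDOUT with binmode, as in the example earlier in this thread.

```perl
use strict;
use warnings;

my $chars = "\x{43F}\x{440}\x{438}";    # three characters, already decoded

# Write through an encoding layer into an in-memory buffer.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$fh} $chars;
close $fh or die "close failed: $!";

printf "wrote %d characters as %d bytes\n", length($chars), length($buffer);
```

    With the layer in place, the program works with characters internally and the byte conversion happens only at the boundary.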

Re: utf8 char or binary string detection
by nikosv (Deacon) on Nov 08, 2015 at 19:43 UTC
    Don't rely on the utf8 flag in any case. It doesn't do what you think it does.

    Other than that, I don't understand what you mean by 'if the string is bytes'. It's always bytes; it's the encoding that morphs it (into utf8 or otherwise).

    I think you meant 'I want to know when it is or is not utf8'.
    You can't know, because utf8 does not have a signature to denote that what follows is utf8, except when the BOM is used, which is rare.
