
Re: RT::Client turns occasional binary characters in to wide characters

by cavac (Curate)
on Oct 03, 2018 at 11:24 UTC (#1223454)


in reply to RT::Client turns occasional binary characters in to wide characters

Tried this?

    use Encode qw[encode_utf8 is_utf8];
    ...
    if (is_utf8($mystring)) {
        $mystring = encode_utf8($mystring);
    }

This should give you a string containing only bytes 0x00 to 0xFF. In all likelihood, the module you are using treats incoming data as UTF-8 and decodes it into characters.

"For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it get's food on the table."

Replies are listed 'Best First'.
Re^2: RT::Client turns occasional binary characters in to wide characters
by Anonymous Monk on Oct 03, 2018 at 22:10 UTC
    Please do not propagate the trap of using is_utf8 for Perl code. It does not indicate if the string you have is UTF-8 encoded bytes. It is only an internal flag for Perl's own use and XS code. It is possible, especially after people try hacks like this, or write incomplete XS code, to have byte-strings where is_utf8 is true, and character strings where is_utf8 is false. I would link to some RT bugs for more reading about the issue, but the website doesn't allow me to post them.
      Thanks for the response. Given that this string that I am retrieving is actually the contents of a binary file then I should be OK to ignore anything to do with UTF8, given that my source code has no eight-bit or more characters.

      Working from that I removed every reference to UTF8 subroutines from my code but I still get this wide character complaint when I try and write the string contents out to a binary (or any) file. So I have removed one potential issue (UTF8) but it's still got a problem.

      While I take you at your word that this is not a UTF8 problem (as I understand it), it's odd that running encode('UTF-8'... against the string and writing the results out does not generate this wide character warning.
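A minimal sketch of why the warning goes away after encoding, assuming the string holds at least one code point above 0xFF. The sample data is hypothetical and only stands in for the attachment content. Encoding makes every character writable as bytes, but the result is not the original file content.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: ZIP-like data with one stray code point above 0xFF.
my $data = "PK\x03\x04g\x{FFFD}%F";

# True if any character cannot be written as a single byte.
# Printing such a string to a handle without an encoding layer
# is what triggers the "Wide character" warning.
my $has_wide = $data =~ /[^\x00-\xFF]/ ? 1 : 0;

# encode() turns every character into UTF-8 bytes, so nothing above
# 0xFF remains, but U+FFFD is now the three bytes EF BF BD.
my $encoded          = encode('UTF-8', $data);
my $encoded_has_wide = $encoded =~ /[^\x00-\xFF]/ ? 1 : 0;

printf "wide before: %d, wide after encode: %d\n", $has_wide, $encoded_has_wide;
printf "length before: %d, after: %d\n", length($data), length($encoded);
```

So the absence of the warning after encode only shows that the string now fits in bytes, not that the binary content has been restored.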

        Given that this string that I am retrieving is actually the contents of a binary file then I should be OK to ignore anything to do with UTF8, given that my source code has no eight-bit or more characters.

        It depends on how the data is handed to you. Note how below, both byte sequences are \304\243, but they're getting different interpretations based on Perl's internal UTF8 flag. If the module is handing you binary data with some encoding/decoding issues or perhaps the UTF8 flag incorrectly enabled, you'll have these kinds of strange issues that may explain the presence of U+FFFD REPLACEMENT CHARACTER in your original hex dump. Could you show your data with Devel::Peek?

        $ perl -CSD -MDevel::Peek -le 'my $x="\x{123}"; print $x; Dump($x)'
        ģ
        SV = PV(0x1337d70) at 0x1357518
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK,UTF8)
          PV = 0x1359790 "\304\243"\0 [UTF8 "\x{123}"]
          CUR = 2
          LEN = 10
          COW_REFCNT = 1
        $ perl -CSD -MDevel::Peek -le 'my $x="\304\243"; print $x; Dump($x)'
        ģ
        SV = PV(0x1e28d70) at 0x1e48518
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x1e4a790 "\304\243"\0
          CUR = 2
          LEN = 10
          COW_REFCNT = 1
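The two dumps above can also be compared from Perl code without looking at the internal flag. This is a small sketch using Encode's decode and encode on the same two octets; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\304\243";                # two octets, 0xC4 0xA3
my $chars = decode('UTF-8', $bytes);   # one character, U+0123

printf "bytes: length %d\n", length($bytes);
printf "chars: length %d, ord 0x%X\n", length($chars), ord($chars);

# Encoding the character string gives back the same two octets.
my $round_trip = encode('UTF-8', $chars);
print "round trip matches\n" if $round_trip eq $bytes;
```

Which of the two forms the module returns decides whether a given code point is data or a decoding artefact.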
        
Re^2: RT::Client turns occasional binary characters in to wide characters
by wardmw (Acolyte) on Oct 03, 2018 at 15:56 UTC
    Thanks for that. According to is_utf8() the string is in UTF8, however running encode_utf8() doesn't resolve the problem. It *does* remove the 4-character hex, but doesn't restore the bytes to what they were originally:

    encode_utf8() version:
    00000000  50 4B 03 04 14 00 09 00 08 00 67 EF BF BD 25 46  PK........g...%F
    Original version:
    0000000  50 4b 03 04 14 00 09 00 08 00 67 8d 25 46 00 00
    I took a look at the attributes of the file, as @Veltro suggested and got the following:
    content_type is: application/octet-stream
    content_encoding is: none
    file_name is: screenshot-172 21 242 64.zip
    headers is:
    Content-Type: application/octet-stream; name="screenshot-172 21 242 64.zip"
    Content-Disposition: attachment; filename="screenshot-172 21 242 64.zip"
    Content-Transfer-Encoding: base64
    Content-Length: 460749

    That "base64" string in the headers section looked interesting, although the string does not seem to be encoded insofar as it has characters in it that do not match the Base64 character set (A-Za-z0-9+/=).
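The check described here can be sketched as follows. The helper name looks_like_base64 is hypothetical, and the sample bytes only stand in for the start of a ZIP file. MIME::Base64 is a core module.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical helper: true if the string uses only the Base64
# alphabet plus whitespace (encode_base64 adds line breaks).
sub looks_like_base64 {
    my ($string) = @_;
    return $string =~ m{\A[A-Za-z0-9+/=\s]*\z} ? 1 : 0;
}

my $raw     = "PK\x03\x04\x14\x00";   # sample binary data
my $encoded = encode_base64($raw);

printf "raw looks like base64:     %d\n", looks_like_base64($raw);
printf "encoded looks like base64: %d\n", looks_like_base64($encoded);
print  "decodes back to raw\n" if decode_base64($encoded) eq $raw;
```

If the content fails this test, it has most likely already been decoded before it reached the caller, which would fit the header describing only the transfer encoding.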

    I tried encoding and decoding using the MIME functions but to no avail.

    The content length stated is the exact size of the actual binary file (460749 bytes) but the string provided by the RT libraries is different (442958 bytes). I would be willing to believe that the missing 17791 characters are included in the wide characters in the RT string, that is to say that I expect there to be 17791 wide characters in the octet stream.
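One way to test that expectation is to count the characters above 0xFF directly. This is a sketch only: count_wide is a hypothetical helper and $content is made-up sample data, not the real attachment.

```perl
use strict;
use warnings;

# Hypothetical helper: count characters that do not fit in one byte.
sub count_wide {
    my ($string) = @_;
    my $count = () = $string =~ /[^\x00-\xFF]/g;
    return $count;
}

# Hypothetical sample standing in for the attachment content.
my $content = "PK\x03\x04\x{FFFD}g\x{FFFD}%F";

my $expected_bytes = 460749;   # Content-Length from the headers
my $actual_chars   = 442958;   # length of the string from the RT libraries
my $difference     = $expected_bytes - $actual_chars;

printf "difference: %d\n", $difference;
printf "wide characters in sample: %d of %d\n",
    count_wide($content), length($content);
```

Running count_wide on the real string and comparing the result with the difference would show whether the two numbers are actually related.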

      This is another reason why is_utf8 is a trap. It does not indicate the string is "in UTF-8". It is an internal flag that describes how Perl is internally storing the string. utf8::upgrade and utf8::downgrade enable and disable this flag respectively without any change to the string (as used in Perl code) (as long as the string can be represented in your native encoding, otherwise utf8::downgrade will croak). So in fact, the only sure thing you can determine from is_utf8 is that every Perl string with codepoints above U+FF *must* have it enabled (but not the other way around).
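A short sketch of that behaviour, using only the built-in utf8:: functions (available without loading any module). The sample strings are illustrative.

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";   # four characters, all below U+0100
my $upgraded = $plain;
utf8::upgrade($upgraded);   # changes internal storage and sets the flag

my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;
my $same_string   = $plain eq $upgraded      ? 1 : 0;

print "flag on plain:    $flag_plain\n";
print "flag on upgraded: $flag_upgraded\n";
print "strings equal:    $same_string\n";

# A string with a code point above U+FF cannot be downgraded.
# The second argument (FAIL_OK) makes downgrade return false
# instead of dying.
my $wide          = "\x{123}";
my $downgraded_ok = utf8::downgrade($wide, 1) ? 1 : 0;
print "downgrade of wide string succeeded: $downgraded_ok\n";
```

The flag differs between the first two strings while eq still reports them as identical, which is why the flag says nothing about the content.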
