Re: RT::Client turns occasional binary characters in to wide characters

Re^2: RT::Client turns occasional binary characters in to wide characters by Anonymous Monk on Oct 03, 2018 at 22:10 UTC
Please do not propagate the trap of using is_utf8 in Perl code. It does not indicate whether the string you have is UTF-8 encoded bytes. It is only an internal flag for Perl's own use and for XS code. It is possible, especially after people try hacks like this or write incomplete XS code, to have byte strings where is_utf8 is true and character strings where is_utf8 is false. I would link to some RT bugs for more reading about the issue, but the website doesn't allow me to post them.
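To make the two cases above concrete, here is a minimal sketch. The variable names are only illustrative, and `utf8::upgrade` is used purely as a convenient way to switch the internal flag on without changing the string's contents.

```perl
use strict;
use warnings;

# A character string: one character, U+00E9, stored without the flag.
my $text = "\xE9";

# A byte string: the two UTF-8 encoded bytes for U+00E9.
my $bytes = "\xC3\xA9";
utf8::upgrade($bytes);    # changes only the internal storage, not the contents

my $text_flag  = utf8::is_utf8($text)  ? 1 : 0;    # flag off, yet this is text
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;    # flag on, yet these are bytes

printf "text:  flag=%d length=%d\n", $text_flag,  length $text;
printf "bytes: flag=%d length=%d\n", $bytes_flag, length $bytes;
```

In both cases the flag says nothing about whether the string holds decoded text or encoded bytes; only the code that produced the string knows that.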
Re^3: RT::Client turns occasional binary characters in to wide characters by wardmw (Acolyte) on Oct 08, 2018 at 15:53 UTC
Thanks for the response. Given that this string that I am retrieving is actually the contents of a binary file then I should be OK to ignore anything to do with UTF8, given that my source code has no eight-bit or more characters. Working from that, I removed every reference to UTF8 subroutines from my code, but I still get this wide character complaint when I try to write the string contents out to a binary (or any) file. So I have removed one potential issue (UTF8), but the problem remains. While I take you at your word that this is not a UTF8 problem (as I understand it), it's odd that running `encode('UTF-8'...` against the string and writing the results out does not generate this wide character warning.
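One way to see why the warning appears and why encoding silences it: the warning is raised when a string containing a character above 0xFF is written to a handle with no encoding layer, and UTF-8 encoding turns every such character into several bytes. A rough sketch follows, where `count_wide_chars` and the sample strings are made up for illustration and are not part of RT::Client.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: count characters that cannot be stored in a single byte.
sub count_wide_chars {
    my ($string) = @_;
    my $count = () = $string =~ /[^\x00-\xFF]/g;
    return $count;
}

my $clean  = "PK\x03\x04\x8d";          # every character fits in one byte
my $broken = "PK\x03\x04\x{FFFD}%F";    # contains U+FFFD, which does not

printf "clean:   %d wide\n", count_wide_chars($clean);
printf "broken:  %d wide\n", count_wide_chars($broken);

# Encoding removes the wide characters, so the warning goes away,
# but the resulting bytes are not the ones from the original file.
my $encoded = encode('UTF-8', $broken);
printf "encoded: %d wide, %d bytes\n", count_wide_chars($encoded), length $encoded;
```

A non-zero count on data that is supposed to be binary suggests the damage happened before the string reached your code.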
Re^4: RT::Client turns occasional binary characters in to wide characters by haukex (Archbishop) on Oct 08, 2018 at 16:13 UTC
> Given that this string that I am retrieving is actually the contents of a binary file then I should be OK to ignore anything to do with UTF8, given that my source code has no eight-bit or more characters.

It depends on how the data is handed to you. Note how below, both byte sequences are `\304\243`, but they're getting different interpretations based on Perl's internal UTF8 flag. If the module is handing you binary data with some encoding/decoding issues or perhaps the UTF8 flag incorrectly enabled, you'll have these kinds of strange issues that may explain the presence of `U+FFFD REPLACEMENT CHARACTER` in your original hex dump. Could you show your data with Devel::Peek?

    $ perl -CSD -MDevel::Peek -le 'my $x="\x{123}"; print $x; Dump($x)'
    ģ
    SV = PV(0x1337d70) at 0x1357518
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x1359790 "\304\243"\0 [UTF8 "\x{123}"]
      CUR = 2
      LEN = 10
      COW_REFCNT = 1

    $ perl -CSD -MDevel::Peek -le 'my $x="\304\243"; print $x; Dump($x)'
    ģ
    SV = PV(0x1e28d70) at 0x1e48518
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x1e4a790 "\304\243"\0
      CUR = 2
      LEN = 10
      COW_REFCNT = 1
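The same difference can be seen from plain Perl code, without looking at the internals. This is only a sketch using the two strings from the dumps above; the variable names are illustrative.

```perl
use strict;
use warnings;

my $char   = "\x{123}";     # one character, U+0123
my $octets = "\304\243";    # two characters, U+00C4 and U+00A3

printf "char:   length=%d first=U+%04X\n", length $char,   ord $char;
printf "octets: length=%d first=U+%04X\n", length $octets, ord $octets;

# Encoding the single character to UTF-8 yields exactly the two octets.
my $encoded = $char;
utf8::encode($encoded);     # in place: characters become UTF-8 bytes
print "encoded eq octets: ", ($encoded eq $octets ? "yes" : "no"), "\n";
```

So the stored bytes can be identical while the strings, as Perl code sees them, are not.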
Re^2: RT::Client turns occasional binary characters in to wide characters by wardmw (Acolyte) on Oct 03, 2018 at 15:56 UTC
Thanks for that. According to is_utf8() the string is in UTF8; however, running encode_utf8() doesn't resolve the problem. It does remove the 4-character hex, but it doesn't restore the original bytes:

    encode_utf8() version:
    00000000  50 4B 03 04 14 00 09 00 08 00 67 EF BF BD 25 46  PK........g...%F
    Original version:
    0000000 50 4b 03 04 14 00 09 00 08 00 67 8d 25 46 00 00

I took a look at the attributes of the file, as @Veltro suggested, and got the following:

    content_type is: application/octet-stream
    content_encoding is: none
    file_name is: screenshot-172 21 242 64.zip
    headers is: Content-Type: application/octet-stream; name="screenshot-172 21 242 64.zip"
    Content-Disposition: attachment; filename="screenshot-172 21 242 64.zip"
    Content-Transfer-Encoding: base64
    Content-Length: 460749

That "base64" string in the headers section looked interesting, although the string does not seem to be encoded, insofar as it has characters in it that do not match the Base64 character set (A-Za-z0-9+/=). I tried encoding and decoding using the MIME functions, but to no avail. The content length stated is the exact size of the actual binary file (460749 bytes), but the string provided by the RT libraries has a different length (442958). I would be willing to believe that the missing 17791 characters are accounted for by the wide characters in the RT string; that is to say, I expect there to be 17791 wide characters in the octet stream.
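The `EF BF BD` in the first dump is the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER. The sketch below shows one way those bytes can arise from the original `8d`: a lenient UTF-8 decode followed by a re-encode. Whether RT::Client or the transport underneath it actually performs such a decode is an assumption here, not something the thread confirms.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Four bytes taken from the original dump: 67 8d 25 46.
my $original = "\x67\x8d\x25\x46";

# With the default CHECK value, Encode substitutes U+FFFD for malformed input.
# A lone 0x8D is not valid UTF-8, so it is replaced.
my $decoded   = decode('UTF-8', $original);
my $reencoded = encode('UTF-8', $decoded);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $reencoded;
print "$hex\n";    # 67 EF BF BD 25 46
```

If this is what happened, the original byte is gone by the time the string reaches your code, and no amount of encoding afterwards can bring it back.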
Re^3: RT::Client turns occasional binary characters in to wide characters by Anonymous Monk on Oct 03, 2018 at 22:14 UTC
This is another reason why is_utf8 is a trap. It does not indicate the string is "in UTF-8". It is an internal flag that describes how Perl is internally storing the string. utf8::upgrade and utf8::downgrade enable and disable this flag respectively, without any change to the string as seen by Perl code (as long as the string can be represented in your native encoding; otherwise utf8::downgrade will croak). So in fact, the only sure thing you can determine from is_utf8 is that every Perl string with codepoints above U+FF must have it enabled (but not the other way around).
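A short sketch of that behaviour, using arbitrary sample strings: upgrading and downgrading leave the string equal to the original, and downgrading a string with a character above U+FF dies unless the optional FAIL_OK argument is true.

```perl
use strict;
use warnings;

my $string = "caf\xE9";    # all characters fit in one byte
my $copy   = $string;

utf8::upgrade($copy);      # flag on
my $same_after_upgrade = ($copy eq $string) ? 1 : 0;

utf8::downgrade($copy);    # flag off again
my $flag_after_downgrade = utf8::is_utf8($copy) ? 1 : 0;

my $wide = "\x{263A}";     # cannot be represented in one byte

# Without FAIL_OK, downgrade dies; with it, downgrade just returns false.
my $croaked = eval { utf8::downgrade($wide); 1 } ? 0 : 1;
my $fail_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "same after upgrade:   $same_after_upgrade\n";
print "flag after downgrade: $flag_after_downgrade\n";
print "downgrade croaked:    $croaked\n";
print "FAIL_OK result:       $fail_ok\n";
```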

