PerlMonks  

How to interpret characters in Devel::Peek CUR

by ait (Hermit)
on Jun 09, 2020 at 04:38 UTC ( [id://11117850]=perlquestion )

ait has asked for the wisdom of the Perl Monks concerning the following question:

Pardon my ignorance on some of these internals, but I am going crazy transferring data through 3 different databases in different charsets through different layers (ssh, file transfers, direct SQL), etc.
Context: Trying to figure out why PHP::Serialization reports 30 as the string length of the 26 char string. When I serialize to the PHP Array I get this:

s:30:"Triple “S” Industrial Corp"

So I am trying to figure out if the bug is in PHP::Serialization or somewhere else in this crazy 3 system interface. The PHP on the target server is 7.2.10, so I am assuming it supports these UTF chars w/o issue. But what seems strange to me is that Perl and PHP would both internally report a length of 30. So before I dive into that module's code to try to understand what it's doing, I want to first understand how Perl stores this internally.

So given this string: Triple “S” Industrial Corp (note funky quotes), this is the Dump:

SV = PV(0x5584829062e0) at 0x558482ad2ee0
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0
CUR = 30
LEN = 56
COW_REFCNT = 0

What are the characters \342\200\234 (the left funky quote)?
How would I manually decode them if I wanted to? (i.e. is this a UTF-8 sequence? how do I know what they mean?)
Is this why CUR reports 30 "perl characters" instead of 26 actual characters?


Replies are listed 'Best First'.
Re: How to interpret characters in Devel::Peek CUR
by haukex (Archbishop) on Jun 09, 2020 at 07:58 UTC

    Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

    The Unicode character U+201C LEFT DOUBLE QUOTATION MARK (“) is encoded in UTF-8 as the bytes e2 80 9c (\342\200\234), and the Unicode character U+201D RIGHT DOUBLE QUOTATION MARK (”) is encoded in UTF-8 as the bytes e2 80 9d (\342\200\235).
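To see how those three bytes map back to the code point, the UTF-8 bit layout can be undone by hand. This is only a sketch for the 3-byte case (layout 1110xxxx 10xxxxxx 10xxxxxx); the variable names are made up for the example, and real code should use a decoder rather than bit-twiddling.

```perl
use strict;
use warnings;

# The left quote exactly as Devel::Peek printed it, as octal escapes.
my @bytes = unpack 'C*', "\342\200\234";

# Strip the marker bits (1110.... and 10......) and join the payload bits.
my $code_point = (($bytes[0] & 0x0F) << 12)
               | (($bytes[1] & 0x3F) << 6)
               |  ($bytes[2] & 0x3F);

my $hex_bytes = join ' ', map { sprintf('%02x', $_) } @bytes;

print  "bytes:      $hex_bytes\n";
printf "code point: U+%04X\n", $code_point;
```

The same arithmetic applied to \342\200\235 differs only in the last byte, which is why the right quote comes out one code point higher.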

    One way to think about Perl strings is that they store either a sequence of bytes or a sequence of Unicode characters. In your case, the Devel::Peek output does not include the "UTF8" flag, which means that this string is bytes, and yes, that's why you're getting a length of 30. (Update: It is important to note, however, that testing a string's UTF8 flag for anything other than debugging is code smell - your code should normally rely on the fact that you're getting strings in the correct format.)

    You can decode bytes to characters or encode characters to bytes using the Encode module, or, in the case of UTF-8, use the "built-in" utf8 module (note that you don't have to put use utf8; in your code to load it; use utf8 means "this Perl source file is encoded in UTF-8", which may or may not be what you want). You can use utf8::decode($string); to decode the string you have, and then you'll see this output:

    SV = PV(0x5584829062e0) at 0x558482ad2ee0
    REFCNT = 1
    FLAGS = (POK,pPOK,UTF8)
    PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0 [UTF8 "Triple \x{201c}S\x{201d} Industrial Corp"]
    CUR = 30
    LEN = 32

    And length will now report 26. The UTF8 flag means that the Perl string is storing Unicode characters (the fact that they're stored internally as UTF-8 should be considered an implementation detail). Almost all Perl operators (depending on the Perl version) and many Perl modules should handle Unicode correctly.
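A small sketch of that before/after difference, assuming the string arrives as the same raw bytes shown in the question (the variable names are illustrative):

```perl
use strict;
use warnings;

# The 30 bytes from the original Dump, written with octal escapes.
my $s = "Triple \342\200\234S\342\200\235 Industrial Corp";

my $length_as_bytes = length $s;        # every byte counts as one "character"

# utf8::decode works in place and returns false on malformed input.
my $decoded_ok = utf8::decode($s);

my $length_as_chars = length $s;        # now counts Unicode characters

print "before decode: $length_as_bytes\n";
print "after decode:  $length_as_chars\n";
```

The buffer itself is unchanged by the decode; only Perl's interpretation of it (and therefore length) changes, which matches CUR staying at 30 in the Dump.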

    Note that it's usually best to decode data as it's coming into Perl (e.g. specifying an open mode of '<:encoding(UTF-8)') and encode it as it leaves, and having to do this manually in your code sometimes means that the source where you're getting the data may be buggy in regards to Unicode. I don't know enough about PHP::Serialization to say if that's the case here, and the PHP serialize docs don't make any mention of Unicode either. Interestingly, the PHP String docs say "PHP only supports a 256-character set, and hence does not offer native Unicode support." So my guess is that the encoding to bytes happens somewhere before the data hits the PHP string, and then serialize and PHP::Serialization simply pass those bytes through; this means you'd have to know which encoding was used to store the Unicode data into the PHP string to correctly decode it, in the case that it's not always UTF-8.
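The "decode on the way in, encode on the way out" pattern might look roughly like this. To keep the sketch self-contained it uses in-memory filehandles as stand-ins for real files; $raw and $buffer are placeholders, not anything from the original setup.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes, standing in for the contents of a file on disk.
my $raw = "Triple \342\200\234S\342\200\235 Industrial Corp\n";

# Decode on the way in: the :encoding layer turns bytes into characters.
open my $in, '<:encoding(UTF-8)', \$raw or die "open for read failed: $!";
my $line = <$in>;
close $in;
chomp $line;
my $char_count = length $line;     # counts characters

# Encode on the way out: the same layer turns characters back into bytes.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write failed: $!";
print {$out} $line;
close $out;
my $byte_count = length $buffer;   # counts bytes

print "characters read: $char_count, bytes written: $byte_count\n";
```

Inside the program the string is only ever handled as characters; the byte representation exists only at the two I/O boundaries.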

    As a general note, if you're working with Unicode it's best to be on the latest version of Perl and to put a use 5.030; at the top of the file to enable all of its features.

      Thanks a lot for this detailed answer!

      I read the doc you recommended and although I knew some of the stuff in there, it is definitely a great read and clarifies part of the untold story. It also helped me understand your answer better, for example:

      the fact that they're stored internally as UTF-8 should be considered an implementation detail

      I think I found what seems to be the root cause of the issue:

      We are pulling data from an SQL Server database that is encoded in CP-1252 and we are using the DBI with the MS ODBC Driver for Linux version 13. It seems they are inserting UTF-8 data into that SQL Server, so when we get the data back in Perl the UTF-8 flag is not set (even though some records actually contain UTF-8 characters).

      When we insert that data into our UTF-8 PostgreSQL database, it seems to get double encoded. Also, some of these flawed records have a null terminator at the end too, which doesn't seem to affect the utf8 flag but it does mess up our trimming (the SQL Server char strings are padded with whitespace).

      Data from SQL Server

      SV = PV(0x560c8bacdf90) at 0x560c8b9b7998
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x560c8bbfad00 "Triple \342\200\234S\342\200\235 Industrial Corp \0"\0
      CUR = 51
      LEN = 53
      COW_REFCNT = 1

      Data after being Stored in Postgres (and retrieved)

      SV = PV(0x560c8bacdec0) at 0x560c8bb7c7b0
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x560c8bbf38f0 "RAW: Triple \303\242\302\200\302\234S\303\242\302\200\302\235 Industrial Corp "\0 [UTF8 "RAW: Triple \x{e2}\x{80}\x{9c}S\x{e2}\x{80}\x{9d} Industrial Corp "]
      CUR = 61
      LEN = 63
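The doubled sequence in the Postgres dump is consistent with each undecoded byte being treated as a character in its own right and then encoded to UTF-8 a second time. The sketch below reproduces that for the left quote only; it is a guess at the mechanism based on the dumps, not a trace of what the drivers actually do.

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 bytes for U+201C that were never decoded (no UTF8 flag).
my $undecoded = "\342\200\234";

# Encoding again treats each byte as a character (E2, 80, 9C)
# and emits two bytes for each of them.
my $double_encoded = encode('UTF-8', $undecoded);

# Render the result in the same octal notation Devel::Peek uses.
my $octal = join '', map { sprintf('\\%03o', $_) } unpack 'C*', $double_encoded;

print "double encoded: $octal\n";
```

Decoding the string before it is handed to the second database avoids this, because the encoder then sees one character (U+201C) instead of three.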

      Using utf8::decode on the string before storing into Postgres actually solves the issue. So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?

        As I previously commented here, much depends on the way you set up the connection, and there is still room to play with server-side encodings:

        I recently worked from perl on Linux with a MS SQL server database, and got the best results with FreeTDS:

        my $dbh = DBI->connect ("dbi:ODBC:mssql_freetds", $username, $password, \%dbi_attributes);

        $ cat ~/.odbc.ini
        [mssql_freetds]
        Description = My MS SQL database
        Driver = FreeTDS
        TDS version = 7.2
        Trace = No
        Server = mysql.server.local
        Port = 1433
        Database = DatabaseName
        User = UserName
        Password = PassWord
        Client Charset = UTF-8

        The biggest difference between FreeTDS and the MS ODBC driver is the return type of UUID field. The MS ODBC does not allow nested queries, whereas the FreeTDS driver does. So I used the ODBC driver to make a CSV dump of the database and the FreeTDS driver to actually work with the database.

        For ODBC I did

        my $dbh = DBI->connect ("dbi:ODBC:mssql_odbc", $username, $password, \%dbi_attributes);

        $ cat ~/.odbc.ini
        [mssql_odbc]
        Description = My MS SQL database
        Driver = ODBC Driver 17 for SQL Server
        Server = mysql.server.local
        Database = DatabaseName
        User = UserName
        Password = PassWord

        Also make sure you put the fully qualified hostname in the server name. localhost will not work.


        Enjoy, Have FUN! H.Merijn
        So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?

        It's too bad that the server is out of your control, since that seems to be the source of the problem. But anyway, yes, I think fixing the issue as early as possible - as you pull the data off the server - is the "best" (relatively) way to go about it. Two things to keep in mind: Make sure that all the data really is UTF-8, and check the return value of utf8::decode(), because if that fails, then there's definitely something wrong with the encoding. But keep in mind that false negatives (e.g. data that is actually CP-1252 but also decodes as UTF-8) are possible, though somewhat unlikely.
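One way to apply that advice is a small wrapper that decodes a copy, checks the return value, and also removes the padding and trailing NUL mentioned earlier in the thread. The helper name decode_db_string is hypothetical, and returning undef on failure is just one possible policy.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded, trimmed string,
# or undef if the input is not well-formed UTF-8.
sub decode_db_string {
    my ($raw) = @_;
    return undef unless defined $raw;
    my $text = $raw;                       # work on a copy
    utf8::decode($text) or return undef;   # false means invalid UTF-8
    $text =~ s/[\s\0]+\z//;                # drop padding and trailing NULs
    return $text;
}

# UTF-8 bytes with CHAR padding and a stray NUL, as in the SQL Server dump.
my $good = decode_db_string("Triple \342\200\234S\342\200\235 Industrial Corp   \0");

# The same text with genuine CP-1252 quotes (0x93, 0x94): not valid UTF-8.
my $bad  = decode_db_string("Triple \x93S\x94 Industrial Corp");

printf "good: %d characters\n", length $good;
print  "bad:  ", (defined $bad ? 'decoded' : 'rejected'), "\n";
```

Rejected values can then be logged or decoded with a fallback encoding instead of being silently stored.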

Re: How to interpret characters in Devel::Peek CUR
by kcott (Archbishop) on Jun 09, 2020 at 05:37 UTC

    G'day ait,

    The characters, “ and ”, are U+201C and U+201D. The numbers \342\200\234 and \342\200\235 are the octal values of the bytes that make up those characters.

    You can break those characters into their constituent bytes and check the octal values like this:

    $ perl -C -E '
        my $x = "\x{201c}S\x{201d}";
        say $x;
        { use bytes; printf "%vo\n", $x; }
    '
    “S”
    342.200.234.123.342.200.235

    See also: bytes noting the emboldened warning; and the vector flag information in sprintf.

    — Ken

      The characters, “ and ”, are U+201C and U+201D. The numbers \342\200\234 and \342\200\235 are the octal values of the bytes that make up those characters.

      Sorry, but this leaves out a very important bit: these are the bytes that make up the characters when encoded as UTF-8.
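To make that point concrete, the same character yields different bytes under different encodings. A short sketch using Encode, with the three encodings picked purely for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{201c}";   # LEFT DOUBLE QUOTATION MARK

# Map each encoding name to the hex bytes it produces for the character.
my %hex;
for my $encoding ('UTF-8', 'UTF-16BE', 'cp1252') {
    my $bytes = encode($encoding, $char);
    $hex{$encoding} = join ' ', map { sprintf('%02x', $_) } unpack 'C*', $bytes;
    printf "%-8s %s\n", $encoding, $hex{$encoding};
}
```

Only the UTF-8 row matches the \342\200\234 (e2 80 9c) seen in the Dump, which is how one can tell which encoding the stored bytes are in.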

        What he said :).

        In EBCDIC land you'd get something completely different:

        $ perl -MData::Peek -wE'say $^O;DPeek ("\x{201c}"); DPeek ("\x{201d}")'
        os390
        PV("\312\101\160"\0) [UTF8 "\x{201c}"]
        PV("\312\101\161"\0) [UTF8 "\x{201d}"]

        Enjoy, Have FUN! H.Merijn

      Wow, thanks for the use bytes trick! Curiously I use Perl for another project where I translate REST into Modbus and I use a lot of pack and unpack, but I never used bytes before. Thanks!!

      Thank you kcott ! The bytes nugget was a great tip!

Re: How to interpret characters in Devel::Peek CUR
by ikegami (Patriarch) on Jun 09, 2020 at 17:07 UTC

    So given this string: Triple “S” Industrial Corp (note funky quotes)

    More precisely, you have this text encoded using UTF-8.

    What are the characters \342\200\234 (the left funky quote)

    Octal escape sequences that produce the bytes that form the encoding of «“» using UTF-8.

    use feature qw( say );
    use Encode qw( encode );

    say encode("UTF-8", "\N{LEFT DOUBLE QUOTATION MARK}") eq "\342\200\234";   # Output: 1

    How would I manually decode them if I wanted to ?

    You could use
    utf8::decode($s);

    If this string was constructed from a string literal, then you should have used the following to tell Perl the source was encoded using UTF-8 instead of ASCII:

    use utf8;

    If this is read from a file, an encoding layer would do this automatically for you. You can set this up using

    use open ':std', ':encoding(UTF-8)';

    Is this why CUR reports 30 "perl characters" instead of 26 actual characters?

    The string has 30 characters, not 26. You can verify this using length. If you were to decode those 30 bytes, you would get 26 Unicode Code Points, but that would be a different string, and length would return 26.

    use feature qw( say );
    use Encode qw( decode );
    no utf8;

    my $utf8 = "Triple “S” Industrial Corp";
    say length($utf8);                   # 30 chars

    my $ucp = decode("UTF-8", $utf8);
    say length($ucp);                    # 26 chars

    That said, CUR indicates the number of bytes of the string buffer that are being used, not the number of characters in the string. They just happen to be the same for your string.

    use feature qw( say );
    use Encode qw( decode );
    use Devel::Peek qw( Dump );
    no utf8;

    my $utf8 = "Triple “S” Industrial Corp";
    say length($utf8);                   # 30 chars
    Dump($utf8);                         # CUR = 30

    my $ucp = decode("UTF-8", $utf8);
    say length($ucp);                    # 26 chars
    Dump($ucp);                          # CUR = 30

    Because we called length before Dump, you'll see the PERL_MAGIC_utf8 (w) magic was added to cache the length (MG_LEN = 26).

      Thanks ikegami for taking the time to show TMTOWTDI with the built-in utf8 and with Encode!

Node Type: perlquestion [id://11117850]
Approved by kcott
Front-paged by Corion