 
PerlMonks  

Re^3: JSON::XS Cyrillic unicode not saving properly

by choroba (Cardinal)
on Mar 21, 2021 at 16:08 UTC


in reply to Re^2: JSON::XS Cyrillic unicode not saving properly
in thread JSON::XS Cyrillic unicode not saving properly

If you're using a recent version of DBD::Pg, you shouldn't need to set the parameter at all. The only way I was able to get the wrong string back was to set it to 0 without setting the binmode of STDOUT. Try experimenting with the following script:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use feature qw{ say };

use DBI;
use Encode;

for my $utf8 (1, -1, 0) {
    my $string = $utf8 ? 'Кирилл цагаан толгой'
                       : encode('UTF-8', 'Кирилл цагаан толгой');

    my $db = 'DBI'->connect('dbi:Pg:dbname=postgres', "", "");
    say $db->{pg_enable_utf8} = $utf8;

    $db->do('CREATE TABLE IF NOT EXISTS cyr (t TEXT)');
    $db->do('DELETE FROM cyr');

    my $insert = $db->prepare('INSERT INTO cyr (t) VALUES (?)');
    $insert->execute($string);

    my $select = $db->prepare('SELECT t FROM cyr');
    $select->execute;
    binmode *STDOUT, $utf8 ? ':encoding(UTF-8)' : ':raw';
    while (my @row = $select->fetchrow_array) {
        say @row;
    }
}

The script needs to be saved as UTF-8.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Re^4: JSON::XS Cyrillic unicode not saving properly
by cormanaz (Deacon) on Mar 21, 2021 at 20:28 UTC
    Output is three lines of Cyrillic (I don't know how to get this website to accept Cyrillic chars). Same thing when I write to a file instead of STDOUT.

    I think the database transaction is not the problem. When I run Lingua::Identify on the items in the DB then compare the detected language to the item in the DB, it's getting the languages correct. It couldn't do that if the DB was returning non-Cyrillic characters, right?

    This suggests the problem is in the output. But I am setting binmode on the output file with the correct encoding, just like in your example. I also tried declaring UTF-8 encoding in the open statement, but am still getting the ANSI output. This is perplexing.

Re^4: JSON::XS Cyrillic unicode not saving properly
by cormanaz (Deacon) on Mar 21, 2021 at 23:36 UTC
    Aha! I have verified that it's not the db transaction by running the code in a visual debugger. After the query, @items contains tweets in Cyrillic. I printed those out to a flat file, and it opens as Cyrillic. The problem is in JSON::XS.

    The docs say you have to use the OO interface and enable utf8 encoding. I tried doing this by changing the print statement to print OUT JSON::XS->new->utf8->encode($sample); but that still produces a JSON file with ASCII characters. The docs on the OO interface are a little confusing. Anyone know the right way to do this?
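    A minimal sketch of what the utf8 flag in the OO interface changes. It assumes the core module JSON::PP as a stand-in for JSON::XS (the two are documented as interface-compatible), and the sample data is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                 # this source file is saved as UTF-8
use JSON::PP ();          # assumption: stand-in for JSON::XS, same methods

my $data = ['Кирилл'];    # hypothetical sample, 6 Cyrillic characters

# No flag: the result is a string of characters; the output handle
# is expected to encode it (e.g. via an :encoding(UTF-8) layer).
my $chars = JSON::PP->new->encode($data);

# utf8: the result is already UTF-8 encoded bytes; print it to a
# handle that does NOT encode again.
my $bytes = JSON::PP->new->utf8->encode($data);

# ascii: every non-ASCII character becomes a \uXXXX escape, so the
# result is safe on any handle.
my $ascii = JSON::PP->new->ascii->encode($data);

printf "without utf8: %d characters\n", length $chars;
printf "with utf8:    %d bytes\n",      length $bytes;
print  "with ascii:   $ascii\n";
```

    The character string and the byte string describe the same JSON document; they differ only in who is expected to do the encoding, the JSON object or the file handle.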

      Hi,

      Here you go

      perlunitut: Unicode in Perl#I/O flow (the actual 5 minute tutorial)

      Now ask your program (or mine): who is doing the byte encoding (what statement, what function/method)?

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Data::Dump qw/ dd /;
      use Path::Tiny qw/ path /;
      use JSON::XS();
      use JSON::PP();

      my @humps = "\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}";

      dd( JSON::XS->new->pretty(1)->encode( \@humps ) );
      dd( JSON::PP->new->pretty(1)->encode( \@humps ) );
      dd( JSON::XS->new->utf8(1)->pretty(1)->encode( \@humps ) );
      dd( JSON::PP->new->utf8(1)->pretty(1)->encode( \@humps ) );
      dd( JSON::XS->new->ascii(1)->pretty(1)->encode( \@humps ) );
      dd( JSON::PP->new->ascii(1)->pretty(1)->encode( \@humps ) );
      print "#" x 6, "\n";

      path( 'deleteme.txt')->spew_raw( JSON::XS->new->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_raw );
      path( 'deleteme.txt')->spew_raw( JSON::PP->new->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_raw );
      print "#" x 6, "\n";

      path( 'deleteme.txt')->spew_utf8( JSON::XS->new->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_raw );
      dd( path( 'deleteme.txt')->slurp_utf8 );
      path( 'deleteme.txt')->spew_utf8( JSON::PP->new->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_raw );
      dd( path( 'deleteme.txt')->slurp_utf8 );
      print "#" x 6, "\n";

      path( 'deleteme.txt')->spew_utf8( JSON::XS->new->utf8(1)->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_utf8 );
      path( 'deleteme.txt')->spew_utf8( JSON::PP->new->utf8(1)->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_utf8 );
      print "#" x 6, "\n";

      path( 'deleteme.txt')->spew_utf8( JSON::XS->new->ascii(1)->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_utf8 );
      path( 'deleteme.txt')->spew_utf8( JSON::PP->new->ascii(1)->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_utf8 );
      print "#" x 6, "\n";

      path( 'deleteme.txt')->spew_raw( JSON::XS->new->utf8(1)->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_utf8 );
      path( 'deleteme.txt')->spew_raw( JSON::PP->new->utf8(1)->pretty(1)->encode( \@humps ) );
      dd( path( 'deleteme.txt')->slurp_utf8 );

      # path( 'deleteme.txt')->remove;
      __END__
      "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
      "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
      "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
      ######
      Wide character in print at C:/perl/site/lib/Path/Tiny.pm line 1848.
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      Wide character in print at C:/perl/site/lib/Path/Tiny.pm line 1848.
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      ######
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
      ######
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
      ######
      "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
      "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
      ######
      "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
      "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"

      "🐪 one hump two humps 🐫"
        I have no idea what I am supposed to make of that example.
      After some more doc-diving, I discovered the problem: if you use encode_json (which is equivalent to $json_text = JSON::XS->new->utf8->encode($perl_scalar)) AND set the file encoding to utf8, then the text gets double-encoded. When I do this

      open(OUT,">twitter-non-en.json") or die "Can't open output: $!";
      #binmode OUT, ':encoding(UTF-8)';
      print OUT encode_json($sample);
      close OUT;

      The JSON contains the expected Cyrillic text. Thanks for all the input.
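      A sketch of the double-encoding described above, under a few assumptions: the core JSON::PP stands in for JSON::XS (its encode_json is likewise documented as equivalent to new->utf8->encode), in-memory file handles replace the real output file, and written_bytes is a hypothetical helper that exists only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # this source file is saved as UTF-8
use JSON::PP qw(encode_json);      # assumption: stand-in for JSON::XS
use Encode qw(decode);

my $sample = { text => 'Кирилл' }; # hypothetical sample data

# Hypothetical helper: print $string to an in-memory file opened with
# the given mode and return the bytes that ended up "on disk".
sub written_bytes {
    my ($mode, $string) = @_;
    my $buffer = '';
    open my $out, $mode, \$buffer or die "Can't open in-memory file: $!";
    print {$out} $string;
    close $out or die "Can't close in-memory file: $!";
    return $buffer;
}

# Correct pairing 1: bytes from encode_json, handle without a layer.
my $raw_ok = written_bytes('>', encode_json($sample));

# Correct pairing 2: characters from the JSON object, encoding layer.
my $layer_ok = written_bytes('>:encoding(UTF-8)',
                             JSON::PP->new->encode($sample));

# Broken pairing: encode_json already produced UTF-8 bytes, and the
# layer encodes each of those bytes a second time.
my $double = written_bytes('>:encoding(UTF-8)', encode_json($sample));

print "both correct pairings agree: ", ($raw_ok eq $layer_ok ? "yes" : "no"), "\n";
print "double-encoded output is longer: ",
      (length($double) > length($raw_ok) ? "yes" : "no"), "\n";

# Decoding the broken output once gives back the correct bytes.
print "one decode undoes the damage: ",
      (decode('UTF-8', $double) eq $raw_ok ? "yes" : "no"), "\n";
```

      Either correct pairing works; the point is that exactly one side, the JSON encoder or the file handle, should turn characters into UTF-8 bytes.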
