JSON::XS Cyrillic unicode not saving properly

by cormanaz (Deacon)
on Mar 20, 2021 at 23:01 UTC

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I have some Twitter data in Cyrillic stored in a database. They show up as Cyrillic chars in pgAdmin. I am trying to extract some records and save them to a .json file, like so:
#!/usr/bin/perl -w
use JSON::XS;
use Lingua::Identify qw(:language_identification);
#use lib qw(/home/corman/perlmodules);
#use SqlSupport;

my $dbh = connectpgdb('****','****','****','Pg','localhost');
my @items = getsqlcol($dbh,"select tweet_text from twitter order by random() limit 10000");
my $sample;
my $idx = 0;
foreach my $tweet (@items) {
    my $lang = langof($tweet);
    if ($lang =~ /ru|bg|uk/) {
        $sample->[$idx]->{text} = $tweet;
        $sample->[$idx]->{lang} = $lang;
        $sample->[$idx]->{len} = length($tweet);
        $idx++;
    }
}
print "$idx items\n";
open(OUT,">twitter-non-en.json") or die "Can't open output: $!";
binmode OUT, ':utf8';
print OUT encode_json($sample);
close OUT;

sub connectpgdb {
    # this is used to connect with DBD::Pg
    my ($database,$user,$password,$driver,$server) = @_;
    my $url = "DBI:$driver:dbname=$database;host=$server;port=5432";
    my $dbh = DBI->connect( $url, $user, $password,{AutoCommit=>1,RaiseError=>1,PrintError=>0}) or die "connectdb can't connect to psql: $!\n";
    return $dbh;
}

sub getsqlcol {
    my ($dbh,$sqlstatement)= @_;
    my @results = ();
    my $sth = $dbh->prepare($sqlstatement);
    my @col;
    $sth->execute || die "Could not execute MySQL statement: $sqlstatement";
    while (@col=$sth->fetchrow_array) {
        push(@results,$col[0]);
    }
    return @results;
}
When I open the resulting .json in Firefox to inspect, the text fields are not Cyrillic but look like this: УдаÑ\u…\u0081номÑ\u0083. I have this problem not just with JSON::XS but when saving to .txt files, etc. What am I doing wrong?

Replies are listed 'Best First'.
Re: JSON::XS Cyrillic unicode not saving properly
by choroba (Cardinal) on Mar 20, 2021 at 23:46 UTC
    You don't seem to use pg_enable_utf8.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      I added $dbh->{pg_enable_utf8} = 1; and am getting the same result. Is that the correct way to do it?
        If you're using a recent version of DBD::Pg, you shouldn't need to set the parameter at all. The only way I was able to get the wrong string back was to set it to 0 without setting the binmode of STDOUT. Try experimenting with the following script:
        #!/usr/bin/perl
        use warnings;
        use strict;
        use utf8;
        use feature qw{ say };
        
        use DBI;
        use Encode;
        
        for my $utf8 (1, -1, 0) {
            my $string = $utf8 ? 'Кирилл цагаан толгой'
                               : encode('UTF-8', 'Кирилл цагаан толгой');
        
            my $db = 'DBI'->connect('dbi:Pg:dbname=postgres', "", "");
            say $db->{pg_enable_utf8} = $utf8;
        
            $db->do('CREATE TABLE IF NOT EXISTS cyr (t TEXT)');
            $db->do('DELETE FROM cyr');
        
            my $insert = $db->prepare('INSERT INTO cyr (t) VALUES (?)');
            $insert->execute($string);
        
            my $select = $db->prepare('SELECT t FROM cyr');
            $select->execute;
            binmode *STDOUT, $utf8 ? ':encoding(UTF-8)' : ':raw';
            while (my @row = $select->fetchrow_array) {
                say @row;
            }
        }
        

        The script needs to be saved as UTF-8.
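
One more thing that may be worth ruling out, independent of the database settings: encode_json returns UTF-8 encoded bytes rather than a character string, so printing its result through a handle that has a :utf8 layer encodes the data a second time. The sketch below is only an illustration of that effect, not a diagnosis of the original script. It uses in-memory filehandles instead of real files and the core module JSON::PP as a stand-in for JSON::XS (both export encode_json); the sample word and the variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # the string literal below is UTF-8 source text
use JSON::PP qw(encode_json);    # core stand-in; JSON::XS exports the same function

# Illustrative sample data, not taken from the original table.
my $data = [ { text => 'Удачи' } ];

# encode_json returns UTF-8 encoded bytes, not a character string.
my $bytes = encode_json($data);

# Write the same bytes to two in-memory "files" with different layers.
my ($double, $single) = ('', '');

open my $bad, '>:utf8', \$double or die "Can't open in-memory file: $!";
print {$bad} $bytes;             # the layer encodes the bytes a second time
close $bad;

open my $good, '>:raw', \$single or die "Can't open in-memory file: $!";
print {$good} $bytes;            # the bytes are written unchanged
close $good;

printf "raw layer:   %d bytes\n", length $single;
printf ":utf8 layer: %d bytes\n", length $double;
```

If the second count is larger than the first, the output was encoded twice. Assuming that is what happens in the original script, two ways out would be to write the result of encode_json to a handle without an encoding layer, or to build the encoder with JSON::XS->new (without the utf8 option) so that encode returns characters and the layer on the handle does the only encoding step.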

