malformed UTF-8 character in JSON string in perl

by Yllar (Novice)
on Aug 11, 2015 at 11:58 UTC ( [id://1138138]=perlquestion )

Yllar has asked for the wisdom of the Perl Monks concerning the following question:

use utf8;
use Encode qw(encode_utf8);
use JSON;
use Data::Dumper;

my $data = qq( { "cat" : "text – abcd" } );
my $json_data = encode_utf8( $data );
my $perl_hash = decode_json( $json_data );
print Dumper($perl_hash);

I am getting the following error when executing the code:

$VAR1 = { 'cat' => "text \x{2013} abcd" };

I need the output to look like "text – abcd". Is there any module other than Text::Unidecode, or a method of converting characters like ',",-,.,? to simple ASCII characters? Any help from you guys would be greatly appreciated.

Replies are listed 'Best First'.
Re: malformed UTF-8 character in JSON string in perl
by Corion (Patriarch) on Aug 11, 2015 at 12:13 UTC

    What's wrong with Text::Unidecode?

    Also, why/how is that UTF-8 character malformed? Are you sure that your script is in the correct encoding for your stanza to work? I prefer to be explicit when using characters above 127 in my scripts and use \N{PILE OF POO} instead of inserting the character verbatim into the source code in the hope that it won't get mangled.
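
Corion's suggestion of spelling the character out instead of pasting it into the source can be sketched as follows. This is a minimal illustration, not code from the thread: the variable names and the `same_dash` helper are made up, and `use charnames ':full'` is only needed on perls older than 5.16.

```perl
use strict;
use warnings;
use charnames ':full';   # required for \N{...} before Perl 5.16

# Three spellings of the same one-character string. None of them
# depends on how the source file itself is encoded.
my $by_name = "\N{EN DASH}";
my $by_code = "\x{2013}";
my $by_chr  = chr(0x2013);

# Hypothetical helper: true when all three spellings agree.
sub same_dash {
    return $by_name eq $by_code && $by_code eq $by_chr;
}

# Only ASCII is printed, so no output layer is needed here.
printf "U+%04X, length %d\n", ord($by_name), length($by_name);
```

Because the script contains only ASCII, it behaves the same with or without `use utf8`.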

    Upon rereading your post, if you use Data::Dumper, the escaping of the output string is a feature. Maybe you can be more explicit in what output you get and what output you really want, and whether the output should be ASCII or UTF-8, and whether the input should be ASCII or UTF-8.

      There is nothing wrong with Text::Unidecode, but I don't want to use this module. I would like to convert Unicode characters such as single quotes, hyphens, double quotes and bullets in my input data to ASCII characters.

      Can you please print the two statements below and look at the output?

      1) "text - abcd"
      2) "text – abcd"

      When you print them, you will get the same output for the first one. But for the second one you will get output like "text ΤΗτ abcd", which I do not want.

        There is nothing wrong with Text::Unidecode, but I don't want to use this module.
        Then you are SOL, aren't you? This is the easiest way to accomplish what you're after...So what? Is this homework?
Re: malformed UTF-8 character in JSON string in perl
by Laurent_R (Canon) on Aug 11, 2015 at 12:47 UTC
    Your problem does not seem to be related to JSON. I get the same without using JSON:
    $ perl -E '
    use utf8;
    use Data::Dumper;
    my $data = qq( { "cat" : "text – abcd" } );
    print Dumper($data);
    '
    $VAR1 = " { \"cat\" : \"text \x{2013} abcd\" } ";
    If I do not use Data::Dumper, I obtain this:
    $ perl -E 'use utf8; my $data = qq( { "cat" : "text – abcd" } ); say $data '
    Wide character in say at -e line 4.
     { "cat" : "text – abcd" }
    i.e. a warning but the right output. As previously mentioned by Corion, it is a feature of Data::Dumper to display such characters as escape sequences (\x{2013} is the code point of the en dash, not its UTF-8 bytes).

    Using binmode, as I already told you several days ago in an answer to your previous post with the same content, I no longer have any warning:

    $ perl -E 'use utf8; my $data = qq( { "cat" : "text – abcd" } ); binmode STDOUT, ":utf8"; say $data; '
     { "cat" : "text – abcd" }
    Have you tried binmode?
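
What an output layer does to the string can be made visible without involving a terminal at all. The sketch below is an illustration under assumptions, not code from the thread: it prints to an in-memory filehandle (the `$buffer` and `$fh` names are made up) with an `:encoding(UTF-8)` layer, then inspects the bytes that would have been sent to STDOUT.

```perl
use strict;
use warnings;

my $text = "text \x{2013} abcd";   # one en dash, written as an escape

# Capture what an :encoding(UTF-8) layer would send to the terminal
# by printing to an in-memory filehandle instead of STDOUT.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$fh} $text;
close($fh);

# $text is counted in characters, $buffer in encoded bytes.
printf "characters: %d, bytes: %d\n", length($text), length($buffer);

# Show the three bytes that represent the en dash.
printf "dash bytes: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, substr($buffer, 5, 3);
```

This should report 11 characters, 13 bytes, and the byte sequence E2 80 93 for the dash. Whether those bytes then show up as a dash depends on the terminal interpreting them as UTF-8.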

      I am not sure how you got the output "{ "cat" : "text – abcd" }". I just tried with binmode, but I got the same error again (see below).

      text ΤΗτ abcd

      Would you please try this code and confirm again?
      use strict;
      use warnings;
      use utf8;
      use JSON;
      my $data = "text – abcd";
      binmode STDOUT, ":utf8";
      print "$data";

      Please remember that the character in the statement ("text – abcd") is not a regular dash.

        I know it is not a regular dash, and I didn't have a regular dash in my 3 examples above, as can be seen in the first one with Data::Dumper (showing the escape sequence), and also in the second one displaying the warning about a wide character.

        This is the same example, first with a regular dash and then with an "irregular" one, using Data::Dumper first to show the escape sequence for the irregular dash, and then showing what I get with binmode:

        $ perl -e '
        use strict;
        use warnings;
        use utf8;
        use Data::Dumper;
        my $data = "regular dash - other type of beast: – abcd";
        print Dumper $data;
        binmode STDOUT, ":utf8";
        print "$data";
        '
        $VAR1 = "regular dash - other type of beast: \x{2013} abcd";
        regular dash - other type of beast: – abcd
        If this does not work for you and it does for me, I would suspect that either your version of Perl is too old (I am using 5.14) or that there is something wrong in your terminal configuration.
        binmode STDOUT, ":utf8";
        But is your terminal in UTF-8 mode? Try running perl yourscript.pl > utf8.txt and opening utf8.txt as a UTF-8 text file. If you need to output Unicode characters to the terminal, try Encode::Locale and binmode STDOUT, ":encoding(console_out)" (assuming that your terminal uses an encoding which does have these Unicode characters; on Windows you may want to run chcp 65001 first).
Re: malformed UTF-8 character in JSON string in perl
by ikegami (Patriarch) on Aug 11, 2015 at 14:50 UTC

    First of all,

    decode_json(encode_utf8($json))
    is silly since it's effectively
    from_json(decode_utf8(encode_utf8($json)))

    That's why we told you that you should be using from_json.

    use utf8;
    use JSON qw( from_json );

    my $json = q({ "cat" : "text – abcd" });
    my $data = from_json($json);
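
The difference between the two entry points comes down to whether the decoder is handed characters or UTF-8 octets. The sketch below illustrates this with the core module JSON::PP rather than JSON, on the assumption that the two behave alike for this case; the object-style calls stand in for from_json (no utf8 option) and decode_json (utf8 option enabled), and the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use JSON::PP;   # core module, used here in place of JSON (assumption)

my $json_chars = qq({ "cat" : "text \x{2013} abcd" });   # decoded text
my $json_bytes = encode_utf8($json_chars);               # UTF-8 octets

# Without ->utf8 the decoder expects a character string.
my $from_chars = JSON::PP->new->decode($json_chars);

# With ->utf8 the decoder expects UTF-8 octets and decodes them itself.
my $from_bytes = JSON::PP->new->utf8->decode($json_bytes);

# Both routes end at the same Perl string holding a real U+2013.
print "same value\n" if $from_chars->{cat} eq $from_bytes->{cat};
printf "U+%04X\n", ord(substr($from_chars->{cat}, 5, 1));
```

Encoding the text only to have the decoder undo it is the round trip the reply calls silly; starting from characters skips both steps.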

    As for your new question, U+2013 EN DASH ("–") isn't found in ASCII's character set, so it can't be encoded using ASCII.

    But unless your terminal uses ASCII, there's no reason to convert to ASCII. All you need to do is tell Perl what encoding your terminal expects. You can either encode explicitly, or you can use the "open" pragma as follows:

    use open ':std', ':encoding(UTF-8)';
    print($data->{cat}, "\n");

    The whole script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;                             # Source encoded using UTF-8.
    use open ':std', ':encoding(UTF-8)';  # Terminal expects UTF-8.
    use JSON qw( from_json );

    my $json = q({ "cat" : "text – abcd" });
    my $data = from_json($json);
    print($data->{cat}, "\n");

      Hi ikegami,

      Thanks for your help. I just executed your script; unfortunately it gives the same output ("text ΤΗτ abcd").

        I repeat:

        All you need to do is tell Perl what encoding your terminal expects.

        You specified the wrong encoding. You told Perl your terminal expects UTF-8, but that's wrong. Use the right encoding!
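
Why a wrong encoding produces three junk characters instead of one dash can be reproduced with Encode alone. This is a sketch under a stated assumption: cp437 is picked purely as an example of a non-UTF-8 terminal code page, since the asker's actual code page is not given in the thread, and the exact junk characters differ from one code page to another.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $dash = "\x{2013}";   # EN DASH

# What Perl sends when told that the terminal expects UTF-8.
my $utf8_bytes = encode('UTF-8', $dash);

# What a terminal displays if it reads those bytes as cp437 instead.
# (cp437 is an assumed example, not the asker's known setting.)
my $misread = decode('cp437', $utf8_bytes);

# Print code points rather than the characters themselves, so this
# script's own output stays pure ASCII.
printf "sent %d bytes, terminal shows %d characters: %s\n",
    length($utf8_bytes), length($misread),
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
```

Each of the three UTF-8 bytes is displayed as a separate character of the terminal's code page, which matches the shape of the three-character garbage reported above.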

Re: malformed UTF-8 character in JSON string in perl
by Anonymous Monk on Aug 11, 2015 at 12:21 UTC

    This post is identical (except formatting) to your last post Malformed UTF-8 character from last week. Please read and respond to the replies there instead of re-posting.

      His last post was actually quite different when the answers were posted. The answers there do not answer his current question.
Re: malformed UTF-8 character in JSON string in perl
by tangent (Parson) on Aug 11, 2015 at 14:49 UTC
    Is [there] any module other than Text::Unidecode, or a method of converting these characters like ',",-,.,? to simple ASCII characters?
    Why do you not want to use Text::Unidecode? It does exactly what you are requesting:
    use utf8;
    use JSON;
    use Encode qw(encode_utf8);
    use Text::Unidecode;

    my $data = qq( { "cat" : "text – abcd “ ’ ” ‘" } );
    my $json_data = encode_utf8( $data );
    my $perl_hash = decode_json( $json_data );
    while ( my($k,$v) = each %$perl_hash ) {
        unidecode($v);
        print "$k => $v\n";
    }
    Output:
    cat => text - abcd " ' " '
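
For readers who, like the asker, want to avoid Text::Unidecode, the same idea can be approximated with a hand-written lookup table. The sketch below is a hypothetical, core-only illustration: the `%ascii_for` table and the `to_ascii_punct` name are made up, the replacement chosen for each character is an assumption, and the table covers only the punctuation mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical table covering only the characters discussed here.
# Text::Unidecode handles far more than this.
my %ascii_for = (
    "\x{2013}" => '-',    # EN DASH
    "\x{2014}" => '--',   # EM DASH
    "\x{2018}" => "'",    # LEFT SINGLE QUOTATION MARK
    "\x{2019}" => "'",    # RIGHT SINGLE QUOTATION MARK
    "\x{201C}" => '"',    # LEFT DOUBLE QUOTATION MARK
    "\x{201D}" => '"',    # RIGHT DOUBLE QUOTATION MARK
    "\x{2022}" => '*',    # BULLET (replacement chosen arbitrarily)
);

# Replace each non-ASCII character that has a table entry and leave
# every other character untouched.
sub to_ascii_punct {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $ascii_for{$1} ? $ascii_for{$1} : $1/ge;
    return $text;
}

print to_ascii_punct("text \x{2013} abcd \x{201C}quoted\x{201D}"), "\n";
```

Characters missing from the table pass through unchanged, so the result is only guaranteed to be ASCII if the table covers everything in the input.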

      Hi tangent,

      Which of the statements below did you use in your code?

      1) "text - abcd “

      2) "text – abcd “

      Would you please try with the second statement and see what output you get?

        I used the second one. It might be easier to use the names - in the following I input plain dash "-", en dash "–" and em dash "—" and the output is plain dash, plain dash, double plain dash:
        "text - – —"   input
        "text - - --"  output
