Re: malformed UTF-8 character in JSON string in perl
by Corion (Patriarch) on Aug 11, 2015 at 12:13 UTC
|
What's wrong with Text::Unidecode?
Also, why/how is that UTF-8 character malformed? Are you sure that your script is in the correct encoding for your stanza to work? I prefer to be explicit when using characters above 127 in my scripts and use \N{PILE OF POO} instead of inserting the character verbatim into the source code in the hope that it won't get mangled.
Upon rereading your post, if you use Data::Dumper, the escaping of the output string is a feature. Maybe you can be more explicit in what output you get and what output you really want, and whether the output should be ASCII or UTF-8, and whether the input should be ASCII or UTF-8.
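A minimal sketch of the "be explicit" approach described above, assuming the character in question is U+2013 EN DASH (the variable name is only illustrative). Because the dash is spelled by name, the source file stays pure ASCII and cannot be mangled by an editor or a wrong source encoding:

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} names on Perls older than 5.16

# Spell the en dash by name so the source file stays pure ASCII.
my $text = "text \N{EN DASH} abcd";

# The string holds one character with code point U+2013, not three bytes.
printf "length: %d\n", length $text;
printf "code point of 6th character: U+%04X\n", ord substr($text, 5, 1);
```

Printing $text itself still requires an output encoding to be chosen; this only removes the source encoding from the list of things that can go wrong.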
|
There is nothing wrong with Text::Unidecode, but I don't want to use this module. I would like to convert Unicode characters such as single quotes, hyphens, double quotes, and bullets in my input data to ASCII characters.
Can you please print the two statements below and look at the output?
1) "text - abcd"
2) "text – abcd"
When you print them, the first one comes out unchanged. For the second one, however, the output looks like "text ΤΗτ abcd", which I do not want.
|
There is nothing wrong with Text::Unidecode, but I don't want to use this module.
Then you are SOL, aren't you? This is the easiest way to accomplish what you're after...So what? Is this homework?
Re: malformed UTF-8 character in JSON string in perl
by Laurent_R (Canon) on Aug 11, 2015 at 12:47 UTC
|
Your problem does not seem to be related to JSON. I get the same without using JSON:
$ perl -E '
use utf8;
use Data::Dumper;
my $data = qq( { "cat" : "text – abcd" } );
print Dumper($data);
'
$VAR1 = " { \"cat\" : \"text \x{2013} abcd\" } ";
If I do not use Data::Dumper, I obtain this:
$ perl -E 'use utf8;
my $data = qq( { "cat" : "text – abcd" } );
say $data
'
Wide character in say at -e line 4.
{ "cat" : "text – abcd" }
i.e. a warning but the right output. As previously mentioned by Corion, it is a feature of Data::Dumper to display non-ASCII characters as \x{...} escape sequences.
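To illustrate why the escaping is harmless, here is a small sketch (variable names are illustrative, and $Data::Dumper::Terse is set only to drop the "$VAR1 = " prefix). The dump is ASCII-only Perl source, and evaluating it should give back the original character string:

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Terse = 1;   # print the bare value without "$VAR1 = "

my $original = "text \x{2013} abcd";
my $dumped   = Dumper($original);

print $dumped;                    # ASCII-only Perl source text

# Evaluating the dump gives back the very same character string.
my $restored = eval $dumped;
print "round trip ok\n" if $restored eq $original;
```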
Using binmode, as I already told you several days ago in an answer to your previous post with the same content, I no longer have any warning:
$ perl -E 'use utf8;
my $data = qq( { "cat" : "text – abcd" } );
binmode STDOUT, ":utf8";
say $data;
'
{ "cat" : "text – abcd" }
Have you tried binmode?
|
use strict;
use warnings;
use utf8;
use JSON;
my $data = "text – abcd";
binmode STDOUT, ":utf8";
print "$data";
Please remember that the character in the statement ("text – abcd") is not a regular dash.
|
$ perl -e '
use strict;
use warnings;
use utf8;
use Data::Dumper;
my $data = "regular dash - other type of beast: – abcd";
print Dumper $data;
binmode STDOUT, ":utf8";
print "$data";
'
$VAR1 = "regular dash - other type of beast: \x{2013} abcd";
regular dash - other type of beast: – abcd
If this does not work for you and it does for me, I would suspect that either your version of Perl is too old (I am using 5.14) or that there is something wrong in your terminal configuration.
|
binmode STDOUT, ":utf8";
But is your terminal in UTF-8 mode? Try running perl yourscript.pl > utf8.txt and opening utf8.txt as a UTF-8 text file. If you need to output Unicode characters to the terminal, try Encode::Locale and binmode STDOUT, ":encoding(console_out)" (assuming that your terminal uses an encoding that includes these Unicode characters; on Windows you may want to run chcp 65001 first).
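One way to take the terminal out of the picture entirely is to look at the bytes an encoding layer produces. The sketch below is only an illustration: it writes through a UTF-8 layer into an in-memory file instead of STDOUT, then prints the resulting bytes in hex. The en dash should appear as the three bytes E2 80 93, and a terminal that is not in UTF-8 mode would render those as three separate characters, which would be consistent with the stray characters reported earlier in the thread.

```perl
use strict;
use warnings;

my $text = "text \x{2013} abcd";

# Write through an encoding layer into an in-memory file instead of a terminal.
my $bytes = '';
open my $fh, '>:encoding(UTF-8)', \$bytes
    or die "cannot open in-memory file: $!";
print {$fh} $text;
close $fh;

# Show the bytes that a terminal or file would receive.
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```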
Re: malformed UTF-8 character in JSON string in perl
by ikegami (Patriarch) on Aug 11, 2015 at 14:50 UTC
|
decode_json(encode_utf8($json))
is silly since it's effectively
from_json(decode_utf8(encode_utf8($json)))
That's why we told you that you should be using from_json.
use utf8;
use JSON qw( from_json );
my $json = q({ "cat" : "text – abcd" });
my $data = from_json($json);
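The equivalence claimed above can be checked with a short sketch. It assumes the core JSON::PP module as a stand-in for JSON so that it runs without extra installs: JSON::PP->new->decode takes a character string (the role from_json plays), while adding ->utf8 makes the decoder expect UTF-8 bytes (the role decode_json plays). The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );
use JSON::PP;

# Character string holding JSON text, with the en dash as one character.
my $json = qq({ "cat" : "text \x{2013} abcd" });

# Path 1: decode the character string directly (the role from_json plays).
my $from_chars = JSON::PP->new->decode($json);

# Path 2: encode to UTF-8 bytes first, then let the decoder undo that.
my $from_bytes = JSON::PP->new->utf8->decode(encode_utf8($json));

print "same result\n" if $from_chars->{cat} eq $from_bytes->{cat};
```

Both paths should yield the same Perl string, so the encode step in the second path is extra work that the decoder immediately reverses.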
As for your new question, U+2013 EN DASH ("–") isn't found in ASCII's character set, so it can't be encoded using ASCII.
But unless your terminal uses ASCII, there's no reason to convert to ASCII. All you need to do is tell Perl what encoding your terminal expects. You can either encode explicitly, or you can use the "open" pragma as follows:
use open ':std', ':encoding(UTF-8)';
print($data->{cat}, "\n");
The whole script:
#!/usr/bin/perl
use strict;
use warnings;
use utf8; # Source encoded using UTF-8.
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8.
use JSON qw( from_json );
my $json = q({ "cat" : "text – abcd" });
my $data = from_json($json);
print($data->{cat}, "\n");
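For the "encode explicitly" alternative, here is a hedged sketch using the core Encode module (variable names are illustrative). It also shows what "can't be encoded using ASCII" means in practice: with Encode's default settings, a character the target encoding lacks is replaced by a substitution character rather than converted.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $text = "text \x{2013} abcd";

# ASCII has no slot for U+2013, so the default is a substitution character.
my $ascii = encode('ascii', $text);

# UTF-8 can represent it, at the cost of three bytes for the one character.
my $utf8 = encode('UTF-8', $text);

print "ASCII: $ascii\n";
printf "UTF-8 length in bytes: %d\n", length $utf8;
```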
|
I repeat:
All you need to do is tell Perl what encoding your terminal expects.
You specified the wrong encoding. You told Perl your terminal expects UTF-8, but that's wrong. Use the right encoding!
Re: malformed UTF-8 character in JSON string in perl
by Anonymous Monk on Aug 11, 2015 at 12:21 UTC
|
This post is identical (except formatting) to your last post Malformed UTF-8 character from last week. Please read and respond to the replies there instead of re-posting.
|
His last post was actually quite different when the answers were posted. The answers there do not answer his current question.
Re: malformed UTF-8 character in JSON string in perl
by tangent (Parson) on Aug 11, 2015 at 14:49 UTC
|
Is [there] any module other than Text::Unidecode, or a method of converting these characters like ',",-,.,? to simple ASCII characters?
Why do you not want to use Text::Unidecode? It does exactly what you are requesting:
use utf8;
use JSON;
use Encode qw(encode_utf8);
use Text::Unidecode;
my $data = qq( { "cat" : "text abcd " } );
my $json_data = encode_utf8( $data );
my $perl_hash = decode_json( $json_data );
while ( my($k,$v) = each %$perl_hash ) {
unidecode($v);
print "$k => $v\n";
}
Output:
cat => text - abcd " ' " '
|
I used the second one. It might be easier to use the names - in the following I input plain dash "-", en dash "–" and em dash "—" and the output is plain dash, plain dash, double plain dash:
"text - – —" input
"text - - --" output
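If installing Text::Unidecode really is not an option, a hand-rolled mapping for a handful of characters is possible. The following is only a sketch, not what Text::Unidecode does internally: the helper name ascii_punct is hypothetical, and it covers just the punctuation listed in its comments, leaving every other non-ASCII character untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: maps only the punctuation listed here, nothing else.
sub ascii_punct {
    my ($s) = @_;
    # Curly single quotes, curly double quotes, en dash (one-to-one).
    $s =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}\x{2013}/''""-/;
    $s =~ s/\x{2014}/--/g;   # em dash becomes two hyphens
    $s =~ s/\x{2022}/*/g;    # bullet becomes an asterisk
    return $s;
}

my $out = ascii_punct("text - \x{2013} \x{2014}");
print "$out\n";
```

A fixed table like this has to be extended by hand for every new character that shows up in the input, which is the maintenance burden the module takes off your hands.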