http://qs321.pair.com?node_id=11115241

bliako has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am struggling to make either Data::Dump or Data::Dumper print rendered(?) Unicode characters rather than those ugly escapes, but I can't seem to succeed. Perl itself prints them nicely, but dump and dumper escape them.

    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';
    use Data::Dumper;
    use Data::Dump qw/pp/;
    my $pv = {'αβγ' => 'χψζ'}; #<<<proper greek key and value
    print pp($pv)."\n";
    print Dumper($pv);
    print "XX:'".$pv->{'αβγ'}."'\n"; # proper greek nicely printed

    # madness:
    { "\x{3B1}\x{3B2}\x{3B3}" => "\x{3C7}\x{3C8}\x{3B6}" }
    $VAR1 = {
              "\x{3b1}\x{3b2}\x{3b3}" => "\x{3c7}\x{3c8}\x{3b6}"
            };
    # nicely printed
    XX:'χψζ' #<<<< that's proper greek

thanks, bliako

Edit: sorry, I did not mention JSON (thanks haukex for reminding me). What I am trying to do is to visualise a long JSON by converting it to a Perl var and then possibly edit the Perl var, and finally save back to JSON (with the changes). So, yes, actually I am serialising and de-serialising but I can't seem to find an ascii-text-based, unicode-friendly serialiser other than Dump and Dumper. And for me, YAML is too tiring with all those spaces. Or I am just used to nested Perl data.

So, the input and output are JSON. Long JSON with unicode. I want to edit that JSON too. But it's too cumbersome in a text-based editor. And so I prefer to convert JSON to Perl, edit the Perl and then convert back to JSON. My procedure/tool was working until some unicode broke it.
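For readers following along, the JSON to Perl to JSON procedure described above might look roughly like the sketch below. It is only an illustration under assumptions: the original tool is not shown in the thread, JSON::PP (a core module) stands in for whichever JSON module was used, and the dump is read back with a string `eval`, which executes arbitrary code and should only be used on trusted text. It suggests that the `\x{...}` escapes, while ugly, still round-trip.

```perl
use strict;
use warnings;
use Data::Dumper;
use JSON::PP;

$Data::Dumper::Sortkeys = 1;    # stable key order makes the dump easier to edit

my $json = JSON::PP->new->canonical;

# JSON text; the \u escapes stand for Greek alpha-beta-gamma and chi-psi-zeta.
my $data = $json->decode('{"\u03b1\u03b2\u03b3":"\u03c7\u03c8\u03b6"}');

# Dump to Perl source. Non-ASCII characters come out as \x{...} escapes.
my $perl_text = Dumper($data);

# ... edit $perl_text by hand here ...

# Read the Perl source back. eval runs arbitrary code, so only do this
# with text you wrote or trust. The dump assigns to $VAR1, so declare it.
my $VAR1;
my $reloaded = eval $perl_text;
die "Could not reload dump: $@" if $@;

# Back to JSON. The Greek is emitted as real characters, so the output
# handle needs an encoding layer.
my $json_out = $json->encode($reloaded);
binmode STDOUT, ':encoding(UTF-8)';
print $perl_text, $json_out, "\n";
```

The escapes are a readability problem rather than a correctness problem here: the reloaded structure holds the same characters as the decoded JSON.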

Re: Convert JSON to Perl and back with unicode
by Corion (Patriarch) on Apr 09, 2020 at 10:02 UTC

    After some source diving into the (Perl) source of the pure Perl implementation of Data::Dumper, and some confusion, because usually Perl uses the XS version of Data::Dumper, which behaves differently in this case, I found the following approach works. It monkey-patches Data::Dumper (instead of inheriting from it, because Data::Dumper isn't written that way):

    #!perl
    use strict;
    use warnings;
    use utf8;
    use charnames ':full'; # just so I can use ASCII to represent Greek letters
    use Data::Dumper;
    $Data::Dumper::Useperl = 1;
    $Data::Dumper::Useqq='utf8';

    sub Data::Dumper::qquote {
      local($_) = shift;
      s/([\\\"\@\$])/\\$1/g;
      return qq("$_") unless /[[:^print:]]/;  # fast exit if only printables

      # Here, there is at least one non-printable to output.  First, translate the
      # escapes.
      s/([\a\b\t\n\f\r\e])/$Data::Dumper::esc{$1}/g;

      # no need for 3 digits in escape for octals not followed by a digit.
      s/($Data::Dumper::low_controls)(?!\d)/'\\'.sprintf('%o',ord($1))/eg;

      # But otherwise use 3 digits
      s/($Data::Dumper::low_controls)/'\\'.sprintf('%03o',ord($1))/eg;

      # all but last branch below not supported --BEHAVIOR SUBJECT TO CHANGE--
      my $high = shift || "";
      if ($high eq "iso8859") {   # Doesn't escape the Latin1 printables
        if ($Data::Dumper::IS_ASCII) {
          s/([\200-\240])/'\\'.sprintf('%o',ord($1))/eg;
        }
        elsif ($] ge 5.007_003) {
          my $high_control = utf8::unicode_to_native(0x9F);
          s/$high_control/sprintf('\\%o',ord($1))/eg;
        }
      } elsif ($high eq "utf8") {
        # Some discussion of what to do here is in
        #   https://rt.perl.org/Ticket/Display.html?id=113088
        #  use utf8;
        #  $str =~ s/([^\040-\176])/sprintf "\\x{%04x}", ord($1)/ge;
      } elsif ($high eq "8bit") {
        # leave it as it is
      } else {
        s/([[:^ascii:]])/'\\'.sprintf('%03o',ord($1))/eg;
        #s/([^\040-\176])/sprintf "\\x{%04x}", ord($1)/ge;
      }

      return qq("$_");
    };

    my $pv = {
        "\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER BETA}" => "\N{GREEK CAPITAL LETTER GAMMA}\N{GREEK CAPITAL LETTER DELTA}"
    };
    print Dumper $pv;
Re: Convert JSON to Perl and back with unicode
by haukex (Archbishop) on Apr 08, 2020 at 22:45 UTC

    Yes, those modules do that, see Re^3: Handling utf-8 characters when scraping and Re^3: Data::Dumper with unicode. You could try Data::Printer, although as I mention in those links, you may miss differences in whitespace characters or others, such as the difference between ä and ä.
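To illustrate the kind of difference that rendered output can hide, here is a small sketch. It assumes the two "ä" above were meant to be the precomposed and the decomposed form of the same letter (the thread does not say so explicitly) and uses the core Unicode::Normalize module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{e4}";      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
my $decomposed  = "a\x{308}";    # "a" followed by U+0308 COMBINING DIAERESIS

binmode STDOUT, ':encoding(UTF-8)';

# Both render as the same glyph in most terminals and fonts ...
printf "rendered: %s vs %s\n", $precomposed, $decomposed;

# ... but they are different strings until normalized.
printf "equal as-is:     %s\n", ($precomposed eq $decomposed ? 'yes' : 'no');
printf "lengths:         %d vs %d\n", length($precomposed), length($decomposed);
printf "equal after NFC: %s\n", (NFC($precomposed) eq NFC($decomposed) ? 'yes' : 'no');
```

An escaping dumper would show `\x{e4}` versus `a\x{308}`, which is why the escapes can be useful when debugging.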

    You mention JSON in the title, what does this have to do with it? You definitely shouldn't use any of the above modules for JSON, even if it's hinted in one or two places, because they are primarily debugging tools, and less serialization modules (even though they're sometimes used as such, even by myself). Use JSON::MaybeXS...

      thanks for the suggestions. I mentioned JSON as what I am trying to do is to visualise a long JSON by converting it to a Perl var and then possibly edit the perl var, and finally save back to JSON (with the changes). But then my post got reduced to that minimum example without JSON. So, yes, actually I am serialising and de-serialising but I can't seem to find an ascii-text-based, unicode-friendly serialiser other than Dump and Dumper. Is the output of Data::Printer good for reading it back as a Perl var? I could not see it does that in the doc. I will edit my post to add this paragraph to it.

        Is the output of Data::Printer good for reading it back as a Perl var?

        No, I don't believe it is - hence my comment about those modules being mostly debugging tools.

        what I am trying to do is to visualise a long JSON by converting it to a Perl var and then possibly edit the perl var, and finally save back to JSON (with the changes)

        You might give YAML a try, such as tinita's YAML::PP, that should be better for round-tripping and more readable than JSON.

        Update: I see that you already commented on YAML in your edit of the root node. Sorry, but the dumper modules aren't going to give you the behavior you want, so you really might want to reconsider YAML. Otherwise, unless there's a CPAN module that I've missed, if you really, really want Perl, you could roll your own, but that comes with so many caveats I can't recommend it. Or try a JSON pretty printer to make it more readable...
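As a rough sketch of the pretty-printer route, the following uses the core JSON::PP module (an assumption; any JSON module with the same `pretty`/`canonical` interface should behave similarly). The sample input string is made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP;

# pretty adds indentation and spacing; canonical sorts the keys.
# Without ->ascii, non-ASCII characters are emitted as-is rather than
# as \u escapes.
my $json = JSON::PP->new->pretty->canonical;

# Compact input; the \u escapes stand for Greek letters.
my $compact = '{"b":[1,2,{"\u03b1\u03b2\u03b3":"\u03c7\u03c8\u03b6"}],"a":null}';
my $pretty  = $json->encode($json->decode($compact));

# The result is a character string, so the handle needs an encoding layer.
binmode STDOUT, ':encoding(UTF-8)';
print $pretty;
```

The output stays valid JSON, so it can be edited in a text editor and decoded again without a separate Perl representation.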

Re: Convert JSON to Perl and back with unicode
by kcott (Archbishop) on Apr 09, 2020 at 10:33 UTC

    G'day bliako,

    I'm assuming that by "long JSON" you're referring to JSON with most of the whitespace removed which, I agree, can be almost impossible to read. In order to make this more readable, you could use a formatter. There are several free ones available; I have "JSON Formatter and Validator" bookmarked. I do use it a fair bit, but mostly for the validation functionality.

    If by "edit the Perl" you're talking about modifying the Perl data structure programmatically, you could use something along the following lines.

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use autodie;
    use utf8;
    
    use JSON;
    
    my $json_in = 'pm_11115241_uni_greek.json';
    my $json_out = 'pm_11115241_uni_greek_edit.json';
    
    _print_json_file($json_in);
    my $json_text = read_json($json_in);
    my $perl_ref = decode_json $json_text;
    _print_perl_json($perl_ref);
    edit_perl_json($perl_ref);
    _print_perl_json($perl_ref);
    write_json($perl_ref, $json_out);
    _print_json_file($json_out);
    
    sub read_json {
        my ($file) = @_;
    
        open my $fh, '<', $file;
        local $/;
        return <$fh>;
    }
    
    sub write_json {
        my ($perl, $file) = @_;
    
        my $json_text = JSON->new->pretty->encode($perl);
    
        open my $fh, '>:encoding(UTF-8)', $file;
        print $fh $json_text;
    }
    
    sub edit_perl_json {
        my ($perl) = @_;
    
        my $greek_key = 'ΙΚΛΜΝΞΟΠ';
        my $greek_val = 'ικλμνξοπ';
    
        $perl->{$greek_key} = $greek_val;
    }
    
    sub _print_json_file {
        my ($file) = @_;
    
        print "*** Contents of '$file' ***\n";
    
        system cat => $file;
    }
    
    sub _print_perl_json {
        my ($perl) = @_;
    
        print "*** Perl from JSON ***\n";
    
        use open OUT => qw{:encoding(UTF-8) :std};
    
        for (sort keys %$perl) {
            print $_, ' = ', $perl->{$_}, "\n";
        }
    }
    

    Here's a sample run:

    $ ./pm_11115241_uni_json_perl.pl
    *** Contents of 'pm_11115241_uni_greek.json' ***
    {
        "ΑΒΓΔΕΖΗΘ" : "αβγδεζηθ"
    }
    *** Perl from JSON ***
    ΑΒΓΔΕΖΗΘ = αβγδεζηθ
    *** Perl from JSON ***
    ΑΒΓΔΕΖΗΘ = αβγδεζηθ
    ΙΚΛΜΝΞΟΠ = ικλμνξοπ
    *** Contents of 'pm_11115241_uni_greek_edit.json' ***
    {
       "ΑΒΓΔΕΖΗΘ" : "αβγδεζηθ",
       "ΙΚΛΜΝΞΟΠ" : "ικλμνξοπ"
    }
    

    Although <code> tags are generally preferred for code and output, when dealing with Unicode, <pre> tags will not convert your characters to HTML entities. For inline, as opposed to block, markup, I use <tt> tags for the same purpose.

    — Ken

Re: Convert JSON to Perl and back with unicode
by leszekdubiel (Scribe) on Apr 09, 2020 at 21:49 UTC
    I am struggling to make either of Data::Dump or Data::Dumper to print rendered(?) unicode characters rather than those ugly escapes but I can't seem to succeed.

    Normally I use:

    #!/usr/bin/perl -CSDA
    use utf8;
    use Modern::Perl qw{2017};
    use Data::Dumper;
    print STDERR Dumper(\%mydatatoshow) =~ s/\\x\{([0-9a-f]{2,})\}/chr(hex($1))/ger;
      Nice and easy! But, it's prone to Unicode injection:
      #! /usr/bin/perl
      use warnings;
      use strict;
      use Data::Dumper;

      my %mydatatoshow = (
          "\N{GREEK SMALL LETTER ALPHA}" => "\\\N{GREEK SMALL LETTER BETA}",
          '\\x{3b1}' => '\\\\x{3b2}',
      );
      binmode STDERR, ':encoding(UTF-8)';
      print STDERR Dumper(\%mydatatoshow);
      print STDERR Dumper(\%mydatatoshow) =~ s/\\x\{([0-9a-f]{2,})\}/chr(hex($1))/ger;

      Update: Possibly fixable by

      s/((\\+)x\{([0-9a-f]{2,})\})/ (length($2) % 2) ? substr($2, 1) . chr hex $3 : $1/ger;
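A small sketch of how that substitution behaves, applied to hand-written strings instead of real Dumper output. The helper name `unescape_dumper` is hypothetical; the regex is the one given above. The idea is that an odd run of backslashes before `x{...}` ends in a genuine escape, while an even run is only escaped backslashes followed by literal text.

```perl
use v5.14;                      # for the /r modifier; also enables strict
use warnings;

# Hypothetical helper wrapping the substitution from the post above.
sub unescape_dumper {
    my ($text) = @_;
    return $text =~ s/((\\+)x\{([0-9a-f]{2,})\})/ (length($2) % 2) ? substr($2, 1) . chr hex $3 : $1/ger;
}

my $escaped = '\x{3b1}';        # one backslash: a real escape for alpha
my $literal = '\\\\x{3b1}';     # two backslashes: dumped literal text \x{3b1}
my $mixed   = '\\\\\x{3b2}';    # three: an escaped backslash, then an escape for beta

binmode STDOUT, ':encoding(UTF-8)';
say "$_ -> ", unescape_dumper($_) for $escaped, $literal, $mixed;
```

Only the first and third strings are changed; the second is left alone, which is the case the simpler substitution gets wrong.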

      Verified by

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Convert JSON to Perl and back with unicode
by Anonymous Monk on Apr 09, 2020 at 05:51 UTC
    JSON->new->utf8(1)->pretty->encode( $perl_scalar );

    $perl_scalar = $json->decode($json_text)
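What the `utf8` option changes can be seen in a short sketch. It uses the core JSON::PP module as a stand-in (an assumption; the snippet above does not say which JSON module is loaded): with `utf8` enabled, `encode` returns UTF-8 encoded bytes and `decode` expects them; without it, both work on character strings.

```perl
use strict;
use warnings;
use utf8;                       # the Greek literals below are characters
use JSON::PP;

my $data = ['α'];

my $chars = JSON::PP->new->encode($data);        # character string
my $bytes = JSON::PP->new->utf8->encode($data);  # UTF-8 encoded bytes

# The alpha is one character but two bytes in UTF-8.
printf "characters: %d, bytes: %d\n", length($chars), length($bytes);

# Decoding has to use the matching setting.
my $back = JSON::PP->new->utf8->decode($bytes);
print "round trip ok\n" if $back->[0] eq 'α';
```

As a rule of thumb, use `utf8` when reading or writing raw file handles, and leave it off when the handle already has an `:encoding(UTF-8)` layer, otherwise the text ends up encoded twice.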

Re: Convert JSON to Perl and back with unicode
by 1nickt (Canon) on Apr 10, 2020 at 16:30 UTC