
Re^9: Rogue character(s) at start of JSON file (BOM; dumping references)

by LanX (Saint)
on Jan 19, 2023 at 19:57 UTC ([id://11149704])


in reply to Re^8: Rogue character(s) at start of JSON file (BOM; dumping references)
in thread Rogue character(s) at start of JSON file

> decode_json

  • encode_json

    Converts the given Perl data structure to a UTF-8 encoded, binary string (that is, the string contains octets only) ...

  • decode_json

    ... The opposite of encode_json: expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text ...
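To make the quoted contract concrete, here is a minimal sketch of the functional interface. It assumes the core module JSON::PP, which exports the same two functions as JSON::XS, and it uses a made-up one-element array rather than anything from the file under discussion.

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);    # core; JSON::XS exports the same two functions

# UTF-8 octets, as they would come from a file opened in raw mode: ["whät"]
my $octets = qq(["wh\xC3\xA4t"]);

# decode_json: octets in, Perl characters out (the two-byte "ä" becomes one character).
my $data = decode_json($octets);
printf "characters: %d, third one: U+%04X\n",
    length($data->[0]), ord(substr($data->[0], 2, 1));

# encode_json: Perl characters in, UTF-8 octets out.
my $again = encode_json($data);
print "same octets after the round trip\n" if $again eq $octets;
```

The point of the sketch is only that both functions deal in octets on the JSON side, so the input should not be decoded by an I/O layer first.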

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^10: Rogue character(s) at start of JSON file (BOM; dumping references)
by Bod (Parson) on Jan 19, 2023 at 21:11 UTC

    Yes indeed it should be UTF-8 but we've already established that the file has "strange" (being polite) encoding...

        $| = 1;
        $/ = undef;

        print "Reading JSON file";
        open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json'
            or die "Unable to read Charity JSON File";
        my $data = <$fh>;
        close $fh;
        print "...done\n";

        $data =~ s/^\x{feff}//;    # Strip off BOM

        print "Decoding JSON file";
        my $js = decode_json $data;    # line 24
        print "...done\n";

    This takes about 10 minutes to read the 462 MB JSON file, then prints "Decoding JSON file" and fails with: Wide character in subroutine entry at import.pl line 24

    Given the time taken to open the file in UTF-8 and the error, I am thinking there is some nasty encoding hidden somewhere in this file
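One possible reading of the "Wide character" failure, sketched under assumptions: a string read through '<:encoding(UTF-8)' already holds decoded characters, while decode_json expects octets. The snippet uses the core module JSON::PP (same calls as JSON::XS) and an invented record, not data from the charity file.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; JSON::XS offers the same calls

# Stand-in for content read through '<:encoding(UTF-8)':
# decoded characters, one of them above U+00FF.
my $chars = qq({"name":"Caf\x{e9} \x{20ac}5 Club"});

# decode_json wants UTF-8 octets, so decoded characters make it croak.
my $ok  = eval { decode_json($chars); 1 };
my $err = $ok ? '' : $@;
print "decode_json failed: $err" unless $ok;

# A coder object without ->utf8 accepts decoded characters as they are.
my $data = JSON::PP->new->decode($chars);
print "object interface decoded ", length($data->{name}), " characters\n";
```

If this is what happens here, the file itself need not contain anything nasty; the mismatch would be between the I/O layer and the decoder.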

    UPDATE

    Changing the encoding like so

        print "Reading JSON file";
        open my $fh, '<', '../data/publicextract.charity.json'
            or die "Unable to read Charity JSON File";
        my $data = <$fh>;
        close $fh;
        print "...done\n";

        $data =~ s/^\357\273\277//;    # Strip off BOM
    takes about 5 minutes to open the file but gives a strange error: after "Decoding JSON file" it just prints "Killed"

    "Strange" because the error doesn't include "at import.pl line 24"!

    Another UPDATE

    It seems I might be running out of memory...380400 records in the JSON file seems to be too much...
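If memory really is the limit, one option is the incremental parser that JSON::XS and JSON::PP both provide (incr_parse and incr_text). The sketch below is not tested against the charity file. It assumes the top level of the file is a single array of records, uses the core module JSON::PP so that it runs without extra installs, and each_record is a hypothetical helper name.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; JSON::XS provides the same incr_* methods

# Hypothetical helper: calls $on_record for each element of a top-level
# JSON array read from $fh, so the whole structure is never built at once.
# ASSUMPTION: the input is one big array, optionally preceded by a UTF-8 BOM.
sub each_record {
    my ($fh, $on_record, $chunk_size) = @_;
    $chunk_size ||= 65536;
    my $json = JSON::PP->new->utf8;    # input is raw UTF-8 octets

    # Append the next chunk to the parser's buffer (void context: no parsing).
    my $more = sub {
        my $got = read $fh, my $buf, $chunk_size;
        die "read error: $!" unless defined $got;
        die "unexpected end of JSON input\n" unless $got;
        $json->incr_parse($buf);
        return;
    };

    # Skip an optional BOM and the opening "[".
    $more->();
    $more->() until $json->incr_text =~ s/^ (?:\xEF\xBB\xBF)? \s* \[ //x;

    my $count = 0;
    while (1) {
        # Between elements: drop whitespace, refill the buffer if it is empty.
        while (1) {
            $json->incr_text =~ s/^\s+//;
            last if length $json->incr_text;
            $more->();
        }
        last if $json->incr_text =~ s/^\]//;    # closing "]": done
        if ($count) {
            $json->incr_text =~ s/^,//
                or die "parse error near: " . substr($json->incr_text, 0, 20) . "\n";
        }

        # Parse one element, reading more data until it is complete.
        my $record;
        $more->() until defined($record = $json->incr_parse);
        $on_record->($record);
        $count++;
    }
    return $count;
}

# Demo on an in-memory "file" with a BOM and a tiny chunk size;
# a real run would open the JSON file in raw mode ('<') instead.
my $sample = qq(\xEF\xBB\xBF[{"id":1,"name":"wh\xC3\xA4t"},\n {"id":2,"name":"\xC3\xB6ver"}]);
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my @ids;
my $n = each_record($fh, sub { push @ids, $_[0]{id} }, 16);
close $fh;
print "parsed $n records, ids: @ids\n";
```

The callback sees one record at a time, so whatever import.pl does with each record would go there instead of walking a fully decoded structure.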

      Unfortunately°, as already said, decode_json and encode_json default to UTF-8 encoded octet streams.

      If your string is already decoded (internal Unicode characters), you need to use the object interface:

          use v5.12;
          use warnings;
          use Devel::Peek;
          use Data::Dump qw/pp/;
          use JSON::XS;

          use utf8;
          my $str = "\x{feff}" . '["whät","över"]';   # Internal Unicode b/c of use utf8
          Dump($str);                                 # hence shows UTF8 flag

          $str =~ s/^\x{feff}//;                      # strip BOM by Unicode code-point
          Dump($str);                                 # shows UTF8 flag

          my $JSON = JSON::XS->new;                   # coder for all unicode in/out

          my $data = $JSON->decode($str);
          warn pp '$data: ', $data;                   # ["wh\xE4t", "\xF6ver"]
                                                      # NB: \xE4, \xF6 correct codepoints for umlauts
                                                      # even if each character needed 2 bytes

          my $str2 = $JSON->encode($data);            # roundtrip
          Dump($str2);                                # shows UTF8 flag

          warn '$str eq $str2: ', $str eq $str2;      # same

          SV = PV(0x6bcea8) at 0x26485f0
            REFCNT = 1
            FLAGS = (POK,pPOK,UTF8)
            PV = 0x28229e8 "\357\273\277[\"wh\303\244t\",\"\303\266ver\"]"\0 [UTF8 "\x{feff}["wh\x{e4}t","\x{f6}ver"]"]
            CUR = 20
            LEN = 22
          SV = PV(0x6bcea8) at 0x26485f0
            REFCNT = 1
            FLAGS = (POK,pPOK,UTF8)
            PV = 0x28227e8 "[\"wh\303\244t\",\"\303\266ver\"]"\0 [UTF8 "["wh\x{e4}t","\x{f6}ver"]"]
            CUR = 17
            LEN = 24
          ("\$data: ", ["wh\xE4t", "\xF6ver"]) at d:/perl/pm/t_devel_peek.pl line 21.
          SV = PV(0x6bd158) at 0x27482a0
            REFCNT = 1
            FLAGS = (POK,pPOK,UTF8)
            PV = 0x281d9c8 "[\"wh\303\244t\",\"\303\266ver\"]"\0 [UTF8 "["wh\x{e4}t","\x{f6}ver"]"]
            CUR = 17
            LEN = 66
          $str eq $str2: 1 at d:/perl/pm/t_devel_peek.pl line 28.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

      °) or fortunately? depends on the perspective
