PerlMonks  


Yes indeed it should be UTF-8 but we've already established that the file has "strange" (being polite) encoding...

    $| = 1;
    $/ = undef;

    print "Reading JSON file";
    open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json'
        or die "Unable to read Charity JSON File";
    my $data = <$fh>;
    close $fh;
    print "...done\n";

    $data =~ s/^\x{feff}//; # Strip off BOM

    print "Decoding JSON file";
    my $js = decode_json $data; # line 24
    print "...done\n";

This takes about 10 minutes to read the 462 MB JSON file, then fails with:

    Decoding JSON fileWide character in subroutine entry at import.pl line 24

Given the time taken to open the file as UTF-8, and the error, I am thinking there is some nasty encoding hidden somewhere in this file.
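One thing worth ruling out before blaming the file: decode_json expects UTF-8 encoded bytes, whereas an :encoding(UTF-8) layer hands back already-decoded characters, and passing those to decode_json is a common source of "Wide character in subroutine entry". Below is a minimal sketch of the two consistent pairings, using JSON::PP (core) and a made-up sample string in place of the real file. The sample data and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use JSON::PP;    # core module; JSON::XS exposes the same two interfaces

# Hypothetical sample standing in for the real file: a BOM followed by
# JSON containing characters above U+00FF.
my $bytes = encode('UTF-8',
    "\x{feff}" . '{"name":"Caf' . "\x{e9} \x{2603}" . '"}');

# Pairing 1: raw bytes in, decode_json (which expects UTF-8 octets).
(my $raw = $bytes) =~ s/^\xEF\xBB\xBF//;    # strip the BOM as bytes
my $from_bytes = decode_json($raw);

# Pairing 2: characters in (what an :encoding(UTF-8) layer returns),
# so use a decoder object WITHOUT the utf8 option.
my $chars = decode('UTF-8', $bytes);
$chars =~ s/^\x{feff}//;                    # strip the BOM as a character
my $from_chars = JSON::PP->new->decode($chars);

print "both pairings agree\n"
    if $from_bytes->{name} eq $from_chars->{name};
```

If this is what is happening here, either reading the file raw or swapping decode_json for a decoder object without utf8 should get past that particular error, whatever else may be lurking in the file.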

UPDATE

Changing the encoding like so:

    print "Reading JSON file";
    open my $fh, '<', '../data/publicextract.charity.json'
        or die "Unable to read Charity JSON File";
    my $data = <$fh>;
    close $fh;
    print "...done\n";

    $data =~ s/^\357\273\277//; # Strip off BOM

takes about 5 minutes to open the file but gives the strange error:

    Decoding JSON fileKilled

"Strange" because the error doesn't include "at import.pl line 24"!

Another UPDATE

It seems I might be running out of memory... 380400 records in the JSON file seem to be too much...
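If memory really is the limit, one option is to avoid slurping and decoding the whole file at once, and instead pull one record at a time with the incremental parser (incr_parse and incr_text, available in both JSON::XS and JSON::PP). The following is only a sketch under loud assumptions: that the file is a single non-empty top-level array of objects, which has not been confirmed here, and that the helper name each_record and the 64 KB chunk size are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; JSON::XS offers the same incremental methods

# Call $on_record->($hashref) for each element of a top-level JSON array
# read from $fh, so only one decoded element is held at a time.
# ASSUMPTION: the input looks like [ {...}, {...}, ... ] and is not empty.
sub each_record {
    my ($fh, $on_record) = @_;
    my $json = JSON::PP->new->utf8;

    # Feed the parser another chunk; returns 0 at end of file.
    my $more = sub {
        my $n = read($fh, my $buf, 65536);
        die "read error: $!" unless defined $n;
        $json->incr_parse($buf) if $n;    # void context: buffer only
        return $n;
    };

    # Buffer data until an optional BOM and the opening "[" are removed.
    do {
        $more->() or die "no opening '[' found";
    } until $json->incr_text =~ s/^(?:\xEF\xBB\xBF)?\s*\[//;

    while (1) {
        # Read until one complete element has been parsed.
        my $obj;
        until (defined($obj = $json->incr_parse)) {
            $more->() or die "unexpected end of file";
        }
        $on_record->($obj);

        # Consume the "," before the next element, or the closing "]".
        while (1) {
            $json->incr_text =~ s/^\s*//;
            return if $json->incr_text =~ s/^\]//;
            last   if $json->incr_text =~ s/^,//;
            die "parse error between records" if length $json->incr_text;
            $more->() or die "unexpected end of file";
        }
    }
}

# Usage sketch with a small in-memory stand-in for the real file.
my $sample = qq(\xEF\xBB\xBF[{"id":1},\n {"id":2}, {"id":3}]);
open my $in, '<', \$sample or die "open failed: $!";
my @ids;
each_record($in, sub { push @ids, $_[0]{id} });
close $in;
print "read ", scalar(@ids), " records\n";
```

With the real file, the filehandle would be opened raw ('<') and the callback would process or store each record as it arrives rather than collecting all of them, otherwise the memory saving is lost.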


In reply to Re^10: Rogue character(s) at start of JSON file (BOM; dumping references) by Bod
in thread Rogue character(s) at start of JSON file by Bod
