Rogue character(s) at start of JSON file

by Bod (Parson)
on Jan 16, 2023 at 22:01 UTC ( [id://11149618]=perlquestion )

Bod has asked for the wisdom of the Perl Monks concerning the following question:

I'm processing some JSON files using JSON and getting this error:

    malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "\x{feff}[{"registered...")

So I printed out the JSON file from the Perl script and sure enough there are three rogue characters before the opening square bracket. These do not show up in my text editor TextPad.

A search has found this explanation. However, the JSON files are being pulled from a UK Government data source and I have no control over how they are made. So I have to deal with the character(s) somehow.

Here's my test code:

    use strict;
    use warnings;
    use JSON;
    use Data::Dumper;

    $/ = undef;
    open my $fh, '<', 'charity.json';
    my $data = <$fh>;
    close $fh;

    print unpack("W", substr($data, 0, 1)) . ' - ' . unpack("W", substr($data, 1, 2)) . ' - ' . unpack("W", substr($data, 2, 3)) . "\n\n";

    $data =~ s/.*?\[/\[/; # <-- fudge to clear character(s)

    my $json = decode_json $data;
    print Dumper $json;

Using this JSON file as a test...

    [{"registered":true,"insolvent":false,"administration":true,"test":false}]

Unpacking the first three characters gives 239 - 187 - 191 and the substitution seems to have the desired effect but it seems to be a bit of a fudge!

Can you suggest a "better" way to deal with this?

The output from Data::Dumper is a bit strange:

    $VAR1 = [
              {
                'administration' => bless( do{\(my $o = 1)}, 'JSON::PP::Boolean' ),
                'registered' => $VAR1->[0]{'administration'},
                'test' => bless( do{\(my $o = 0)}, 'JSON::PP::Boolean' ),
                'insolvent' => $VAR1->[0]{'test'}
              }
            ];

I've come across bless( do{\(my $o = 0)}, 'JSON::PP::Boolean' ) before instead of false, but not $VAR1->[0]{'test'}. I guess this is so Data::Dumper doesn't have to create an object for each boolean. Instead, it represents them in terms of ones it has previously created. Is that about right?

I have proved that this is just Data::Dumper and not the underlying data structure by this dereference:

    foreach my $key(keys %{@{$json}[0]}) {
        print "$key - ";
        print ${@{$json}[0]}{$key};
        print "\n";
    }

Which produces zeros and ones for false and true...

Replies are listed 'Best First'.
Re: Rogue character(s) at start of JSON file
by GrandFather (Saint) on Jan 17, 2023 at 02:33 UTC
      UTF-8 text files with Byte Order Mark may be of interest.

      The OP in that thread just doesn't want to believe ikegami, who posts his solution essentially three times (see Re^3: UTF-8 text files with Byte Order Mark). I've seen it more than once that people just don't want to believe him. Usually it is something involving representations.

        I've seen it more than once that people just don't want to believe him

        Sometimes, we hear or read things that just don't make sense. Our natural instinct is to reject them, especially if they go against what we hold to be true.

        However, there are a few monks who, when they say something that seems wrong, cause me to carefully check that I understand what they are saying and then question my own background knowledge. ikegami is one of those monks! That's not to say that anyone is infallible...just that the words of some monks cause more self-questioning than others because they are probably saying something wise or, at the very least, something I can learn from.

Re: Rogue character(s) at start of JSON file (BOM; dumping references)
by LanX (Saint) on Jan 17, 2023 at 00:08 UTC
    > \x{feff}

    It's a BOM and I seem to remember we had the discussion before.

    Purging exactly this with a regex anchored at the start of the file° is fine.

    > s/.*?\[/\[/

    Better: s/^\x{feff}//

    But don't include the \[ in the pattern: a JSON stream doesn't need to start with an array.
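The anchored substitution matches the character U+FEFF, so it relies on the data having been decoded from UTF-8 first. The following is a minimal sketch of that order of operations, assuming the core modules Encode and JSON::PP and a made-up in-memory sample in place of the real file:

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;

# Assumed sample: the three UTF-8 BOM bytes followed by a small JSON array.
my $bytes = "\xEF\xBB\xBF" . '[{"registered":true,"test":false}]';

# Turn the UTF-8 bytes into Perl characters; the BOM becomes U+FEFF.
my $text = decode('UTF-8', $bytes);

# Remove a single BOM at the very start of the string, and nowhere else.
$text =~ s/\A\x{FEFF}//;

# The string is already decoded, so use a decoder without the utf8 option.
my $json = JSON::PP->new->decode($text);

print scalar(@$json), " record(s), registered = ",
    ($json->[0]{registered} ? 'true' : 'false'), "\n";
```

Note that decode_json expects UTF-8 bytes rather than characters, which is why the sketch uses JSON::PP->new->decode on the decoded string.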

    > Instead, it represents them in terms of ones it has previously created. Is that about right?

    That's how Data::Dumper handles repeated references and circular structures, in order to accurately reproduce the original data after evaluation.

    EDIT

    from the DESCRIPTION:

      ... Handles self-referential structures correctly.

      The return value can be evaled to get back an identical copy of the original reference structure. (Please do consider the security implications of eval'ing code from untrusted sources!)
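The cross-referencing can be seen without JSON at all. This is a small sketch, assuming a made-up structure in which one hash is referenced twice (much as the decoded booleans in the question appear to share objects). It also shows the documented $Data::Dumper::Deepcopy setting, which writes repeated references out in full:

```perl
use strict;
use warnings;
use Data::Dumper;

# One hash referenced twice from the same array (illustrative data only).
my $shared = { flag => 1 };
my $data   = [ $shared, $shared ];

$Data::Dumper::Sortkeys = 1;

# Default: the second occurrence is written as a path to the first.
my $default = Dumper($data);

# Deepcopy: repeated references are written out in full each time.
my $expanded = do {
    local $Data::Dumper::Deepcopy = 1;
    Dumper($data);
};

print "Default:\n$default\nDeepcopy:\n$expanded";
```

Deepcopy output is easier to read, but evaluating it no longer reproduces the sharing between the two elements.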

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

    update

    °) NB I don't recommend deleting any BOMs elsewhere, they could have a meaning. And if they really caused an error this should be investigated thoroughly. A BOM at the file's start (and only there) is often added automatically by many programs.

      «…they could have a meaning…»

      I'm not so sure about that anymore. At least in this case. More or less by accident I stumbled across this. For convenience, the link to the RFC. Regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        > > …they could have a meaning…

        > I'm not so sure about that anymore.

        Simpl(-istic) example, imagine a data-structure

        %unicode = ( ..., "BOM" => "\x{FEFF}", ... )

        and transfer it as JSON.

        Removing all "\x{FEFF}" would wreck the data.

        While a BOM at the string's start is illegal JSON, it's not uncommon.

        So removing just a leading BOM, before trying to convert the JSON, is safe.
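The difference between the two kinds of removal can be sketched as follows, assuming JSON::PP and a made-up one-key structure like the %unicode example above:

```perl
use strict;
use warnings;
use JSON::PP;

my $codec = JSON::PP->new->canonical;

# A BOM character stored as ordinary data, preceded by an unwanted leading BOM.
my $text = "\x{FEFF}" . $codec->encode({ BOM => "\x{FEFF}" });

# Anchored removal touches only the first character of the text.
(my $anchored = $text) =~ s/\A\x{FEFF}//;

# Global removal also deletes the BOM held inside the string value.
(my $global = $text) =~ s/\x{FEFF}//g;

printf "anchored: value has %d character(s)\n",
    length($codec->decode($anchored)->{BOM});
printf "global:   value has %d character(s)\n",
    length($codec->decode($global)->{BOM});
```

Both variants still parse as JSON, so the damage from the global version would go unnoticed unless the values were checked.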

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

      A BOM at the file's start (and only there) is often added automatically by many programs

      As used by UK Government departments, it seems!

        Misguidedly I'd say

        JSON is always supposed to be utf-8 as far as I know.

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

      Better: s/^\x{feff}//

      I agree that this option seems better...but it doesn't work!

          print "$data\n";
          $data =~ s/^\x{feff}//; # Strip off BOM
          print "$data\n";

      This prints out two identical lines, both starting [{"date_of_extract":"2023-01-16T00:00:00"

      Could it be that the BOM character is not FEFF despite the error?

          malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "\x{feff}[{"date_of_e...")

        something is otherwise wrong

        plz use Devel::Peek to find out if it's properly encoded and show us the result here.
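One possible explanation, assuming $data still holds the raw bytes read from disk (as the 239 - 187 - 191 output in the question suggests): the pattern looks for the single character U+FEFF, while an undecoded string contains the three bytes EF BB BF instead. A minimal sketch with a made-up sample string:

```perl
use strict;
use warnings;

# Raw bytes as read from disk without a decoding layer (assumed sample).
my $bytes = "\xEF\xBB\xBF" . '[{"registered":true}]';

# The pattern looks for the single character U+FEFF, which is not present
# in an undecoded string, so nothing is removed.
(my $char_strip = $bytes) =~ s/^\x{feff}//;

# Matching the three UTF-8 bytes of the BOM does remove it.
(my $byte_strip = $bytes) =~ s/^\xEF\xBB\xBF//;

printf "original:        %d bytes\n", length $bytes;
printf "after \\x{feff}:   %d bytes\n", length $char_strip;
printf "after EF BB BF:  %d bytes\n", length $byte_strip;
```

decode_json decodes the UTF-8 itself before parsing, which would explain why its error message reports \x{feff} even though the substitution on the raw string finds nothing to remove.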

        Cheers Rolf
        (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
        Wikisyntax for the Monastery

Re: Rogue character(s) at start of JSON file
by kcott (Archbishop) on Jan 19, 2023 at 22:53 UTC

    G'day Bod,

    "I'm processing some JSON files using JSON ..."

    If you look down to the "SEE ALSO" section of that documentation, you'll see a series of RFCs: RFC8259 obsoletes RFC7159, which in turn obsoletes RFC4627. I don't know if there's anything newer; in the following, I'm referencing information in RFC8259.

    "$data =~ s/.*?\[/\[/; ... seems to be a bit of a fudge!"

    As written, I would agree; however, it can be improved. From RFC8259:

        8.1.  Character Encoding

        JSON text exchanged between systems that are not part of a closed
        ecosystem MUST be encoded using UTF-8 [RFC3629].

        Previous specifications of JSON have not required the use of UTF-8
        when transmitting JSON text.  However, the vast majority of JSON-based
        software implementations have chosen to use the UTF-8 encoding,
        to the extent that it is the only encoding that achieves
        interoperability.

        Implementations MUST NOT add a byte order mark (U+FEFF) to the
        beginning of a networked-transmitted JSON text.  In the interests of
        interoperability, implementations that parse JSON texts MAY ignore
        the presence of a byte order mark rather than treating it as an
        error.

    So, the JSON you're sourcing (with a BOM) is technically invalid; however, it is acceptable to fix that yourself by ignoring (removing) the BOM.

    2. JSON Grammar ...

    Use this grammar specification to formulate your regex for handling BOM removal. Here's some example code; it's primarily intended to show technique, rather than being a specific solution. Enhance, extend, and otherwise adapt to suit your needs. If you're dealing with more than one of these "dodgy" JSON files, consider putting the logic in a module for reuse.

        #!/usr/bin/env perl

        use 5.010;
        use strict;
        use warnings;

        my @json_tests = (
            '', 'crap', '[]', '{}',
            "  []", "\t[]",
            "\x{feff}[]", qq<\x{feff}\t{"k":"v"}>,
        );

        for my $test (@json_tests) {
            _json_chars($test);
            my $clean_json = clean_json($test);
            _json_chars($clean_json);
            say '-' x 40;
        }

        sub clean_json {
            my ($json) = @_;

            return '' unless length $json;

            # Capture 1: optional leading BOM. Capture 2: the JSON text.
            # The s flag lets .* span newlines in multi-line JSON.
            state $re = qr{(?xs:
                ^
                ( (?: \x{feff}| ) )
                (
                    [\x{20}\x{09}\x{0a}\x{0d}]*
                    (?: false|null|true|\[|\{|" )
                    .*
                )
            )};

            if ($json =~ $re) {
                my ($bom, $text) = ($1, $2);
                if ($bom eq '') {
                    say "JSON good as is.";
                }
                else {
                    $json = $text;
                    say "JSON cleaned -- BOM removed.";
                }
            }
            else {
                say 'Invalid JSON! Nothing cleaned.';
            }

            return $json;
        }

        # Print each character of the string as a hex code point.
        sub _json_chars {
            my ($json) = @_;

            if (! length $json) {
                say 'Zero-length JSON';
            }
            else {
                say 'JSON chars: ', join '-', map sprintf('%x', ord), split //, $json;
            }

            return;
        }

    As you can see, I've included a number of tests. Add more to cover your use cases. Here's the output using what's currently there.

        Zero-length JSON
        Zero-length JSON
        ----------------------------------------
        JSON chars: 63-72-61-70
        Invalid JSON! Nothing cleaned.
        JSON chars: 63-72-61-70
        ----------------------------------------
        JSON chars: 5b-5d
        JSON good as is.
        JSON chars: 5b-5d
        ----------------------------------------
        JSON chars: 7b-7d
        JSON good as is.
        JSON chars: 7b-7d
        ----------------------------------------
        JSON chars: 20-20-5b-5d
        JSON good as is.
        JSON chars: 20-20-5b-5d
        ----------------------------------------
        JSON chars: 9-5b-5d
        JSON good as is.
        JSON chars: 9-5b-5d
        ----------------------------------------
        JSON chars: feff-5b-5d
        JSON cleaned -- BOM removed.
        JSON chars: 5b-5d
        ----------------------------------------
        JSON chars: feff-9-7b-22-6b-22-3a-22-76-22-7d
        JSON cleaned -- BOM removed.
        JSON chars: 9-7b-22-6b-22-3a-22-76-22-7d
        ----------------------------------------
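The suggestion of packaging the logic for reuse might look something like the sketch below. The name read_json_file is hypothetical, JSON::PP stands in for whichever JSON module is in use, and a temporary file replaces the real data source so the example runs on its own:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);
use File::Temp qw(tempfile);

# Hypothetical helper: read a JSON file, tolerating a leading UTF-8 BOM.
sub read_json_file {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "Cannot open '$path': $!";
    my $bytes = do { local $/; <$fh> };    # slurp without changing $/ globally
    close $fh;

    # Drop the BOM bytes if present; decode_json expects UTF-8 bytes.
    $bytes =~ s/\A\xEF\xBB\xBF//;

    return decode_json($bytes);
}

# Demonstration with a temporary file that starts with a BOM.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xEF\xBB\xBF", '[{"registered":true}]';
close $tmp_fh;

my $records = read_json_file($tmp_path);
print "Read ", scalar(@$records), " record(s)\n";
```

Working on raw bytes keeps the helper to a single decoding step. Reading through an :encoding(UTF-8) layer and stripping \x{FEFF} is an alternative, but the decoder would then need to accept characters rather than bytes.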

    — Ken
