PerlMonks  

Re: Rogue character(s) at start of JSON file (BOM; dumping references)

by LanX (Saint)
on Jan 17, 2023 at 00:08 UTC ( [id://11149620]=note )


in reply to Rogue character(s) at start of JSON file

> \x{feff}

It's a BOM and I seem to remember we had the discussion before.

Purging exactly this with a regex anchored at the start of the file° is fine.

> s/.*?\[/\[/

Better: s/^\x{feff}//

But don't include the \[ in the pattern; a JSON stream doesn't need to start with an array.
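A minimal sketch of the difference between the two substitutions, assuming the JSON text is already a decoded character string. The input string and the variable names are made up for illustration; JSON::PP is used only because it ships with Perl.

```perl
use strict;
use warnings;
use JSON::PP;    # core module, used here only to check the result parses

# Hypothetical input: a JSON object (not an array) preceded by a BOM.
my $json = "\x{FEFF}" . '{"items":[1,2,3]}';

# The pattern from the question cuts everything up to the first '[',
# which here also removes the opening of the object.
(my $greedy_fix = $json) =~ s/.*?\[/\[/;

# Removing only the BOM leaves the JSON text intact.
(my $bom_fix = $json) =~ s/^\x{feff}//;

my $data = JSON::PP->new->decode($bom_fix);

print "kept: $bom_fix\n";
print "lost: $greedy_fix\n";
```

The first line printed is the complete object; the second has lost `{"items":` and is no longer valid JSON.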

> It instead it represents them in terms of ones it has previously created. Is that about right?

That's how Data::Dumper handles repeated references and circular structures, so that the original data can be accurately reproduced after evaluation.

EDIT

from the DESCRIPTION:

    ... Handles self-referential structures correctly.

    The return value can be evaled to get back an identical copy of the original reference structure. (Please do consider the security implications of eval'ing code from untrusted sources!)
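A small sketch of what that looks like in practice. The structure below is hypothetical; `$Data::Dumper::Purity` is set because, as far as I know, the default output is not guaranteed to rebuild shared references when evaled, while the "pure" output adds the extra assignment statements needed for that.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order in the dump
$Data::Dumper::Purity   = 1;   # emit extra statements so eval rebuilds shared refs

# Hypothetical structure: the same array is referenced twice.
my $shared = [ 1, 2 ];
my $data   = { a => $shared, b => $shared };

my $dumped = Dumper($data);
print $dumped;    # the second occurrence is expressed via $VAR1->{'a'}

# Evaluate the dump (only ever do this with trusted input).
my $VAR1;
eval $dumped;
die $@ if $@;

print "sharing preserved\n" if $VAR1->{a} == $VAR1->{b};
```

The dump refers back to `$VAR1->{'a'}` instead of printing the array a second time, which is the "in terms of ones it has previously created" behaviour asked about.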

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery

update

°) NB: I don't recommend deleting BOMs anywhere else; they could have a meaning. And if they really caused an error, this should be investigated thoroughly. A BOM at the file's start (and only there) is often added automatically by programs.
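To illustrate the footnote, here is a sketch that contrasts an anchored removal with a global one. The JSON text is invented for the example and assumed to be a decoded character string.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical payload: U+FEFF appears as real data inside a string,
# and the JSON text as a whole also carries a leading BOM.
my $payload = qq({"BOM":"\x{FEFF}","name":"x"});
my $text    = "\x{FEFF}" . $payload;

(my $leading_only = $text) =~ s/\A\x{FEFF}//;   # anchored: first character only
(my $everywhere   = $text) =~ s/\x{FEFF}//g;    # global: also hits the data

my $kept = JSON::PP->new->decode($leading_only);
my $lost = JSON::PP->new->decode($everywhere);

printf "anchored: %d character(s) left in the BOM value\n", length $kept->{BOM};
printf "global:   %d character(s) left in the BOM value\n", length $lost->{BOM};
```

Both variants parse, but only the anchored one still has the U+FEFF character in the value; the global one silently turns it into an empty string.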

Replies are listed 'Best First'.
Re^2: Rogue character(s) at start of JSON file (BOM; dumping references)
by karlgoethebier (Abbot) on Jan 20, 2023 at 11:56 UTC
    «…they could have a meaning…»

    I'm not so sure about that anymore. At least in this case. More or less by accident I stumbled across this. For convenience the link to the RFC. Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      > > …they could have a meaning…

      > I'm not so sure about that anymore.

      Simpl(-istic) example: imagine a data structure

      %unicode = ( ..., "BOM" => "\x{FEFF}", ... )

      and transfer it as JSON.

      Removing all "\x{FEFF}" would wreck the data.

      While a BOM at the string's start is illegal JSON, it's not uncommon.

      So removing just a leading BOM, before trying to convert the JSON, is safe.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

Re^2: Rogue character(s) at start of JSON file (BOM; dumping references)
by Bod (Parson) on Jan 17, 2023 at 20:55 UTC
    A BOM at the file's start (and only there) is often added automatically by many programs

    As used by UK Government departments, it seems!

      Misguidedly I'd say

      JSON is always supposed to be UTF-8 as far as I know.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

        Misguidedly I'd say

        Yes indeed...

        Having seen something of their database structure from where the JSON is derived, it looks like their problems are more deep-rooted than just a rogue BOM character. But hey, that's what I've got, and I have to work with it. Fortunately, I only need a small slice of the data and I am not going to be running any queries against it.

Re^2: Rogue character(s) at start of JSON file (BOM; dumping references)
by Bod (Parson) on Jan 19, 2023 at 15:54 UTC
    Better: s/^\x{feff}//

    I agree that this option seems better...but it doesn't work!

    print "$data\n";
    $data =~ s/^\x{feff}//; # Strip off BOM
    print "$data\n";

    This prints out two identical lines, both starting [{"date_of_extract":"2023-01-16T00:00:00"

    Could it be that the BOM character is not FEFF despite the error?

    malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "\x{feff}[{"date_of_e...")

      Something else is wrong.

      plz use Devel::Peek to find out if it's properly encoded and show us the result here.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

        plz use Devel::Peek to find out if it's properly encoded and show us the result here

        I've not come across Devel::Peek before, let alone used it - so please bear with me if this is not right...

        #!/usr/bin/perl
        use CGI::Carp qw(fatalsToBrowser);
        use strict;
        use warnings;
        use Site::Utils;
        use JSON;
        use Devel::Peek;

        print "Content-type: text/plain\n\n";

        open my $fh, '<', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File";
        my $data = <$fh>;
        print "$data\n\n";

        open STDERR, ">", 'output.txt' or die $!;
        print STDERR "Before\n";
        Dump ($data);
        $data =~ s/^\x{feff}//; # Strip off BOM
        print STDERR "\n\nAfter\n";
        Dump ($data);
        exit;

        This gives this output...

        Before
        SV = PV(0x1569cf0) at 0x15877a0
          REFCNT = 1
          FLAGS = (PADMY,POK,pPOK)
          PV = 0x1c56920 "\357\273\277[{\"date_of_extract\":\"2023-01-16T00:00:00\",\"organisation_number\":1,\"registered_charity_number\":200027,\"linked_charity_number\":1,\"charity_name\":\"POTTERNE MISSION ROOM AND TRUST\",\"charity_type\":null,\"charity_registration_status\":\"Removed\",\"date_of_registration\":\"1962-05-17T00:00:00\",\"date_of_removal\":\"2014-04-16T00:00:00\",\"charity_reporting_status\":null,\"latest_acc_fin_period_start_date\":null,\"latest_acc_fin_period_end_date\":null,\"latest_income\":null,\"latest_expenditure\":null,\"charity_contact_address1\":null,\"charity_contact_address2\":null,\"charity_contact_address3\":null,\"charity_contact_address4\":null,\"charity_contact_address5\":null,\"charity_contact_postcode\":null,\"charity_contact_phone\":null,\"charity_contact_email\":null,\"charity_contact_web\":null,\"charity_company_registration_number\":null,\"charity_insolvent\":false,\"charity_in_administration\":false,\"charity_previously_excepted\":null,\"charity_is_cdf_or_cif\":null,\"charity_is_cio\":null,\"cio_is_dissolved\":null,\"date_cio_dissolution_notice\":null,\"charity_activities\":null,\"charity_gift_aid\":null,\"charity_has_land\":null}\r\n"\0
          CUR = 1082
          LEN = 1122

        After
        SV = PV(0x1569cf0) at 0x15877a0
          REFCNT = 1
          FLAGS = (PADMY,POK,pPOK)
          PV = 0x1c56920 "\357\273\277[{\"date_of_extract\":\"2023-01-16T00:00:00\",\"organisation_number\":1,\"registered_charity_number\":200027,\"linked_charity_number\":1,\"charity_name\":\"POTTERNE MISSION ROOM AND TRUST\",\"charity_type\":null,\"charity_registration_status\":\"Removed\",\"date_of_registration\":\"1962-05-17T00:00:00\",\"date_of_removal\":\"2014-04-16T00:00:00\",\"charity_reporting_status\":null,\"latest_acc_fin_period_start_date\":null,\"latest_acc_fin_period_end_date\":null,\"latest_income\":null,\"latest_expenditure\":null,\"charity_contact_address1\":null,\"charity_contact_address2\":null,\"charity_contact_address3\":null,\"charity_contact_address4\":null,\"charity_contact_address5\":null,\"charity_contact_postcode\":null,\"charity_contact_phone\":null,\"charity_contact_email\":null,\"charity_contact_web\":null,\"charity_company_registration_number\":null,\"charity_insolvent\":false,\"charity_in_administration\":false,\"charity_previously_excepted\":null,\"charity_is_cdf_or_cif\":null,\"charity_is_cio\":null,\"cio_is_dissolved\":null,\"date_cio_dissolution_notice\":null,\"charity_activities\":null,\"charity_gift_aid\":null,\"charity_has_land\":null}\r\n"\0
          CUR = 1082
          LEN = 1122

        Does that help?

        UPDATE:

        I've realised that because I am reading just the first line of the JSON file, it is malformed as it doesn't have the trailing ']' character. However, I have added $data .= ']'; to manually add it back on. This still doesn't solve the BOM issue at the start of the file but it might complicate testing...
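A possible reading of the dump above, offered as a sketch and not as the thread's conclusion: the PV starts with the three octets \357\273\277 and the FLAGS line shows no UTF8 flag, which suggests the string holds undecoded bytes. In that case the single character \x{feff} is not in the string at all, only its UTF-8 encoding is. The sample data and variable names below are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the first line of the file, as raw octets.
# The UTF-8 encoding of U+FEFF is the three bytes EF BB BF (\357\273\277).
my $bytes = "\xEF\xBB\xBF" . '[{"organisation_number":1}]';

# On undecoded bytes the character pattern cannot match ...
(my $unchanged = $bytes) =~ s/^\x{feff}//;

# ... but the byte sequence can be removed directly,
(my $stripped_bytes = $bytes) =~ s/^\xEF\xBB\xBF//;

# or the data can be decoded first, after which \x{feff} matches.
my $chars = decode('UTF-8', $bytes);
(my $stripped_chars = $chars) =~ s/^\x{feff}//;

print "character pattern on bytes changed nothing\n" if $unchanged eq $bytes;
print "byte pattern:  $stripped_bytes\n";
print "after decode:  $stripped_chars\n";
```

Opening the file with an `:encoding(UTF-8)` layer would have the same effect as the explicit `decode` call here; which variant fits depends on whether the JSON decoder is later given bytes or characters.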
