http://qs321.pair.com?node_id=11149700


in reply to Re^6: Rogue character(s) at start of JSON file (BOM; dumping references)
in thread Rogue character(s) at start of JSON file

not removed with $str =~ s/^\x{feff}//;

Compare how the behavior changes when you use open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File"; instead of the open line you currently use.

If you want perl to treat the bytes in the file as UTF-8, and thus be able to use s/^\x{feff}//, you have to tell perl to read the file as UTF-8¹. If you want perl to continue reading the file as a series of bytes (not using the UTF-8 encoding), then leave your open as-is, and have your regex search for the three bytes instead, either in octal with s/^\357\273\277// or in hex with s/^\xEF\xBB\xBF//.

#!perl
use 5.012; # strict, //
use warnings;
use Devel::Peek;

open my $fo, '>:raw', 'threebytes.bin';
print {$fo} "\xEF\xBB\xBF";
close $fo;

open my $fbytes, '<', 'threebytes.bin';
Dump($_ = <$fbytes>);
printf "length no-encoding: %d bytes\n", length($_);
printf "match no-encoding 3bytes? %s\n", m/^\xEF\xBB\xBF/ ? 'match' : 'nope';
printf "match no-encoding unicode? %s\n", m/^\x{FEFF}/ ? 'match' : 'nope';
close $fbytes;

open my $futf8, '<:encoding(UTF-8)', 'threebytes.bin';
Dump($_ = <$futf8>);
printf "length utf8: %d characters\n", length($_);
printf "match utf8 3bytes? %s\n", m/^\xEF\xBB\xBF/ ? 'match' : 'nope';
printf "match utf8 unicode? %s\n", m/^\x{FEFF}/ ? 'match' : 'nope';
close $futf8;
__END__
SV = PV(0x6ac038) at 0xb3ebe0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xb4d308 "\357\273\277"\0
  CUR = 3
  LEN = 81
SV = PV(0x6ac038) at 0xb3ebe0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3442c98 "\357\273\277"\0 [UTF8 "\x{feff}"]
  CUR = 3
  LEN = 10
length no-encoding: 3 bytes
match no-encoding 3bytes? match
match no-encoding unicode? nope
length utf8: 1 characters
match utf8 3bytes? nope
match utf8 unicode? match


¹: or, not shown, use Encode::decode('UTF-8', $octets) from Encode
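
The footnoted alternative could look roughly like the sketch below: read (or otherwise obtain) raw octets, decode them by hand with Encode::decode, and only then strip the BOM as a single character. The strip_bom helper name and the in-memory $octets string are made up for illustration; they stand in for data read from a file opened without an encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw octets (as read without an :encoding layer),
# decodes them as UTF-8, and removes a leading U+FEFF if there is one.
sub strip_bom {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # bytes -> characters
    $chars =~ s/^\x{FEFF}//;                # after decoding, the BOM is one character
    return $chars;
}

# Illustrative input: the three BOM bytes followed by a tiny JSON text.
my $octets = "\xEF\xBB\xBF" . '{"a":1}';
my $text   = strip_bom($octets);

printf "octets: %d bytes, text: %d characters\n", length($octets), length($text);
```

The point of the sketch is only the ordering: the \x{FEFF} pattern can match after the decode step, while the three-byte pattern is the one to use before it.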

Re^8: Rogue character(s) at start of JSON file (BOM; dumping references)
by LanX (Saint) on Jan 19, 2023 at 19:19 UTC
    > If you want perl to continue to read the file as a series of bytes (not using the UTF-8 encoding)

    I'd expect JSON libraries to fail processing octets of undecoded UTF-8, so I'd say this approach is not optimal ...

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)

Re^8: Rogue character(s) at start of JSON file (BOM; dumping references)
by Bod (Parson) on Jan 19, 2023 at 19:47 UTC
    compare how the behavior changes with open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File"; compared to the open line you currently use.

    I neglected to mention that I had previously read the file as UTF-8 in the way you suggest. But then decode_json complains about "wide characters" which I don't understand.

    have your regex instead either search for the three bytes in octal with s/^\357\273\277//

    That's the bit I needed!
    I was getting thrown by the 0x1c56920 in 0x1c56920 "\357\273\277"

    That makes perfect sense except I don't understand what 0x1c56920 means in the output from the Devel::Peek Dump function.

      > decode_json

      • encode_json

        Converts the given Perl data structure to a UTF-8 encoded, binary string (that is, the string contains octets only) ...

      • decode_json

        ... The opposite of encode_json: expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text ...
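
      The documentation quoted above is likely the reason for the "wide character" complaint: decode_json wants UTF-8 octets, so a string that was already decoded by an :encoding(UTF-8) layer is the wrong kind of input. A minimal sketch of the difference follows, assuming JSON::PP (the thread does not say which JSON module is in use, and the exact error text may differ between modules). The sample JSON text is invented for illustration.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: stands in for whichever module provides decode_json

# The same JSON text in two forms.
my $chars  = qq({"name":"caf\x{E9} \x{20AC}5"});   # decoded characters, includes U+20AC
my $octets = $chars;
utf8::encode($octets);                              # now UTF-8 encoded bytes

# decode_json expects octets, so this works:
my $from_octets = decode_json($octets);

# Handing it decoded characters dies once a character above 0xFF is present:
my $ok    = eval { decode_json($chars); 1 };
my $error = $ok ? '' : $@;

# For already-decoded text, use an object without the utf8 option enabled:
my $from_chars = JSON::PP->new->decode($chars);

print "error: $error";
print "same result\n" if $from_octets->{name} eq $from_chars->{name};
```

      In other words, either read the file as raw bytes and pass them to decode_json, or read it through :encoding(UTF-8) and decode with an object that does not have utf8 enabled; mixing the two decodes twice.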

      Cheers Rolf

        Yes indeed it should be UTF-8 but we've already established that the file has "strange" (being polite) encoding...

        $| = 1;
        $/ = undef;

        print "Reading JSON file";
        open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File";
        my $data = <$fh>;
        close $fh;
        print "...done\n";

        $data =~ s/^\x{feff}//; # Strip off BOM

        print "Decoding JSON file";
        my $js = decode_json $data; # line 24
        print "...done\n";

        This takes about 10 minutes to read the 462 MB JSON file, then fails with: Decoding JSON fileWide character in subroutine entry at import.pl line 24

        Given the time taken to open the file as UTF-8 and the error, I suspect there is some nasty encoding hidden somewhere in this file.

        UPDATE

        Changing the encoding like so

        print "Reading JSON file";
        open my $fh, '<', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File";
        my $data = <$fh>;
        close $fh;
        print "...done\n";

        $data =~ s/^\357\273\277//; # Strip off BOM
        takes about 5 minutes to open the file but gives the strange error Decoding JSON fileKilled

        "Strange" because the error doesn't include at import.pl line 24!

        Another UPDATE

        It seems I might be running out of memory...380400 records in the JSON file seems to be too much...

      That makes perfect sense except I don't understand what 0x1c56920 means in the output from the Devel::Peek Dump function

      Looking at the Devel::Peek documentation, I see the "A simple scalar string" section, which describes most of the output for a string scalar. Based on what that example says about the other hex numbers in the same output, and the fact that the hex numbers are all in the same approximate value range, I believe that it's the internal address where the string is held, much like the other two are the addresses of the scalar's head and body.