But the 0x1c56920 "\357\273\277" at the start of the file is not removed by $str =~ s/^\x{feff}//;
...and I cannot be sure that 0x1c56920 "\357\273\277" will be at the start of all the JSON files. Or is it safe to assume that? I suspect not!
not removed with $str =~ s/^\x{feff}//;
compare how the behavior changes with open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File"; compared to the open line you currently use.
If you want perl to treat the bytes in the file as UTF-8, and thus be able to use s/^\x{feff}//, you have to tell perl to read the file as UTF-8¹. If you want perl to continue to read the file as a series of bytes (not using the UTF-8 encoding), then leave your open as-is, and have your regex instead search for the three bytes, either in octal with s/^\357\273\277// or in hex with s/^\xEF\xBB\xBF//.
#!perl
use 5.012; # strict, //
use warnings;
use Devel::Peek;
open my $fo, '>:raw', 'threebytes.bin';
print {$fo} "\xEF\xBB\xBF";
close $fo;
open my $fbytes, '<', 'threebytes.bin';
Dump($_ = <$fbytes>);
printf "length no-encoding: %d bytes\n", length($_);
printf "match no-encoding 3bytes? %s\n", m/^\xEF\xBB\xBF/ ? 'match' : 'nope';
printf "match no-encoding unicode? %s\n", m/^\x{FEFF}/ ? 'match' : 'nope';
close $fbytes;
open my $futf8, '<:encoding(UTF-8)', 'threebytes.bin';
Dump($_ = <$futf8>);
printf "length utf8: %d characters\n", length($_);
printf "match utf8 3bytes? %s\n", m/^\xEF\xBB\xBF/ ? 'match' : 'nope';
printf "match utf8 unicode? %s\n", m/^\x{FEFF}/ ? 'match' : 'nope';
close $futf8;
__END__
SV = PV(0x6ac038) at 0xb3ebe0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xb4d308 "\357\273\277"\0
CUR = 3
LEN = 81
SV = PV(0x6ac038) at 0xb3ebe0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x3442c98 "\357\273\277"\0 [UTF8 "\x{feff}"]
CUR = 3
LEN = 10
length no-encoding: 3 bytes
match no-encoding 3bytes? match
match no-encoding unicode? nope
length utf8: 1 characters
match utf8 3bytes? nope
match utf8 unicode? match
¹: or, not shown, use Encode::decode('UTF-8', $octets) from Encode
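For completeness, here is a minimal sketch of the Encode::decode route from the footnote. It assumes the file content was read as raw bytes; the JSON payload and the variable names are made up for illustration and are not taken from the original data.

```perl
#!perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they would arrive from a handle opened without an encoding layer:
# a UTF-8 BOM followed by a tiny JSON document (hypothetical sample data).
my $octets = "\xEF\xBB\xBF" . '{"name":"example"}';

# Decode the octets into characters; the three BOM bytes become
# the single character U+FEFF.
my $chars = decode('UTF-8', $octets);

# On the decoded string, the character-level regex matches and strips the BOM.
(my $clean = $chars) =~ s/^\x{FEFF}//;

printf "octets: %d, characters: %d, after strip: %d\n",
    length($octets), length($chars), length($clean);
```

The point of the sketch is only that the \x{FEFF} pattern applies after decoding, while the three-byte pattern applies before it.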
compare how the behavior changes with open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File"; compared to the open line you currently use.
I neglected to mention that I had previously read the file as UTF-8 in the way you suggest. But then decode_json complains about "wide characters" which I don't understand.
have your regex instead either search for the three bytes in octal with s/^\357\273\277//
That's the bit I needed!
I was getting thrown by the 0x1c56920 in 0x1c56920 "\357\273\277
That makes perfect sense except I don't understand what 0x1c56920 means in the output from the Devel::Peek Dump function.
Perl has two ways to represent strings,
- without UTF-8 flag as "octet streams" i.e. a list of bytes
- with UTF-8 flag as "characters" in the internal representation°
\x{FEFF} represents the Unicode character with the code point U+FEFF. Since Devel::Peek shows that the UTF8 flag is missing, your string is an octet stream, so this character can't be found in it when substituting.
You need to tell Perl how to interpret the data it reads; the fact that the bytes happen to be valid UTF-8 is not enough on its own for Perl to see them as a list of characters.
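A small sketch of the two representations, assuming nothing beyond core Perl. It uses the built-in utf8::decode and utf8::is_utf8 functions; the variable names are only illustrative.

```perl
#!perl
use strict;
use warnings;

# Three raw bytes, as read from a file opened without an encoding layer.
my $bytes = "\xEF\xBB\xBF";

# Copy, then decode the copy in place from UTF-8 bytes to characters.
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";

for my $pair ( [ bytes => $bytes ], [ chars => $chars ] ) {
    my ($label, $str) = @$pair;
    printf "%s: length=%d flag=%s matches U+FEFF: %s\n",
        $label, length($str),
        utf8::is_utf8($str) ? 'on'  : 'off',
        $str =~ /^\x{FEFF}/ ? 'yes' : 'no';
}
```

The same three bytes sit in memory in both cases; only the interpretation differs, which is why the regex outcome differs.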
The use utf8; in my example just told Perl to read the script's source and all embedded literal strings as utf8.
see Encode for more.
°) which is almost UTF-8, hence the flag is - for historical reasons - a bit of a misnomer
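On the "wide characters" complaint from decode_json mentioned earlier in the thread: as far as I know, decode_json expects UTF-8 encoded bytes and does the decoding itself, so handing it a string that was already decoded by an :encoding(UTF-8) layer is the likely cause. The sketch below uses the core JSON::PP module (an assumption, since the thread does not say which JSON module is in use) and made-up sample data to show the two consistent combinations.

```perl
#!perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical sample: UTF-8 BOM, then JSON in which "é" is the bytes C3 A9.
my $octets = "\xEF\xBB\xBF" . qq({"city":"Orl\xC3\xA9ans"});

# Route 1: keep bytes. Strip the BOM at the byte level, then let
# decode_json do the UTF-8 decoding.
(my $json_bytes = $octets) =~ s/^\xEF\xBB\xBF//;
my $from_bytes = decode_json($json_bytes);

# Route 2: already-decoded characters (as an :encoding(UTF-8) layer
# would deliver them). Strip U+FEFF, then use a decoder without the
# utf8 option, which expects characters rather than bytes.
my $chars = "\x{FEFF}" . qq({"city":"Orl\x{E9}ans"});
(my $json_chars = $chars) =~ s/^\x{FEFF}//;
my $from_chars = JSON::PP->new->decode($json_chars);

printf "route 1 city length: %d\n", length $from_bytes->{city};
printf "route 2 city length: %d\n", length $from_chars->{city};
```

Either route should give the same data structure; mixing them (decoded characters into decode_json) is what I would expect to trigger the error.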