I've been using XML::Fast to process XML files successfully for some time. However, the code was moved to a newer machine and has stopped working in some circumstances. One difference between the machines is the XML::Fast version: 0.11 on the original machine (working) and 0.17 on the new machine (not working). Upgrading to 0.17 on the old machine, with no other changes, makes it stop working there too.
The error I'm getting is:
Failed to encode 2017-9-21T08-49-17.XML to JSON for indexing - malformed or illegal unicode character in string [�ndby IF], cannot convert to JSON at xx.pm line 1827.
The XML file comes from a third party and is ISO-8859-1 encoded. The part it is complaining about is <Value>Br<F8>ndby IF</Value>. A cut-down version of the XML which fails is:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xx feedtype="delta"><Timestamp CreatedTime="2017-09-21T06:49:17" TimeZone="GMT"/><Value>Brøndby IF</Value></xx>
The code which is now failing is:
use Cpanel::JSON::XS;
use XML::Fast;

sub esIndexFile2 {
    my ($self, $file) = @_;

    my $xml = do {
        local $/ = undef;
        open (my $fh, "<:encoding(ISO-8859-1)", $file) or die "Failed to open $file - $!";
        <$fh>;
    };
    $xml =~ s/^(?:.*\n)//; # remove first line - the encoding line

    my $hash;
    eval {
        $hash = xml2hash $xml;
    };
    if (my $ev = $@) {
        warn("Failed to parse file $file for indexing - $@ - SKIPPING");
        return;
    }

    my $json = eval {
        encode_json($hash); # <------------ fails here
    };
    if (my $ev = $@) {
        $self->logwarn("Failed to encode $file to JSON for indexing - $@ - SKIPPING");
        return;
    }
    return 1;
}
The Changes file for XML::Fast is not very helpful. I have discovered that adding utf8decode => 1 to the xml2hash call makes it work again, but I don't really understand why. Am I doing anything wrong here? What might have changed in XML::Fast to cause this?
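For reference, here is a minimal, self-contained sketch of that workaround. It writes the cut-down document (without the Timestamp element) to a temporary file as ISO-8859-1 bytes, reads it back the same way as esIndexFile2, and passes utf8decode => 1 to xml2hash. The temporary file and the variable names are only for illustration, and the sketch shows the call that works for me under 0.17; it does not explain why the option is needed.

```perl
use strict;
use warnings;
use Cpanel::JSON::XS;
use File::Temp qw(tempfile);
use XML::Fast;

# Illustration only: write the cut-down feed as raw ISO-8859-1 bytes,
# where "\xF8" is the single byte for the o-slash in "Brondby".
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n},
                qq{<xx feedtype="delta"><Value>Br\xF8ndby IF</Value></xx>\n};
close $tmp_fh or die "Failed to close $file - $!";

# Read the file exactly as the original sub does: decode from ISO-8859-1.
my $xml = do {
    local $/ = undef;
    open(my $fh, "<:encoding(ISO-8859-1)", $file)
        or die "Failed to open $file - $!";
    <$fh>;
};
$xml =~ s/^(?:.*\n)//;    # remove first line - the encoding line

# The workaround from the post: ask xml2hash to decode values as UTF-8.
my $hash = xml2hash($xml, utf8decode => 1);

# encode_json expects character strings and returns UTF-8 encoded JSON.
my $json = encode_json($hash);
print "$json\n";
```

Dropping utf8decode => 1 from the xml2hash call should give the variant that fails for me with 0.17.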