Ignoring not well-formed (invalid token) errors
by brettski (Initiate) on Jan 19, 2015 at 05:03 UTC
brettski has asked for the wisdom of the Perl Monks concerning the following question:
Perl Monks, I have a 13GB XML file that I'm trying to parse using XML::Twig. About halfway through the file (line 88,144,211 of 172,881,183), parsing stopped and threw this error:
not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/Parser.pm line 187 at .../bin/xml_parser.pl line 54 at .../bin/xml_parser.pl line 54
I used sed to pull the offending line from the XML file, along with the preceding 11 lines, which I've pasted here prefixed with their line numbers. I think it's actually line 88144210 that caused my issue.
sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml
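The sed call above prints the raw lines without their numbers. A minimal sketch of a variant that prefixes each line with its number, assuming a POSIX shell and awk are available; the helper name `print_numbered_range` is hypothetical and not part of my original script:

```shell
# Hypothetical helper (an illustration, not from the original script):
# print lines START..END of FILE, each prefixed with its line number,
# and stop reading the file once END has been printed.
print_numbered_range() {
    start=$1
    end=$2
    file=$3
    awk -v s="$start" -v e="$end" '
        NR >= s { printf "%d: %s\n", NR, $0 }   # emit "lineno: text"
        NR >= e { exit }                        # avoid scanning the rest
    ' "$file"
}
```

Calling it as `print_numbered_range 88144200 88144211 huge_xml_file.xml` should cover the same range as the sed command, and the early `exit` matters on a file this size because it avoids reading the remaining lines.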
My problem is that I have no way of fixing these large XML files before I parse them; I have to parse them as is. Is there a way to skip the lines that are not well-formed and have my script continue parsing the rest of the file? Here is my script.
I call it using: ./xml_parser.pl path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml
I need to process these large XML files daily, and each one takes roughly 4 hours to parse on my current system. These malformed lines are rare, and I never know when one will appear. Since there is only one bad line in a 172,881,183-line XML file, is there a way for my parser to skip such lines rather than throwing the error?