http://qs321.pair.com?node_id=1113700

brettski has asked for the wisdom of the Perl Monks concerning the following question:

Perl Monks, I have a 13GB XML file that I'm trying to parse using XML::Twig. About halfway through the file (line 88,144,211 of 172,881,183), parsing stopped with this error:

not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/Parser.pm line 187 at .../bin/xml_parser.pl line 54 at .../bin/xml_parser.pl line 54

I used sed to pull the offending line from the XML file, along with the lines just before it, which I've pasted here prefixed with their line numbers. I think it's actually line 88144210 that caused my issue.

sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml

88144200: <es:vsDataReportConfigSearch>
88144201: <es:a1a2SearchThresholdRsrp>-110</es:a1a2SearchThresholdRsrp>
88144202: <es:a1a2SearchThresholdRsrq>-195</es:a1a2SearchThresholdRsrq>
88144203: <es:a2CriticalThresholdRsrp>-122</es:a2CriticalThresholdRsrp>
88144204: <es:a2CriticalThresholdRsrq>-195</es:a2CriticalThresholdRsrq>
88144205: <es:hysteresisA1A2SearchRsrp>30</es:hysteresisA1A2SearchRsrp>
88144206: <es:hysteresisA1A2SearchRsrq>150</es:hysteresisA1A2SearchRsrq>
88144207: <es:hysteresisA2CriticalRsrp>10</es:hysteresisA2CriticalRsrp>
88144208: <es:hysteresisA2CriticalRsrq>10</es:hysteresisA2CriticalRsrq>
88144209: <es:timeToTriggerA1Search>640</es:timeToTriggerA1Search>
88144210: <es
88144211: <es:lbUtranB1ThresholdRscpOffset>0</es:lbUtranB1ThresholdRscpOffset><es:lbQciProfileHandling>1</es:lbQciProfileHandling>
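The sed command prints only the raw lines, so the line-number prefixes in the listing had to be added separately. As a minimal sketch, assuming a POSIX awk is available, the same range could be printed with the numbers attached in one pass. The helper name print_range is hypothetical and not part of the original script.

```shell
# print_range FILE FIRST LAST (hypothetical helper, for illustration only):
# print lines FIRST..LAST of FILE, each prefixed with "<line number>: ",
# and stop reading as soon as LAST has been handled so that the rest of
# a very large file is never scanned.
print_range() {
    awk -v first="$2" -v last="$3" '
        NR >= first { printf "%d: %s\n", NR, $0 }
        NR >= last  { exit }
    ' "$1"
}
```

For the file in question this would be called as: print_range huge_xml_file.xml 88144200 88144211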

My problem is that I have no way of fixing these large XML files before I parse them; I have to parse them as is. Is there a way to ignore the lines that are not well-formed and have my script continue parsing the XML file? Here is my script:

use strict;
use XML::Twig;

my @moList = qw(
    es:vsDataExternalENodeBFunction
    es:vsDataTermPointToENB
    es:vsDataEUtranCellFDD
    es:vsDataEUtranFreqRelation
    es:vsDataEUtranCellRelation
    es:vsDataENodeBFunction
    es:vsDataExternalEUtranCellFDD
);

# Subroutine declarations
sub handle_mo;
sub usage;

# Parameter hash
my($key,$value,%param);
foreach my $item (@ARGV){
    my($key,$value) = split /=/, $item;
    $param{$key} = $value;
}

# Required parameters
if (!defined($param{"path"})){
    usage;
    die "No path defined\n";
}
if (!defined($param{"file"})){
    usage;
    die "No xml file defined\n";
}

my $path2xml = $param{"path"};
my $filename = $param{"file"};

my %handlers = map {$_ => \&handle_mo} @moList;
my $twig = XML::Twig->new( twig_roots => \%handlers);

$filename = $path2xml . "/" . $filename;
$twig->parsefile($filename);
my $root = $twig->root;
print "Parsing completed\n";

# Subroutines
sub usage {
    print "Usage:\n xml_parser path=<directory> file=<xml_file>\n";
}

sub handle_mo {
    my ( $t, $elt) = @_;
    print $elt->print, "\n";
    $t->purge;
}

Which I call using: ./xml_parser.pl path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml

I need to process these large XML files on a daily basis, and each one takes roughly 4 hours to parse on my current system. I never know when one of these malformed lines will appear, and they are rare. Since there is only one bad line in a 172,881,183-line XML file, is there a way for my parser to skip such lines rather than throwing the error?