Perl Monks, I have a 13GB XML file that I'm trying to parse with XML::Twig. About halfway through the file (line 88,144,211 of 172,881,183), parsing stopped with this error:
not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/Parser.pm line 187
at .../bin/xml_parser.pl line 54
at .../bin/xml_parser.pl line 54
I used sed to pull the offending line from the XML file, along with the preceding 11 lines, which I've pasted here prefixed with their line numbers. I think it's actually line 88144210 (a truncated <es tag) that caused my issue.
sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml
88144200: <es:vsDataReportConfigSearch>
88144201: <es:a1a2SearchThresholdRsrp>-110</es:a1a2SearchThresholdRsrp>
88144202: <es:a1a2SearchThresholdRsrq>-195</es:a1a2SearchThresholdRsrq>
88144203: <es:a2CriticalThresholdRsrp>-122</es:a2CriticalThresholdRsrp>
88144204: <es:a2CriticalThresholdRsrq>-195</es:a2CriticalThresholdRsrq>
88144205: <es:hysteresisA1A2SearchRsrp>30</es:hysteresisA1A2SearchRsrp>
88144206: <es:hysteresisA1A2SearchRsrq>150</es:hysteresisA1A2SearchRsrq>
88144207: <es:hysteresisA2CriticalRsrp>10</es:hysteresisA2CriticalRsrp>
88144208: <es:hysteresisA2CriticalRsrq>10</es:hysteresisA2CriticalRsrq>
88144209: <es:timeToTriggerA1Search>640</es:timeToTriggerA1Search>
88144210: <es
88144211: <es:lbUtranB1ThresholdRscpOffset>0</es:lbUtranB1ThresholdRscpOffset><es:lbQciProfileHandling>1</es:lbQciProfileHandling>
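The sed command above prints the lines without numbers, so the prefixes have to be added separately. A minimal Perl sketch that prints a line range with its numbers in one pass might look like the following. The helper name numbered_lines and the script name in the usage comment are hypothetical and not part of the parser script.

```perl
use strict;
use warnings;

# Return lines $first..$last of $file, each prefixed with "N: ".
# numbered_lines is a hypothetical helper name used for illustration.
sub numbered_lines {
    my ($file, $first, $last) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    my @out;
    while (my $line = <$fh>) {
        next if $. < $first;                  # $. is the current line number
        push @out, sprintf("%d: %s", $., $line);
        last if $. >= $last;                  # stop early, like sed's "q"
    }
    close $fh;
    return @out;
}

# Hypothetical usage: ./print_range.pl huge_xml_file.xml 88144200 88144211
print numbered_lines(@ARGV) if @ARGV == 3;
```

Because the loop exits as soon as the last requested line has been read, it avoids scanning the remaining half of the file, just as the trailing q does in the sed command.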
My problem is that I have no way of fixing these large XML files before I parse them; I have to parse them as is. Is there a way to skip the lines that are not well-formed and have my script continue parsing the rest of the file? Here is my script:
use strict;
use warnings;
use XML::Twig;

# Elements to extract from the XML file
my @moList = qw(
    es:vsDataExternalENodeBFunction
    es:vsDataTermPointToENB
    es:vsDataEUtranCellFDD
    es:vsDataEUtranFreqRelation
    es:vsDataEUtranCellRelation
    es:vsDataENodeBFunction
    es:vsDataExternalEUtranCellFDD
);

# Subroutine declarations
sub handle_mo; sub usage;

# Parameter hash, filled from key=value command-line arguments
my %param;
foreach my $item (@ARGV) {
    my ($key, $value) = split /=/, $item, 2;
    $param{$key} = $value;
}

# Required parameters
if (!defined($param{"path"})) {
    usage;
    die "No path defined\n";
}
if (!defined($param{"file"})) {
    usage;
    die "No xml file defined\n";
}
my $path2xml = $param{"path"};
my $filename = $param{"file"};

# Call handle_mo for every element named in @moList
my %handlers = map { $_ => \&handle_mo } @moList;
my $twig = XML::Twig->new( twig_roots => \%handlers );
$filename = $path2xml . "/" . $filename;
$twig->parsefile($filename);
my $root = $twig->root;
print "Parsing completed\n";

# Subroutines
sub usage {
    print "Usage:\n xml_parser path=<directory> file=<xml_file>\n";
}

sub handle_mo {
    my ( $t, $elt ) = @_;
    $elt->print;    # print() writes the element to STDOUT itself
    print "\n";
    $t->purge;      # free the memory used by everything parsed so far
}
Which I call using:
./xml_parser.pl path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml
I need to process these large XML files daily, and each one takes roughly 4 hours to parse on my current system. Malformed lines like this are rare, and I never know when one will appear. Since there is only one bad line in a 172,881,183-line file, is there a way for my parser to skip such lines rather than dying with an error?