
Perl Monks, I have a 13GB XML file that I'm trying to parse using XML::Twig. About halfway through the file (line 88,144,211 of 172,881,183), parsing stopped and threw the error:

not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/Parser.pm line 187 at .../bin/xml_parser.pl line 54 at .../bin/xml_parser.pl line 54

I used sed to pull the not well-formed line from the XML file, along with the lines immediately preceding it, which I've pasted here prefixed with their line numbers. I think it's actually line 88144210 that caused my issue.

sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml

88144200: <es:vsDataReportConfigSearch>
88144201: <es:a1a2SearchThresholdRsrp>-110</es:a1a2SearchThresholdRsrp>
88144202: <es:a1a2SearchThresholdRsrq>-195</es:a1a2SearchThresholdRsrq>
88144203: <es:a2CriticalThresholdRsrp>-122</es:a2CriticalThresholdRsrp>
88144204: <es:a2CriticalThresholdRsrq>-195</es:a2CriticalThresholdRsrq>
88144205: <es:hysteresisA1A2SearchRsrp>30</es:hysteresisA1A2SearchRsrp>
88144206: <es:hysteresisA1A2SearchRsrq>150</es:hysteresisA1A2SearchRsrq>
88144207: <es:hysteresisA2CriticalRsrp>10</es:hysteresisA2CriticalRsrp>
88144208: <es:hysteresisA2CriticalRsrq>10</es:hysteresisA2CriticalRsrq>
88144209: <es:timeToTriggerA1Search>640</es:timeToTriggerA1Search>
88144210: <es
88144211: <es:lbUtranB1ThresholdRscpOffset>0</es:lbUtranB1ThresholdRscpOffset><es:lbQciProfileHandling>1</es:lbQciProfileHandling>
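
The same lookup can be done in Perl instead of sed. The following is only a minimal sketch: the helper names error_line and context_lines are hypothetical, and it assumes the error text always contains "at line N, column M" as in the message quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the line number out of an expat-style error string.
# Returns undef when the message has no "at line N, column M" part.
sub error_line {
    my ($msg) = @_;
    return $msg =~ /at line (\d+), column \d+/ ? $1 : undef;
}

# Hypothetical helper: return lines ($target - $before) .. $target from $fh,
# each prefixed with its line number, and stop reading once $target is reached.
sub context_lines {
    my ($fh, $target, $before) = @_;
    my $first = $target - $before;
    my @out;
    while (my $line = <$fh>) {
        next if $. < $first;
        push @out, "$.: $line";
        last if $. >= $target;
    }
    return @out;
}

# Usage: <script> <xml_file> '<error message>'
if (@ARGV == 2) {
    my ($file, $msg) = @ARGV;
    my $n = error_line($msg) // die "No line number found in message\n";
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    # 11 lines of context matches the sed range 88144200..88144211.
    print context_lines($fh, $n, 11);
    close $fh;
}
```

Like the sed command, this stops reading as soon as the reported line has been printed, so it does not scan the second half of the file.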

The problem is that I have no way of fixing these large XML files before I parse them; I have to parse them as is. Is there a way to ignore the not well-formed lines and have my script continue parsing the XML file? Here is my script.

use strict;
use warnings;
use XML::Twig;

# Managed-object element names to extract
my @moList = qw(
  es:vsDataExternalENodeBFunction
  es:vsDataTermPointToENB
  es:vsDataEUtranCellFDD
  es:vsDataEUtranFreqRelation
  es:vsDataEUtranCellRelation
  es:vsDataENodeBFunction
  es:vsDataExternalEUtranCellFDD
);

# Subroutine declarations
sub handle_mo; sub usage;

# Parameter hash
my %param;
foreach my $item (@ARGV){
  my($key,$value) = split /=/, $item;
  $param{$key} = $value;
}
# Required parameters
if (!defined($param{"path"})){
  usage;
  die "No path defined\n";
}
if (!defined($param{"file"})){
  usage;
  die "No xml file defined\n";
}

my $path2xml = $param{"path"};
my $filename = $param{"file"};

my %handlers = map {$_ => \&handle_mo} @moList;
my $twig = XML::Twig->new( twig_roots => \%handlers );

$filename = $path2xml . "/" . $filename;
$twig->parsefile($filename);
my $root = $twig->root;
print "Parsing completed\n";

# Subroutines
sub usage {
  print "Usage:\n  xml_parser path=<directory> file=<xml_file>\n";
}
sub handle_mo {
  my ( $t, $elt) = @_;
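  # $elt->print already writes the element; printing its return value too
  # would append a stray value after every element.
  $elt->print;
  print "\n";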
  $t->purge;
}

Which I call using: ./xml_parser.pl path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml

I need to process these large XML files on a daily basis, and each one takes roughly 4 hours to parse on my current system. I never know when one of these malformed lines will appear, and they are rare. Since there is only one bad line in a 172,881,183-line XML file, I'm wondering: is there a way for my parser to ignore these lines rather than throwing the error?
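
For context, expat (which XML::Parser and XML::Twig sit on) treats a well-formedness error as fatal and does not resume past it, so "ignoring" a bad line most likely means dropping it before the parser sees it. Below is a minimal sketch of such a streaming pre-filter. The names looks_truncated and filter_xml are hypothetical, and the heuristic rests on an assumption that would need checking against real data: that these files never legitimately split a tag across lines and never contain a bare '<' inside comments or CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical heuristic: a line is suspicious when its last '<' is never
# closed by a '>' on the same line (for example a line that is just "<es").
# ASSUMPTION: the dump never legitimately splits a tag across lines.
sub looks_truncated {
    my ($line) = @_;
    my $last_open = rindex $line, '<';
    return 0 if $last_open < 0;
    return index($line, '>', $last_open) < 0 ? 1 : 0;
}

# Hypothetical filter: copy $in to $out line by line, skipping suspicious
# lines and reporting each one on STDERR. Returns the number of lines dropped.
sub filter_xml {
    my ($in, $out) = @_;
    my $dropped = 0;
    while (my $line = <$in>) {
        if (looks_truncated($line)) {
            warn "Dropping line $.: $line";
            $dropped++;
            next;
        }
        print {$out} $line;
    }
    return $dropped;
}

# Usage: <script> <xml_file> > filtered.xml
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    filter_xml($in, \*STDOUT);
    close $in;
}
```

The filter works one line at a time, so memory use stays flat on a 13GB input, and its output could be piped into the parser if the script read from STDIN (for example with $twig->parse(\*STDIN), untested here). Dropping a line is not guaranteed to leave a well-formed document: if the truncation also swallowed closing tags, the parser may still fail later with a mismatched-tag error.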

In reply to Ignoring not well-formed (invalid token) errors by brettski
