brettski has asked for the wisdom of the Perl Monks concerning the following question:

Perl Monks, I have a 13GB XML file that I'm trying to parse using XML::Twig. After getting about halfway through the file (88,144,211 lines of 172,881,183), the parsing stopped and threw this error:

not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/ line 187 at .../bin/ line 54 at .../bin/ line 54

I used sed to pull the line reported as not well-formed from the XML file, along with the preceding lines, which I've pasted here prefixed with their line numbers. I think it's actually line 88144210 that caused my issue.

sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml

88144200: <es:vsDataReportConfigSearch>
88144201: <es:a1a2SearchThresholdRsrp>-110</es:a1a2SearchThresholdRsrp>
88144202: <es:a1a2SearchThresholdRsrq>-195</es:a1a2SearchThresholdRsrq>
88144203: <es:a2CriticalThresholdRsrp>-122</es:a2CriticalThresholdRsrp>
88144204: <es:a2CriticalThresholdRsrq>-195</es:a2CriticalThresholdRsrq>
88144205: <es:hysteresisA1A2SearchRsrp>30</es:hysteresisA1A2SearchRsrp>
88144206: <es:hysteresisA1A2SearchRsrq>150</es:hysteresisA1A2SearchRsrq>
88144207: <es:hysteresisA2CriticalRsrp>10</es:hysteresisA2CriticalRsrp>
88144208: <es:hysteresisA2CriticalRsrq>10</es:hysteresisA2CriticalRsrq>
88144209: <es:timeToTriggerA1Search>640</es:timeToTriggerA1Search>
88144210: <es
88144211: <es:lbUtranB1ThresholdRscpOffset>0</es:lbUtranB1ThresholdRscpOffset><es:lbQciProfileHandling>1</es:lbQciProfileHandling>

The problem is that I have no way of fixing these large XML files before I parse them; I have to parse them as is. Is there a way to ignore the lines that are not well-formed and have my script continue parsing the XML file? Here is my script.

use strict;
use XML::Twig;

my @moList = qw(es:vsDataExternalENodeBFunction es:vsDataTermPointToENB
                es:vsDataEUtranCellFDD es:vsDataEUtranFreqRelation
                es:vsDataEUtranCellRelation es:vsDataENodeBFunction
                es:vsDataExternalEUtranCellFDD);

# Subroutine declarations
sub handle_mo;
sub usage;

# Parameter hash
my($key,$value,%param);
foreach my $item (@ARGV){
    my($key,$value) = split /=/, $item;
    $param{$key} = $value;
}

# Required parameters
if (!defined($param{"path"})){
    usage;
    die "No path defined\n";
}
if (!defined($param{"file"})){
    usage;
    die "No xml file defined\n";
}

my $path2xml = $param{"path"};
my $filename = $param{"file"};

my %handlers = map {$_ => \&handle_mo} @moList;
my $twig = new XML::Twig( twig_roots => \%handlers);

$filename = $path2xml . "/" . $filename;
$twig->parsefile($filename);
my $root = $twig->root;
print "Parsing completed\n";

# Subroutines
sub usage {
    print "Usage:\n xml_parser path=<directory> file=<xml_file>\n";
}

sub handle_mo {
    my ( $t, $elt) = @_;
    print $elt->print, "\n";
    $t->purge;
}

Which I call using: ./ path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml

I need to process these large XML files on a daily basis, and each one takes roughly 4 hours to parse on my current system. These malformed lines are rare, and I never know when one will appear. Since there is only one bad line in a 172,881,183-line XML file, is there a way for my parser to ignore such lines rather than throwing the error?

Replies are listed 'Best First'.
Re: Ignoring not well-formed (invalid token) errors
by bitingduck (Chaplain) on Jan 19, 2015 at 06:55 UTC

    Do you know if the bad XML is always of the same structure so that its removal can be automated? Part of your problem is that XML parsers are supposed to die horribly if they encounter badly formed XML. There are XML-ish things out there that have their own parsers as a result of this.

    XML::Twig bends the "die on bad XML" rule by offering calls (safe_parse and safe_parsefile) that wrap the parse in an eval, return false on failure, and leave the error message in $@ rather than dying, so that you might be able to recover from the failure with an automated fix, e.g.

    if (!$twig->safe_parsefile($filename)) { handle_errors($@); }

    where handle_errors() checks the message and then runs some sort of preprocessor to remove the offending lines, then calls safe_parsefile() again. It's a bit of a pain because it means you have to re-parse everything that had already parsed successfully, but it's better than nothing.
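    A minimal sketch of that retry idea, under some assumptions: the helper names strip_truncated_lines and collect_with_retry are hypothetical, the bad lines are assumed to consist only of a truncated "<es", and the cleaned copy is written next to the original as "$file.cleaned". It collects element text in memory only to keep the example short; a real run over a 13GB file would write results out as it goes.

    ```perl
    use strict;
    use warnings;
    use XML::Twig;

    # Hypothetical helper: copy $in to $out, dropping lines that hold only a
    # truncated "<es" tag (the pattern seen in the question).
    sub strip_truncated_lines {
        my ($in, $out) = @_;
        open my $src, '<', $in  or die "Cannot read $in: $!";
        open my $dst, '>', $out or die "Cannot write $out: $!";
        while (my $line = <$src>) {
            print {$dst} $line unless $line =~ /^\s*<es\s*$/;
        }
        close $dst or die "Cannot close $out: $!";
        close $src;
        return $out;
    }

    # Hypothetical helper: parse $file and collect the text of every $tag element.
    # On a parse error, clean the file once and start again from the top.
    sub collect_with_retry {
        my ($file, $tag) = @_;
        my @found;
        my $new_twig = sub {
            @found = ();    # discard partial results from a failed attempt
            return XML::Twig->new(
                twig_roots => {
                    $tag => sub {
                        my ($t, $elt) = @_;
                        push @found, $elt->text;
                        $t->purge;
                    },
                },
            );
        };

        # safe_parsefile returns false on failure and leaves the message in $@
        return @found if $new_twig->()->safe_parsefile($file);
        warn "First pass failed, retrying on a cleaned copy: $@";

        my $cleaned = strip_truncated_lines($file, "$file.cleaned");
        $new_twig->()->parsefile($cleaned);    # dies if the copy is still broken
        return @found;
    }
    ```

    Note that the line number in the error message is not used to locate the bad line: in the question the parser reported line 88144211 while the damage was on 88144210, so matching on a pattern is likely the safer choice.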

    You might also experiment with using one of the HTML parsers to extract what you want. They're not likely to be as good with enormous files, but they should be more tolerant of bad behavior.

    And if you have a way to contact whoever is generating the files, you might point out that some of them are badly formed and that they might have a bug in their xml generator. If anyone else is using the files, they're probably running into similar problems.

      If the errors you're seeing fit on single lines and follow some pattern, as in the example you've shown, it might suffice to filter the bad file through something simple like
      perl -ne 'print $_ if not m/^<es\s*$/' <bad_input.file >corrected_input.file
      That ignores the fact that the input is XML at all, and as such could be fast enough to be usable, even if it is an extra step.
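      Since the bad lines are rare, it may help to know which ones were thrown away. The following is a sketch of the same filter as a small function that also records what it dropped; the name filter_stream is hypothetical, and the pattern is passed in so it can be adjusted if other kinds of damage turn up.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: copy lines from $in to $out, skipping any that match
      # $bad_line, and return the line numbers that were skipped.
      sub filter_stream {
          my ($in, $out, $bad_line) = @_;
          my @dropped;
          my $line_no = 0;
          while (my $line = <$in>) {
              $line_no++;
              if ($line =~ $bad_line) {
                  push @dropped, $line_no;
                  next;
              }
              print {$out} $line;
          }
          return @dropped;
      }
      ```

      Used as a command-line filter it would be called as filter_stream(\*STDIN, \*STDOUT, qr/^<es\s*$/), with the returned line numbers sent to STDERR so they do not end up in the corrected file.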


        If you do it like:

        perl -ne'print $_ if ! /^<es\s*$/' huge.xml | perl -

        Then you don't even have to wait for the huge file to be read twice. The time required could well be almost the same as it would be without the filter. Since the filter code likely can run faster than the XML parsing code, the difference in run-time could just be the insignificant time it takes to filter one buffer's worth of XML. It would likely take a bit more CPU (probably less than 2x), but I doubt processing a huge XML file is usually CPU-bound on most systems.
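        For the pipe to work, the parsing script has to read from a filehandle instead of calling parsefile() on a path. One way this might look, as a sketch only: the helper name parse_filtered is hypothetical, the filter is started from inside the script with a list-form pipe open (which assumes a Unix-like system), and the pattern is the one from the one-liner above.

        ```perl
        use strict;
        use warnings;
        use XML::Twig;

        # Hypothetical helper: parse $file through a child perl process that drops
        # the truncated "<es" lines, calling $on_element for each $tag element.
        sub parse_filtered {
            my ($file, $tag, $on_element) = @_;

            # $^X is the running perl; the list form of open avoids the shell.
            open my $xml, '-|', $^X, '-ne', 'print unless /^<es\s*$/', $file
                or die "Cannot start filter for $file: $!";

            my $twig = XML::Twig->new(
                twig_roots => {
                    $tag => sub {
                        my ($t, $elt) = @_;
                        $on_element->($elt);
                        $t->purge;    # release what has already been handled
                    },
                },
            );
            $twig->parse($xml);    # parse() takes an open filehandle or a string
            close $xml or die "Filter exited abnormally for $file";
            return;
        }
        ```

        With the filter on the command line instead, the equivalent change in the script would be $twig->parse(\*STDIN) in place of $twig->parsefile($filename).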

        Though, brettski didn't seem to find even that proposal acceptable when I proposed it in chat around the time that the root node was posted.

        - tye        

Re: Ignoring not well-formed (invalid token) errors
by Laurent_R (Canon) on Jan 19, 2015 at 07:24 UTC
    If the error is always showing the same pattern, maybe you could preprocess the file to remove the offending line(s). I know that the idea of preprocessing 13 GB is not very attractive, but sometimes you have to bite the bullet.

    Je suis Charlie.

      If it's looking for a simple pattern it might be doable in a reasonable amount of time. There are extractors for the Open Directory Project and Wikipedia dumps, both of which are in the many GB range, that can process very quickly, even on relatively old machines. I was pulling all of the music content out of ODP in less than a few minutes some 10 years ago on a mac laptop that was reasonably current then, and I don't recall how long it took to pull all the music topics out of Wikipedia, but I think it was quite reasonable.

Re: Ignoring not well-formed (invalid token) errors
by Anonymous Monk on Jan 19, 2015 at 08:07 UTC
    # Subroutine declarations
    sub handle_mo;
    sub usage;

    If you call the subroutines with parentheses, like usage(); then you won't need the forward declarations.
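    A small illustration of the point, as a sketch: under "use strict", a bare "usage;" that appears before the sub is defined or declared is rejected as a bareword, while "usage()" is always parsed as a function call. For the sake of the example, usage here returns its text instead of printing it.

    ```perl
    use strict;
    use warnings;

    # No "sub usage;" line is needed above this point: the parentheses mark
    # usage() as a function call even though the sub is defined further down.
    my $text = usage();
    print $text;

    sub usage {
        return "Usage:\n xml_parser path=<directory> file=<xml_file>\n";
    }
    ```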