Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Perl Monks, I have a 13GB xml file that I'm trying to parse through using XML::Twig, and after getting about half way though the xml file (88,144,211 lines of 172,881,183) the parsing stopped and threw the error:

not well-formed (invalid token) at line 88144211, column 36, byte -1674366310 at /usr/perl5/site_perl/5.12/i86pc-solaris-64int/XML/Parser.pm line 187 at .../bin/xml_parser.pl line 54 at .../bin/xml_parser.pl line 54

I used sed to pull the not well formed line from the xml file, along with the preceeding 10 lines, which I've pasted here prefixed with the line numbers, and I think it's actually line 88144210 that caused my issue.

sed -n "88144200,88144211p;88144211q;" huge_xml_file.xml

88144200: <es:vsDataReportConfigSearch> 88144201: <es:a1a2SearchThresholdRsrp>-110</es:a1a2SearchThresholdRsrp +> 88144202: <es:a1a2SearchThresholdRsrq>-195</es:a1a2SearchThresholdRsrq +> 88144203: <es:a2CriticalThresholdRsrp>-122</es:a2CriticalThresholdRsrp +> 88144204: <es:a2CriticalThresholdRsrq>-195</es:a2CriticalThresholdRsrq +> 88144205: <es:hysteresisA1A2SearchRsrp>30</es:hysteresisA1A2SearchRsrp +> 88144206: <es:hysteresisA1A2SearchRsrq>150</es:hysteresisA1A2SearchRsr +q> 88144207: <es:hysteresisA2CriticalRsrp>10</es:hysteresisA2CriticalRsrp +> 88144208: <es:hysteresisA2CriticalRsrq>10</es:hysteresisA2CriticalRsrq +> 88144209: <es:timeToTriggerA1Search>640</es:timeToTriggerA1Search> 88144210: <es 88144211: <es:lbUtranB1ThresholdRscpOffset>0</es:lbUtranB1ThresholdRsc +pOffset><es:lbQciProfileHandling>1</es:lbQciProfileHandling>

Problem for me is, that I have no way of fixing these large XML files before I parse them, and I have to parse them as is. Is there a way to ignore the not well-formed lines and have my script continue on with parsing the xml file? Here is my script.

use strict; use XML::Twig; my @moList = qw(es:vsDataExternalENodeBFunction es:vsDataTermPointToEN +B es:vsDataEUtranCellFDD es:vsDataEUtranFreqRelation es:vsDataEUtranC +ellRelation es:vsDataENodeBFunction es:vsDataExternalEUtranCellFDD); # Subroutine declarations sub handle_mo; sub usage; # Parameter hash my($key,$value,%param); foreach my $item (@ARGV){ my($key,$value) = split /=/, $item; $param{$key} = $value; } # Required parameters if (!defined($param{"path"})){ usage; die "No path defined\n"; } if (!defined($param{"file"})){ usage; die "No xml file defined\n"; } my $path2xml = $param{"path"}; my $filename = $param{"file"}; my %handlers = map {$_ => \&handle_mo} @moList; my $twig = new XML::Twig( twig_roots => \%handlers); $filename = $path2xml . "/" . $filename; $twig->parsefile($filename); my $root = $twig->root; print "Parsing completed\n"; # Subroutines sub usage { print "Usage:\n xml_parser path=<directory> file=<xml_file>\n"; } sub handle_mo { my ( $t, $elt) = @_; print $elt->print, "\n"; $t->purge; }

Which I call using: ./xml_parser.pl path=/tmp file=huge_xml_file.xml > huge_xml_file.parsed.xml

I need to process these large xml files on a daily basis, and each one takes roughly 4 hours to parse on my current system. I never know when one of these miss formed lines will appear, and they are rare. Since there is only one bad line in a 172,881,183 line xml file, I'm wondering is there a way for my parser to ignore these lines rather than throwing the error?


In reply to Ignoring not well-formed (invalid token) errors by brettski

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (2)
As of 2023-01-29 15:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?