XML processing taking too much time

by koti688 (Sexton)
on Mar 26, 2009 at 07:43 UTC ( [id://753359] )

koti688 has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following packages:

use XML::DOM;
use XML::DOM::XPath;
use XML::Writer;


I want to process some very large files, like 2 GB, 3 GB and 4 GB XML files, but processing them with the above packages takes too much time (3 to 4 hours). It works fine for small files. Is there any other way (any other packages) to reduce the time for processing huge files? My version of Perl is v5.8.8, built for MSWin32-x86-multi-thread.
I am using the PPM GUI to install packages.

I did some googling and found that XML::Twig is better at processing very large files, but I am not able to find that package in my PPM GUI. Please help/advise.

Thanks in advance.

Replies are listed 'Best First'.
Re: XML processing taking too much time
by mirod (Canon) on Mar 26, 2009 at 07:57 UTC

    You can find PPMs for XML::Twig in Kobe's repository: http://cpan.uwinnipeg.ca/module/XML::Twig.

    That said, I don't know how you load a 4 GB file with XML::DOM; how much memory do you have on that machine? 40 GB? With XML::Twig you can process parts of the XML (twigs, as opposed to the whole tree ;--), so you should be able to keep memory usage lower, maybe much lower, depending on what you need to do, which should speed up processing. But if you can load the entire tree in memory and you can install libxml2, then you can also try XML::LibXML: porting the XML::DOM code would be much easier in this case, and XML::LibXML is much faster than XML::DOM.
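    For illustration, a minimal sketch of the XML::LibXML route (the file name is made up, the XPath expressions match the sample posted below, and the whole document has to fit in memory):

    use XML::LibXML;

    # parse the whole file into a DOM tree, then query it with XPath,
    # much like the XML::DOM + XML::DOM::XPath combination
    my $doc = XML::LibXML->new->parse_file( 'my_big_fat_xml_file.xml' );

    my @keys   = map { $_->textContent } $doc->findnodes( '/SigData/KVPair/Key' );
    my @values = map { $_->textContent } $doc->findnodes( '/SigData/KVPair/Value' );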

      Hmm, yes. My memory is only 2 GB. :(

      My XML contains multiple blocks of data. One block looks like this:

      <SigData>
      <KVPair>
      <Key>eb08f9990ae6545f9ea625412c71f24f7bf007ed</Key>
      <Value>c73df5228c35c419f884ba9571310cd7</Value>
      </KVPair>
      </SigData>


      I need to load the <Key> and <Value> elements of the tree into arrays like this:

      my @keys = getValuesFromPath($sigData ,"/SigData/KVPair/Key");
      my @values = getValuesFromPath($sigData ,"/SigData/KVPair/Value");

      So you want me to use XML::LibXML along with XML::Twig?

        I was just surprised that you could use XML::DOM at all on files of that size. And it looks like you can't, actually: a 1 GB XML file would take at least 8 GB in memory using XML::DOM. So it might be interesting to know how you did it. What I meant was that if you had been able to do it, by throwing large amounts of memory at the problem, then XML::LibXML would have been an option.

        With XML::Twig you can very easily extract the k/v pairs:

        my( @keys, @values);
        my $t= XML::Twig->new(
            twig_roots => {
                SigData => sub {
                    push @keys,   $_->field( 'Key');
                    push @values, $_->field( 'Value');
                    $_->purge;    # free this twig once the pair is stored
                },
            },
        )->parsefile( "my_big_fat_xml_file.xml");

        Of course the @keys and @values arrays are going to be huge too, so you might still want to add a few GB of RAM to your machine, but at least the XML structure will never take up more than a few bytes.

        Other possible options are XML::Rules (I expect jenda to show up and give you an example as soon as he wakes up), and maybe the new XML::Reader, which seems quite appropriate. XML::LibXML's pull mode might also be appropriate, but I have never used it so I can't comment on it.
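        For what it's worth, here is a rough, untested sketch of the pull-mode route via XML::LibXML::Reader (same made-up file name as above):

        use XML::LibXML::Reader;

        my $reader = XML::LibXML::Reader->new( location => 'my_big_fat_xml_file.xml' )
            or die "cannot open file";

        my( @keys, @values);
        while ( $reader->read ) {
            next unless $reader->nodeType == XML_READER_TYPE_ELEMENT
                     && $reader->name eq 'KVPair';
            # copy just this KVPair subtree into a small DOM fragment
            my $pair = $reader->copyCurrentNode(1);
            push @keys,   $pair->findvalue( 'Key');
            push @values, $pair->findvalue( 'Value');
            $reader->next;    # skip past the subtree we just copied
        }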

Re: XML processing taking too much time
by dHarry (Abbot) on Mar 26, 2009 at 09:10 UTC

    I have parsed huge XML files, up to 2 GB, and I can tell you it takes time, especially if you take an XML-ish approach, i.e. use an XML parser. Of course any DOM approach is out of the question; loading such a document into memory is asking for trouble.

    It depends of course on the complexity of the XML file, the hardware configuration, etc., but if you parse huge XML files you basically have to be patient. Don't expect miracles from other modules; there is no silver bullet when parsing huge XML files.

    In my situation, parsing the file and updating some nodes took over 1 hour for a 1 GB file with XML::Twig. For my requirements this was not sufficient.

    Some ideas most of which I explored in the past:

    • Rethink the problem! Maybe it's possible to decrease the size of the XML files; XML files of 3-4 GB do sound a bit weird/large.
    • Map the XML structure onto a relational database, do a bulk load, and let the DB do the work for you.
    • Choose a non-XML approach: instead of parsing the file with a parser you might opt for a handcrafted Perl solution (see the sketch after this list).
    • Take a look at other environments; I got very good performance out of Xalan using SAX. There is a C++ version if you're allergic to Java ;)
    • There are native XML databases with good performance, but this most likely means spending money!
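    As a deliberately fragile illustration of the handcrafted route, assuming every <Key> and <Value> sits on its own line exactly as in the sample above:

    open my $fh, '<', 'my_big_fat_xml_file.xml' or die "open: $!";
    my( @keys, @values);
    while ( my $line = <$fh> ) {
        # no XML parsing at all: match the tags with plain regexes
        push @keys,   $1 if $line =~ m{<Key>([^<]+)</Key>};
        push @values, $1 if $line =~ m{<Value>([^<]+)</Value>};
    }
    close $fh;

    This breaks the moment the layout changes, so it only makes sense when you fully control the files.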

    HTH,
    dHarry

Re: XML processing taking too much time
by Anonymous Monk on Mar 26, 2009 at 07:47 UTC
Re: XML processing taking too much time
by pajout (Curate) on Mar 26, 2009 at 10:57 UTC
    I would like to emphasize two ways:
    1. Not recommended: If you are absolutely sure about the XML format (considering newlines etc.), you can read the file line by line and just apply regexes to it.
    2. Recommended: Hook the events of some XML::Parser.
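    For example, a minimal event-driven sketch along those lines (the handler logic is illustrative and the file name is made up):

    use XML::Parser;

    my( @keys, @values);
    my $text = '';
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { $text = ''; },    # reset the text buffer at each element
            Char  => sub { my( undef, $str) = @_; $text .= $str; },
            End   => sub {
                my( undef, $elem) = @_;
                push @keys,   $text if $elem eq 'Key';
                push @values, $text if $elem eq 'Value';
            },
        },
    );
    $parser->parsefile( 'my_big_fat_xml_file.xml');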
