The best way to handle different type of XML files

mahira has asked for the wisdom of the Perl Monks concerning the following question:

Re: The best way to handle different type of XML files by toolic (Bishop) on Nov 21, 2009 at 13:30 UTC
XML::Simple works well if the structure of your XML is simple. The module's author summarizes its capabilities quite nicely in the POD and suggests alternative modules, depending on your needs: http://search.cpan.org/~grantm/XML-Simple-2.18/lib/XML/Simple.pm#WHERE_TO_FROM_HERE? The XML::Simple module review also offers advice on when it is appropriate to use XML::Simple. XML::Twig has suited my general-purpose XML needs.
Re^2: The best way to handle different type of XML files by ikegami (Patriarch) on Nov 21, 2009 at 15:44 UTC
XML::Simple can't even handle elements that can repeat a variable number of times without making a mess. You need to hold its hand or you won't be able to predict the structure you'll get. It's much less trouble to get a consistent tree every time. That's why I use (the much faster) XML::LibXML. As for the variations between the document formats, it becomes a question of using the right XPath for the document in question. You can use a lookup table for that. `my $xpath = $xpath_by_doctype{$doctype};` Other solutions based on XPath (like XML::Twig) should do fine too.
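To make the lookup-table idea concrete, here is a minimal sketch using XML::LibXML. The document types, the element names, and the choice of the root element name as the "doctype" key are all assumptions made up for illustration; real vendor formats will need their own keys and paths.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical lookup table: root element name => XPath to the product title.
my %xpath_by_doctype = (
    catalog   => '/catalog/product/title',
    inventory => '/inventory/section/items/item/name',
);

# Return the titles found in one XML string, whatever its format.
sub extract_titles {
    my ($xml) = @_;
    my $doc     = XML::LibXML->load_xml(string => $xml);
    # Assumption: the root element name is enough to identify the format.
    my $doctype = $doc->documentElement->nodeName;
    my $xpath   = $xpath_by_doctype{$doctype}
        or die "No XPath registered for document type '$doctype'\n";
    return map { $_->textContent } $doc->findnodes($xpath);
}

my @titles = extract_titles(
    '<inventory><section><items><item><name>Widget</name></item></items></section></inventory>'
);
print "$_\n" for @titles;
```

Supporting a new vendor format then means adding one entry to the table rather than another branch of parsing code.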
Re^3: The best way to handle different type of XML files by mpeever (Friar) on Nov 21, 2009 at 20:22 UTC
I've found XML::Simple is really useful for prototyping a solution, frequently with either a dumbed-down schema or a subset of a real document. It lets me stub out the XML bit of the program so I can get the rest of the logic flowing. After that, I usually end up rewriting that code to use XML::XPath or its ilk. I prefer to use XML::Simple when possible, because it takes a lot less care and feeding in the simplest cases. But like you pointed out, it blows up pretty quickly once the document is more than trivial. Of course, it sounds like the OP has gotten past the prototype stage already. Frankly, it sounds like the project got a lot farther on XML::Simple than I would have expected.
Re^4: The best way to handle different type of XML files (Why I don't think much of XML::Simple) by ikegami (Patriarch) on Nov 21, 2009 at 23:14 UTC
Re^5: The best way to handle different type of XML files by mpeever (Friar) on Nov 22, 2009 at 22:02 UTC
Re: The best way to handle different type of XML files by Tanktalus (Canon) on Nov 21, 2009 at 15:20 UTC
XML::Twig + XPath (either the XPath built into XML::Twig, or, if you need a bit more, XML::XPath will work with twig objects). Suddenly, you won't have to care about depth, just names. You may have to care about multiple names for a given value, such as //middle//name, or just //name, or include attributes or whatever, but it's all hierarchies of names, regardless of depth. Of course, it's always nice to have a standard format for all your vendors to use, but, unless you're Walmart, good luck with that. :-)
Re: The best way to handle different type of XML files by 7stud (Deacon) on Nov 21, 2009 at 16:38 UTC
"Currently I am working on a project that will handle several XML files from different sources with different formats. I am trying to handle all of them with a single piece of code but it is hard because nodes are different, depth is different etc." I don't see how that is possible. If you have one XML file that has a tag called <super_duper_product> nested inside one other tag, and another XML file that has a tag called <item89001> nested inside three tags, I don't think there is any way you can use the same script to extract both tags. There has got to be some pattern you can exploit: either the tags have similar names, or they have similar locations in the document tree, or they have identical siblings or child elements, or similar text. Something. As the first responders noted, XPath makes it easy to find a specific tag name anywhere in the document. XPath lets you treat an XML file as if it were a directory on your file system. You locate elements using path notation: /bookstore/book/title. XPath also lets you skip the leading 'directories' with //, like this: `findnodes('//book');` which searches for all <book> elements anywhere in the document. The XML::LibXML module provides that findnodes method, which allows you to specify an XPath for the search.
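A small sketch of the difference between an absolute path and a leading //, using XML::LibXML. The bookstore document is invented for illustration; only the /bookstore/book/title path comes from the post above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample: one <book> directly under the root, one nested deeper.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<bookstore>
  <book><title>Learning Perl</title></book>
  <shelf>
    <book><title>Programming Perl</title></book>
  </shelf>
</bookstore>
XML

# Absolute path: only books that are direct children of <bookstore>.
my @direct = map { $_->textContent } $doc->findnodes('/bookstore/book/title');

# Leading //: books at any depth in the document.
my @anywhere = map { $_->textContent } $doc->findnodes('//book/title');

print "direct:   $_\n" for @direct;
print "anywhere: $_\n" for @anywhere;
```

The absolute path misses the book inside <shelf>, while the // form finds both, which is why it helps when the same element sits at different depths in different files.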
Re: The best way to handle different type of XML files by grantm (Parson) on Nov 23, 2009 at 00:12 UTC
As the author of XML::Simple, I recommend XML::LibXML for anything but the most trivial of XML work.
Re: The best way to handle different type of XML files by pajout (Curate) on Nov 23, 2009 at 13:11 UTC
It is hard to choose a tool, because you never know all the formats you will have to process. That is my experience from a very similar situation. The crucial question is "How do I implement my logic on various, mostly unpredictable data structures?" I think that XML::Simple is good for the simplest cases. It takes some experience with these tools, but consider XML::Twig, XML::Rules, tools performing XSLT transformations or, more generally, using the XPath language, iterating over a DOM structure, iterating over the object structure of XML::Trivial (my kid :) or, for instance, the more esoteric STX language, http://stx.sourceforge.net/ . In principle, you can save some work by using a "scripting" language oriented toward XML processing (XSLT, STX, XPath), but you can lose some power. The opposite trade-off applies when you use Perl to iterate over a Perl data/object representation of the document. Another problem could be the processing of huge documents. In that case XML::Twig or STX can work for you, but consider raw processing of XML::Parser output too.
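For the huge-document case, one possible approach with XML::Twig is to register a handler per record element and purge what has already been parsed. This is only a sketch: the <item> and <name> element names are invented, and collecting results in an array is for demonstration (a real program would more likely write each record out as it goes).

```perl
use strict;
use warnings;
use XML::Twig;

my @names;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <item> element has been parsed.
        item => sub {
            my ($t, $item) = @_;
            push @names, $item->first_child_text('name');
            $t->purge;    # release the parts of the tree parsed so far
        },
    },
);

# A string keeps the example self-contained; parsefile() takes a file name.
$twig->parse(
    '<items><item><name>a</name></item><item><name>b</name></item></items>'
);

print "$_\n" for @names;
```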
Re: The best way to handle different type of XML files by tfrayner (Curate) on Nov 24, 2009 at 08:59 UTC
I'm also a fan of XML::LibXML, but I'd also put in a suggestion that you may want to consider combining it with XML::LibXSLT. I've recently been through a similar situation (although I didn't have the luxury of only dealing with XML as input), and I found that life became much simpler when I developed a single unified XML schema that closely reflected my MySQL database table structure, for which I could then easily write a database loader module. At that point it became fairly straightforward to write XSLT stylesheet documents to convert the myriad input XML formats into the database-compliant schema format. It's possible this approach is a bit over-engineered compared to the XML::Simple approach, but then again it is much easier to maintain in the long run. Best of luck, Tim
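As a rough sketch of that XSLT approach, the following uses XML::LibXSLT to convert one input format into a unified one. The stylesheet, the <feed>/<item>/<label> input format, and the <products> target format are all hypothetical; in practice there would be one stylesheet per vendor format, each producing the same target schema.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical stylesheet: maps a vendor's <item><label> to <product><name>.
my $stylesheet_xml = <<'XSL';
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no"/>
  <xsl:template match="/">
    <products>
      <xsl:for-each select="//item">
        <product><name><xsl:value-of select="label"/></name></product>
      </xsl:for-each>
    </products>
  </xsl:template>
</xsl:stylesheet>
XSL

my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet(
    XML::LibXML->load_xml(string => $stylesheet_xml)
);

# Convert one vendor document (as a string) into the unified format.
sub normalize {
    my ($xml) = @_;
    my $source = XML::LibXML->load_xml(string => $xml);
    return $stylesheet->transform($source);    # an XML::LibXML document
}

my $unified = normalize(
    '<feed><group><item><label>Widget</label></item></group></feed>'
);
print $stylesheet->output_as_bytes($unified);
```

Once every input has been normalized this way, the loader only ever has to understand the single target format.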