XML parsing vs Regular expressions

karpatov has asked for the wisdom of the Perl Monks concerning the following question:


Re: XML parsing vs Regular expressions
by ajt (Prior) on Feb 16, 2008 at 21:59 UTC

Many an insane person started out sane, before they tried to use regular expressions on XML. While it starts easy, it very quickly descends into chaos. As a general rule, if you are working with XML, use a module built on a real XML parser of some kind. XML::LibXML can be complicated to learn, but it is very fast and complete. XML::Twig is another fast tool, and it even includes a tool for applying regular expressions to XML...

--
ajt


Re^2: XML parsing vs Regular expressions

by Joost (Canon) on Feb 16, 2008 at 23:57 UTC

ajt

In other circumstances, just the fact that a real XML parser will throw a huge tantrum on invalid input will already save you a lot of work. And that's without mentioning some of the really nice interfaces that modules like XML::Twig can provide.
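A minimal sketch of that "tantrum", assuming XML::LibXML is installed. The broken input string is made up for illustration; with default options `load_xml` dies on input that is not well-formed, so the error can be caught with `eval`:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical broken input: <record> is never closed.
my $broken = '<records><record id="1"></records>';

# load_xml dies on a parse error, so $doc stays undefined here.
my $doc = eval { XML::LibXML->load_xml(string => $broken) };

if (!defined $doc) {
    print "parser refused the input\n";
}
```

A regex run over the same string would happily return whatever it matched, and the damage would only show up later.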

"What should it profit a man, if he should win a flame war, yet lose his cool?"


Re: XML parsing vs Regular expressions
by Cody Pendant (Prior) on Feb 17, 2008 at 05:23 UTC

Does it:

ignore code which is commented out?
allow for attribute order changing?
cope with the characters < and > appearing inside attributes, or CDATA sections?
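To make those questions concrete, here is a small sketch comparing a naive regex with XML::LibXML (assumed to be installed). The sample document and the pattern are invented for illustration and are not taken from the original question:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: one commented-out record, one record with its
# attributes in a different order and a literal '>' in a value.
my $xml = <<'XML';
<records>
  <!-- <record id="1" name="old"/> -->
  <record name="a > b" id="2"/>
  <record id="3" name="plain"/>
</records>
XML

# Naive regex: assumes id always comes first and knows nothing of comments.
my @regex_ids = $xml =~ /<record id="(\d+)"/g;

# Parser: comments are not elements, and attribute order does not matter.
my $doc        = XML::LibXML->load_xml(string => $xml);
my @records    = $doc->findnodes('//record');
my @parser_ids = map { $_->getAttribute('id') } @records;
my @names      = map { $_->getAttribute('name') } @records;

print "regex:  @regex_ids\n";
print "parser: @parser_ids\n";
```

The regex picks up the commented-out record and misses the reordered one; the parser returns the two live records and hands back `a > b` intact as an attribute value.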

Nobody says perl looks like line-noise any more
kids today don't know what line-noise IS ...


Re: XML parsing vs Regular expressions
by grinder (Bishop) on Feb 17, 2008 at 12:14 UTC

It depends whether you're giving or taking.

If you are taking the output from one single program emitting its own brew of XML, you will usually find that it is always emitted in exactly the same way, often pretty-printed with indented nested elements, or hard wrapped against column zero all the way down.

It is extremely rare (in my experience) to encounter XML emitted by a program that is neatly word-wrapped at or before column 72. After all, that takes a lot more work, and most sane programmers have better things to do with their time. Once you figure out empirically how a given program emits its XML, you can count on it being invariant.

So, as much as it may shock the purists, you can quite easily get away with picking out what you want from a big XML file with a regexp or two, especially if you don't have to worry about context. By that I mean, for example, extracting the contents of element <HG> if the parent is <BAR>, except when the grandparent is <ZONK>.
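For contrast, the context-dependent case is where a parser earns its keep. A sketch assuming XML::LibXML, with a made-up document built around the element names from the example above; the XPath expression is one possible way to state the rule:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data using the <HG>, <BAR> and <ZONK> names.
my $xml = <<'XML';
<root>
  <BAR><HG>keep</HG></BAR>
  <ZONK><BAR><HG>skip</HG></BAR></ZONK>
  <FOO><HG>other</HG></FOO>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# <HG> whose parent is <BAR>, unless that <BAR> sits directly under <ZONK>.
my @hits = map { $_->textContent }
           $doc->findnodes('//BAR[not(parent::ZONK)]/HG');

print "$_\n" for @hits;
```

Expressing the same rule as a regexp means tracking nesting by hand, which is exactly the kind of context the paragraph above suggests avoiding.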

You just need a good test-suite to cover your a.. code, to ensure that things don't break when the source program is upgraded.

You cannot adopt this approach when it is you who has written the XML specification and you're dealing with how people give you their information according to your spec. Everyone will do it differently and you will indeed have to parse it. Update: or you're taking the information from a web service and thus don't have any control or forewarning when the originating program may be upgraded.

That's been my rule in dealing with SGML and XML for over 15 years, and it has served me well so far.

• another intruder with the mooring in the heart of the Perl


Re: XML parsing vs Regular expressions
by planetscape (Chancellor) on Feb 19, 2008 at 16:37 UTC

I note that you say:

from quite big(100 000 records)

I managed to segfault when using regexes to parse very large HTML files; I am certain you could manage to do the same using regexes to parse very large XML files. ;-)

In other words, don't. Use a module, such as XML::Twig.
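A rough sketch of how that might look for a file with many records, assuming XML::Twig is installed. The element and attribute names (`record`, `id`, `name`) are guesses, since the actual file layout is not shown, and the sample is parsed from a string so the snippet is self-contained; for a real file, `$twig->parsefile($path)` would replace the `parse` call:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample standing in for the large file.
my $xml = <<'XML';
<records>
  <record id="1"><name>alpha</name></record>
  <record id="2"><name>beta</name></record>
</records>
XML

my @seen;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each complete <record> element.
        record => sub {
            my ($t, $record) = @_;
            push @seen,
                $record->att('id') . '=' . $record->first_child_text('name');
            $t->purge;    # release what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @seen;
```

Because each record is handled and purged as soon as it is complete, memory use stays roughly flat no matter how many records the file holds.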

HTH,

planetscape


Re: XML parsing vs Regular expressions
by Jenda (Abbot) on Feb 18, 2008 at 14:33 UTC

I'd definitely recommend going with a proper parser. There are several styles of parsers, good for different types of uses. For what you seem to need in this case you might like XML::Rules. It's designed to let you select the things you are interested in and tweak the structure of the data as it's extracted from the XML file. You might like the style ... or not. In either case it's good to try different styles.

Jenda
Support Denmark!
Defend the free world!


