perlmeditation
mirod
<p>After reading a [id://46517|recent question], and also some
[id://28388|older ones], I thought it would be worth mentioning
the basic rule of XML processing: <b>Use a parser!</b></p>
<p>Since I know you won't take my word for it, here are
just a few examples of things that might (that <b>will</b>)
go wrong if you use plain regexps:</p>
<ul><li><p>XML comments:
<code> <tag>value 1</tag>
<!-- <tag>value 2</tag> -->
<tag>value 3</tag></code>
will probably hurt you first, then get you to write quite
tricky regexps,</p></li>
<li><p>entities:
<code> <tag>value 1</tag>
&v2;
<tag>value 3</tag></code>
what will your regexp do with the <tt>&v2;</tt> entity? Will it
look in the appropriate place (right in the DTD, or in a separate
file, maybe remote) to get the entity declaration:
<code><!ENTITY v2 "<tag>value 2</tag>"></code></p></li>
<li><p>CDATA:
<code> <tag>value 1</tag>
<tag><![CDATA[ <tag2>value 2</tag2> ]]></tag>
<tag>value 3</tag></code>
the data inside the CDATA section must be treated literally: there is
no <tt>tag2</tt> element in the document,</p></li>
<li><p>namespaces:
<code> <mynamespace:tag>value 1</mynamespace:tag>
<theirnamespace:tag>value 3</theirnamespace:tag></code>
the 2 <tt>tag</tt> elements may or may not refer to the same
element, depending on the namespace declarations in the
document.</p></li>
</ul>
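<p>To make the first three cases concrete, here is a minimal sketch of how a parser copes with them. It assumes [XML::Parser] is installed; the <tt>doc</tt> wrapper element, the internal DTD subset and the variable names are only there for illustration, and the values are renumbered to fit one document.</p>

```perl
use strict;
use warnings;
use XML::Parser;

# One document combining a comment, an internal entity and a CDATA section.
# The <doc> root and the DOCTYPE are illustrative, not from any real file.
my $xml = <<'XML';
<!DOCTYPE doc [
  <!ENTITY v2 "<tag>value 2</tag>">
]>
<doc>
  <tag>value 1</tag>
  <!-- <tag>commented out</tag> -->
  &v2;
  <tag><![CDATA[ <tag2>value 3</tag2> ]]></tag>
</doc>
XML

my @values;        # text of each <tag> element, in document order
my $in_tag = 0;    # true while we are inside a <tag> element

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $name) = @_;
            if ($name eq 'tag') { $in_tag = 1; push @values, '' }
        },
        # Char may be called several times for one piece of text,
        # so always append rather than assign.
        Char => sub {
            my ($expat, $text) = @_;
            $values[-1] .= $text if $in_tag;
        },
        End => sub {
            my ($expat, $name) = @_;
            $in_tag = 0 if $name eq 'tag';
        },
    },
);

$parser->parse($xml);
print "[$_]\n" for @values;
```

<p>With this input the sketch should report three <tt>tag</tt> values: the commented-out element is never seen, the entity is expanded into a real <tt>tag</tt> element, and the CDATA content arrives as plain text, <tt>tag2</tt> markup included. No regexp had to know about any of it.</p>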
<p>Not to mention the usual kind of problem with evolving XML, when
the content of the <tt>tag</tt> element starts including additional
mark-up, when the <tt>tag</tt> element gets a bunch of attributes,
or when <tt>tag2</tt> elements start popping up in between
<tt>tag</tt> elements.</p>
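<p>The same goes for evolving XML. Below is a rough sketch, again assuming [XML::Parser], of handlers that keep working when <tt>tag</tt> grows attributes, contains extra mark-up, or gets <tt>tag2</tt> neighbours. The <tt>id</tt> and <tt>lang</tt> attributes and the <tt>b</tt> element are made up for the example.</p>

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical "evolved" document: attributes, nested mark-up,
# a start tag split over two lines, and a <tag2> in between.
my $xml = <<'XML';
<doc>
  <tag id="t1" lang="en">value <b>1</b></tag>
  <tag2>something new</tag2>
  <tag
     id="t2">value 2</tag>
</doc>
XML

my @tags;          # one { id => ..., text => ... } hash per <tag>
my $depth = 0;     # > 0 while inside a <tag>, counts nested elements

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $name, %atts) = @_;
            if ($depth) {
                $depth++;                      # mark-up nested inside <tag>
            }
            elsif ($name eq 'tag') {
                $depth = 1;
                push @tags, { id => $atts{id}, text => '' };
            }
        },
        Char => sub {
            my ($expat, $text) = @_;
            $tags[-1]{text} .= $text if $depth;
        },
        End => sub { $depth-- if $depth },
    },
);

$parser->parse($xml);
printf "%s: %s\n", $_->{id}, $_->{text} for @tags;
```

<p>The handlers only care about element names and nesting, so attribute order, line breaks inside a start tag and unrelated elements do not matter to them.</p>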
<p>You might think that you don't care about all of those, your XML is simple
and you don't need no stinkin' namespaces. WRONG! You are limiting
yourself to a subset of XML, but you are NOT calling it a subset. And
either you or (pity them!) the people who will maintain your code won't
remember that it is only a subset, and what subset. Plus you might have
total control over this pseudo-XML today but tomorrow? Maybe you will
receive it from some external source, or you will use an off-the-shelf
tool to create it.</p>
<p>Plus, those extra features that your lovingly crafted
regexps don't grok might come in handy in the future. Will you add them
to your software? Will you end up
writing your own regexp-based parser? It has been done, by the way; it's
just that [XML::Parser] is faster for non-trivial XML, and I happen to
trust James Clark more than myself when it comes to writing a parser.</p>
<p>So please, any time you want to process XML, especially if the software is
going to be used for a while, please,</p>
<p align="center"><b>Use the Parser, Luke!</b></p>