That's a very dangerous path you are taking here: trying
to process XML with regexps. "On XML parsing" gives a bunch of reasons why you should not do it in general, but here is a little test. This document:
<?xml version="1.0"?>
<doc><elt>a regular elt with a > in it</elt>
<pre> spaces are significant in this element
as well as line returns</pre>
<elt att="this is valid >" />
<!-- <elt>commented out</elt><elt>(I mean all of it)</elt> -->
<elt2><sub>if a \n is inserted before the sub element then
the document is still well-formed but not valid anymore,
as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]></sub></elt2>
<elt><![CDATA[<toto><tata>booh<tutu>]]></elt>
<elt>text with an <sub>embedded</sub> element</elt>
</doc>
gives the following output:
<?xml version="1.0"?>
<doc>
<elt>a regular elt with a > in it</elt>
<pre> spaces are significant in this element as well as line returns</pre>
<elt att="this is valid >" />
<!-- <elt>commented out</elt>
<elt>(I mean all of it)</elt> -->
<elt2>
<sub>if a \n is inserted before the sub element then the document is still well-formed but not valid anymore, as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]>
</sub>
</elt2>
<elt>
<![CDATA[<toto>
<tata>booh<tutu>]]>
</elt>
<elt>text with an <sub>embedded</sub> element</elt>
</doc>
Your tool does OK in a lot of situations, except:
- the comment is formatted too, no big deal
- the formatting in the pre element is broken, which can be very annoying
- the CDATA section is broken by the formatting (a line break ends up inside its content)
- potentially the most dangerous, depending on how you work with XML: the valid original document is now invalid, as the \n inserted before the <sub> element in elt2 is significant. This kind of error can be a nightmare to track down.
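To make the CDATA and attribute points concrete, here is a minimal sketch in Python rather than Perl (the same point applies to any real parser). The `naive_break` function is a hypothetical stand-in for a regex-based formatter, not your actual code; it just puts a newline between every `>` and `<`. A real parser knows that the `>` in the attribute and the markup-looking text inside the CDATA section are plain data; the regex does not.

```python
import re
import xml.etree.ElementTree as ET

# A cut-down version of the test document above.
DOC = (
    '<doc><elt att="this is valid >" />'
    '<!-- <elt>commented out</elt><elt>(all of it)</elt> -->'
    '<elt><![CDATA[<toto><tata>booh<tutu>]]></elt>'
    '</doc>'
)

def naive_break(xml_text):
    # Hypothetical regex formatter: newline between any ">" and "<".
    return re.sub(r'>\s*<', '>\n<', xml_text)

root = ET.fromstring(DOC)                    # parsed original
reparsed = ET.fromstring(naive_break(DOC))   # parsed "formatted" copy

# The parser sees two elt children (the comment is not an element),
# a ">" inside the attribute value, and the CDATA content as text.
attribute_value = root[0].get("att")
original_cdata = root[1].text
changed_cdata = reparsed[1].text

print(repr(attribute_value))
print(repr(original_cdata))
print(repr(changed_cdata))   # differs: newlines now sit in the data
```

The "formatted" copy is still well-formed, so nothing complains, but the character data an application reads from it is no longer what the author wrote.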
And I am not even talking about problems with documents in different encodings, which could trip up your regexps...
The only safe way to break an XML document without
knowing its DTD is to put the breaks in the only place
where they cannot be significant: within the tags!
That might not be pretty but it is readable:
<?xml version="1.0"?>
<doc
><elt
att="val">you can also break between the tag name and the attribute and between attributes</elt></doc>
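One caveat on where exactly inside a tag the break can go: the XML grammar allows white space after the element name, between attributes and just before the closing `>`, but not between the `<` and the name. As a rough check (a sketch in Python, with a hypothetical `flatten` helper and made-up sample strings), you can parse a compact document and a copy with breaks inside the tags and compare what the parser reports:

```python
import xml.etree.ElementTree as ET

COMPACT = '<doc><elt att="val" other="x">some text</elt></doc>'

# Line breaks only inside tags: after the name, between attributes,
# and before the closing ">".
BROKEN_IN_TAGS = (
    '<doc\n><elt\n att="val"\n other="x"\n>some text</elt\n></doc\n>'
)

# Line breaks between tags, as an indenting formatter would add them.
BROKEN_BETWEEN_TAGS = (
    '<doc>\n<elt att="val" other="x">some text</elt>\n</doc>'
)

def flatten(xml_text):
    # Hypothetical helper: list everything the parser reports per element.
    root = ET.fromstring(xml_text)
    return [(e.tag, dict(e.attrib), e.text, e.tail) for e in root.iter()]

same_inside = flatten(COMPACT) == flatten(BROKEN_IN_TAGS)
same_between = flatten(COMPACT) == flatten(BROKEN_BETWEEN_TAGS)

print(same_inside)
print(same_between)
```

Breaks inside the tags leave the parsed content untouched, while breaks between the tags show up as extra character data, which is exactly what can matter to a DTD or to an application.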
By the way, there are a number of modules on CPAN that do pretty printing of XML documents, such as XML::Handler::YAWriter or XML::Filter::Reindent. I have not tested them, and from reading the docs I am not sure they are what you are looking for (they are probably too slow and quite complex), but at least they would read the XML properly.