PerlMonks  

You have xml files where this formatting tool does not work?

by LupoX (Pilgrim)
on Jul 20, 2001 at 16:44 UTC ( [id://98392]=perlquestion )

LupoX has asked for the wisdom of the Perl Monks concerning the following question:

Hello Senior Monks, as a Perl kid, I am not sure that my formatting tool for XML files really works with all XML files. If you have proposals for how to make my tool work more generally, please offer them. If I made big mistakes in my code, or my code is not elegant, I would like to hear your admonition. Here's my script:

view source

pre_perlmonk@viot.de

Re: You have xml files where this formatting tool does not work?
by mirod (Canon) on Jul 20, 2001 at 18:21 UTC

    That's a very dangerous path you are taking here: trying to process XML with regexps. On XML parsing gives a bunch of reasons why you should not do it in general, but here is a little test:

    <?xml version="1.0"?> <doc><elt>a regular elt with a > in it</elt> <pre> spaces are significant in this element as well as line returns</pre> <elt att="this is valid >" /> <!-- <elt>commented out</elt><elt>(I mean all of it)</elt> --> <elt2><sub>if a \n is inserted before the sub element then the document is still well-formed but not valid anymore, as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]></sub></elt2> <elt><![CDATA[<toto><tata>booh<tutu>]]></elt> <elt>text with an <sub>embedded</sub> element</elt> </doc>

    gives the following output:

    <?xml version="1.0"?> <doc> <elt>a regular elt with a > in it</elt> <pre> spaces are significant in this element as well as line returns</pre> <elt att="this is valid >" /> <!-- <elt>commented out</elt> <elt>(I mean all of it)</elt> --> <elt2> <sub>if a \n is inserted before the sub element then the document is still well-formed but not valid anymore, as the DTD is <![CDATA[<!ELEMENT elt2 (#PCDATA|sub)>]]> </sub> </elt2> <elt> <![CDATA[<toto> <tata>booh<tutu>]]> </elt> <elt>text with an <sub>embedded</sub> element</elt> </doc>

    Your tool does OK in a lot of situations, except:

    • the comment is formatted too, no big deal
    • the formatting in the pre element is broken, which can be very annoying
    • the CDATA section breaks the formatting
    • potentially the most dangerous, depending on how you work with XML, is that the valid original document is now invalid, as the \n before the <sub> element in elt2 is significant. This kind of error can be a nightmare to track down.

    And I am not even talking about problems with documents in different encodings, which could trip your regexps...

    The only safe way to break an XML document without knowing its DTD is to put the breaks in the only place where they cannot be significant: within the tags!

    That might not be pretty but it is readable:

    <?xml version="1.0"?> < doc>< elt att="val">you can also break between the tag and the attribute and between attributes</elt></doc>
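
To make the "break within the tags" idea concrete, here is a rough sketch that is not part of the original reply. It assumes that white space just before the closing `>` or `/>` of a tag is never significant, so that is the only place it adds a newline. The helper name `break_inside_tags` is made up for illustration. It is a toy scanner rather than a parser: it skips comments, CDATA sections and processing instructions, but it makes no attempt to handle a DOCTYPE with an internal subset or documents in other encodings, so a real tool should still take its tag boundaries from an XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): add a newline just before the
# closing '>' or '/>' of every start, end and empty-element tag.
# Text, comments, CDATA sections and processing instructions are copied as-is.
sub break_inside_tags {
    my ($xml) = @_;
    my $out = '';
    pos($xml) = 0;
    while (pos($xml) < length $xml) {
        if    ($xml =~ /\G(<!--.*?-->)/gcs)          { $out .= $1 }  # comment
        elsif ($xml =~ /\G(<!\[CDATA\[.*?\]\]>)/gcs) { $out .= $1 }  # CDATA section
        elsif ($xml =~ /\G(<\?.*?\?>)/gcs)           { $out .= $1 }  # XML declaration or PI
        elsif ($xml =~ /\G(<![^>]*>)/gcs)            { $out .= $1 }  # simple declaration only
        elsif ($xml =~ /\G(<\/?[^\s<>\/]+(?:\s+[^\s=<>\/]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(\/?>)/gcs) {
            # Tag name plus quoted attributes, then the break, then '>' or '/>'.
            $out .= "$1\n$2";
        }
        elsif ($xml =~ /\G([^<]+)/gcs)               { $out .= $1 }  # character data
        else  { die "cannot tokenize input at offset " . pos($xml) . "\n" }
    }
    return $out;
}

my $sample = '<doc><elt att="this is valid >" />a > b<!-- <x>c</x> --></doc>';
print break_inside_tags($sample), "\n";
```

The quoted-attribute alternative is what keeps a `>` inside an attribute value from being mistaken for the end of the tag, and the comment and CDATA branches come first so that markup-looking text inside them is left alone.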

    By the way, there are a number of modules on CPAN that do pretty printing of XML documents, such as XML::Handler::YAWriter or XML::Filter::Reindent but I have not tested them and from reading the docs I am not sure they are what you are looking for (they are probably too slow and quite complex). But at least they would read the XML properly.

Re: You have xml files where this formatting tool does not work?
by Hofmator (Curate) on Jul 20, 2001 at 18:30 UTC

    Yes I have such files ... parsing XML is not as easy as it might seem, that's why there are the XML modules on CPAN. E.g. your program seems to have problems with nested tags like <ul><li><ul><li>1.1</li></ul></li></ul>.

    So let me just make some general remarks about your program:

    • post your code here - in that way more people will have a look at it. Only for really big lumps of code point to some other place. You can 'hide' the code behind <readmore> tags.
    • Use a consistent style of indenting; it improves the readability of your code. I would strongly suggest aligning the closing braces as follows:
      if ($read) {
          if ($more) {
              # do something
          }
      }
    • Have a look at the Getopt::... modules, especially Getopt::Long and Getopt::Declare (this one will satisfy your needs for command line processing in every respect - and I really mean *every* :). This makes your command line processing much easier to code and thus to read and understand.
    • Use warnings, either by specifying
      #!/usr/bin/perl -w
      # or
      use warnings;
      This assists you in locating possible errors.
    • Don't use global variables when it is avoidable. It starts getting really messy, once your programs grow ... use my, especially for variables that are only used locally (like e.g. $line)
    • You can simplify/optimize some of the following code (this is not a complete list!):
      foreach $line (@File_pre_format) {
          $file_pre_format .= $line;
      }
      # better:
      $file_pre_format = join '', @File_pre_format;

      $file_pre_format =~ s/\n//g;
      $file_pre_format =~ s/\t//g;
      # better:
      $file_pre_format =~ s/\n|\t//g;
      # or in this simple case even better:
      $file_pre_format =~ tr/\n\t//d;
    • and translate your error messages from German ;-)
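
Several of the remarks above can be put together in one small sketch. It is only an illustration: the option names `--indent` and `--output` and the helper names `parse_args` and `flatten_lines` are invented here, since the original script's options are not shown in this thread. It uses `use strict` and `use warnings`, keeps every variable lexical with `my`, lets Getopt::Long do the command line work, and uses `join` plus `tr` for the flattening step.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse a list of arguments into an option hash plus the leftover file names.
# The option names are assumptions made for this example.
sub parse_args {
    my @args = @_;
    my %opt  = (indent => 2);    # default indent width
    GetOptionsFromArray(\@args, \%opt, 'indent=i', 'output=s')
        or die "usage: $0 [--indent N] [--output FILE] input.xml\n";
    return (\%opt, @args);
}

# Join the lines, then delete every newline and tab in a single pass.
sub flatten_lines {
    my @lines = @_;
    my $text  = join '', @lines;
    $text =~ tr/\n\t//d;
    return $text;
}

my ($opt, @files) = parse_args('--indent', 4, 'in.xml');
print "indent=$opt->{indent} files=@files\n";
print flatten_lines("<a>\n", "\t<b/>\n", "</a>\n"), "\n";
```

In a real script you would pass `@ARGV` to `parse_args`. Keep in mind that deleting every newline and tab is exactly the step that loses significant white space in elements such as `<pre>`, so the `tr` line shows the idiom, not a safe way to reformat XML.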

    -- Hofmator
