Re: Funny characters in nodes (exactly zero) by tye (Sage)
on Jul 08, 2007 at 22:28 UTC
The XML standard is stupidly broken because the designers made proclamations like Tim Bray's: "XML dislikes [...] form-feed[s] [etc.] which have exactly zero shared semantics from system to system". Yes, we all know that no two people in the world ever used form-feed for the same thing. I can't even guess what anybody else would use it for. But surely not to represent a page break, since that is my personal use for it and there is exactly zero shared semantics for that character so nobody else uses it for that.
And the XML mindset of "we need the standard to require fatal errors for things that we dislike that others will surely see value in; otherwise, people will actually make XML useful by doing things we don't like" meant that XML 1.0 very thoroughly made sure that there was no reasonable way to get a form-feed character sent.
So the only choices you have when you have data containing a form-feed character are
Not surprisingly (to me, anyway), many XML parsers have actually chosen the first option above and the draft XML 1.1 standard even sees the light except in the case of nul characters (which we should be able to send as &#0; but I doubt even XML 1.1 will overcome previous stupidity to that extent).
So, to the horror or severe disappointment of some people, PerlMonks XML generation also defaults to option 1 above. This needs to be changed but nobody has ponied up the code to make option 2 the default instead, so it must not be too big of a deal. Certainly, stripping control characters out of the XML from PerlMonks is quite simple and then allows any compliant XML parser to be used on it.
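For illustration, the stripping step might look roughly like the following. This is only a sketch in Python (chosen arbitrarily, nothing PerlMonks-specific); the name `strip_illegal_xml_chars` and the sample `<node>` document are made up, and the pattern covers only the C0 control characters that XML 1.0 forbids (everything below U+0020 except tab, line feed and carriage return), not the other excluded code points such as U+FFFE.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow anywhere in a document:
# everything below U+0020 except tab (\x09), LF (\x0a) and CR (\x0d).
_ILLEGAL_XML10_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(xml_text):
    """Return xml_text with the forbidden control characters removed."""
    return _ILLEGAL_XML10_CONTROLS.sub("", xml_text)


# Hypothetical input: a node whose text contains a raw form-feed.
raw = "<node>page one\x0cpage two</node>"
clean = strip_illegal_xml_chars(raw)

# After stripping, a strict parser accepts the document.
root = ET.fromstring(clean)
print(root.text)
```

The form-feed is simply lost here, which is the price of this option.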
So just do that (strip them). Or, if you want to preserve control characters despite Tim's dislike, come up with your own private encoding, encode them, parse the XML, then decode them. Or find a more tolerant near-XML parser.
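One possible shape for such a private encoding, sketched in Python under my own assumptions: each forbidden control character becomes a backslash escape like `\x0c`, and literal backslashes are doubled so the mapping stays reversible. The scheme and the names `encode_controls` and `decode_controls` are invented for this example and are not any standard.

```python
import re
import xml.etree.ElementTree as ET

# Forbidden C0 controls plus the backslash, which serves as the escape marker.
_NEEDS_ESCAPE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\\]")
# An escaped token: a doubled backslash, or \x followed by two hex digits.
_ESCAPED = re.compile(r"\\(\\|x[0-9a-f]{2})")


def encode_controls(text):
    """Replace forbidden controls with \\xHH and double literal backslashes."""
    def repl(match):
        ch = match.group(0)
        return "\\\\" if ch == "\\" else "\\x%02x" % ord(ch)
    return _NEEDS_ESCAPE.sub(repl, text)


def decode_controls(text):
    """Undo encode_controls on a string taken from the parsed tree."""
    def repl(match):
        token = match.group(1)
        return "\\" if token == "\\" else chr(int(token[1:], 16))
    return _ESCAPED.sub(repl, text)


# Hypothetical document with a raw form-feed and a literal backslash.
raw_xml = "<node>page one\x0cpage two, path C:\\tmp</node>"

safe_xml = encode_controls(raw_xml)      # encode before parsing
root = ET.fromstring(safe_xml)           # strict parser is now happy
restored = decode_controls(root.text)    # decode each value after parsing
```

Encoding is applied to the whole raw XML text, so every text node and attribute value you read afterwards has to go through `decode_controls`.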
Update: Just for the sake of completeness, I should mention that encoding each form-feed character as a numeric character reference (such as &#12;) is an interesting-sounding option but it is also forbidden by the XML 1.0 standard and so has no advantages over violating the standard more simply by just leaving them directly in the XML. Indeed, the main difference of such an act would be making it more complicated to strip out disliked characters to make the XML fully compliant.
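A quick way to see this for yourself, assuming Python's standard-library parser (which is built on expat) as a stand-in for "a compliant XML 1.0 parser"; the helper name `parses` and the sample documents are made up:

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the strict parser accepts xml_text, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw_form_feed = parses("<node>a\x0cb</node>")      # form-feed left in directly
char_reference = parses("<node>a&#12;b</node>")    # form-feed as a character reference
stripped = parses("<node>ab</node>")               # form-feed removed

print(raw_form_feed, char_reference, stripped)
```

Both the raw character and the character reference get rejected; only the stripped version parses.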