Re: Funny characters in nodes (exactly zero)

by tye (Sage)
on Jul 08, 2007 at 22:28 UTC (#625541)

in reply to Funny characters in nodes

The XML standard is stupidly broken because the designers made proclamations like Tim Bray's: "XML dislikes [...] form-feed[s] [etc.] which have exactly zero shared semantics from system to system". Yes, we all know that no two people in the world ever used form-feed for the same thing. I can't even guess what anybody else would use it for. But surely not to represent a page break, since that is my personal use for it and there is exactly zero shared semantics for that character so nobody else uses it for that.

And the XML mindset of "we need the standard to require fatal errors for things that we dislike that others will surely see value in; otherwise, people will actually make XML useful by doing things we don't like" meant that XML 1.0 very thoroughly made sure that there was no reasonable way to get a form-feed character sent.

So the only choices you have when you have data containing a form-feed character are

  1. Ignore that one part of the XML 1.0 standard and send the form-feed character anyway
  2. Strip any characters that Tim Bray doesn't personally like and hope that they weren't important
  3. Come up with some proprietary way of encoding data into characters that Tim Bray doesn't dislike and force anyone consuming your data to read the XML standard and your personal "this is how to decode my characters" specification

Not surprisingly (to me, anyway), many XML parsers have actually chosen the first option above, and the draft XML 1.1 standard even sees the light, except in the case of NUL characters (which we should be able to send as &#0;, but I doubt even XML 1.1 will overcome previous stupidity to that extent).

So, to the horror or severe disappointment of some people, PerlMonks XML generation also defaults to option 1 above. This needs to be changed but nobody has ponied up the code to make option 2 the default instead, so it must not be too big of a deal. Certainly, stripping control characters out of the XML from PerlMonks is quite simple and then allows any compliant XML parser to be used on it.

So just do that (strip them). Or, if you want to preserve control characters despite Tim's dislike, come up with your own private encoding, encode them, parse the XML, then decode them. Or find a more tolerant near-XML parser.
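For what it's worth, stripping can be a one-liner with tr///. This is only a sketch: the helper name strip_xml10_controls is made up for illustration, and it assumes the only troublemakers are the C0 control characters that XML 1.0 excludes (everything below 0x20 except tab, line feed and carriage return).

```perl
use strict;
use warnings;

# Hypothetical helper: delete the C0 control characters that XML 1.0
# forbids. Tab (0x09), line feed (0x0A) and carriage return (0x0D) are
# legal in XML 1.0 and are kept.
sub strip_xml10_controls {
    my ($xml) = @_;
    $xml =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $xml;
}

# Example data (made up): a form-feed between two "pages".
my $raw   = "<doc>page one\x0Cpage two\n</doc>";
my $clean = strip_xml10_controls($raw);
print $clean;
```

Run this on the raw XML text before handing it to the parser; whatever the stripped characters meant is lost, which is the price of option 2.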

Update: Just for the sake of completeness, I should mention that encoding each form-feed character as a character reference such as &#12; is an interesting-sounding option, but it is also forbidden by the XML 1.0 standard and so has no advantages over violating the standard more simply by just leaving them directly in the XML. Indeed, the main difference of such an act would be making it more complicated to strip out disliked characters to make the XML fully compliant.
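To illustrate the extra complication: if forbidden characters show up as numeric character references, the stripping code has to understand both the decimal and the hex form. A rough sketch, assuming the same forbidden set as XML 1.0's C0 controls; the helper name strip_forbidden_char_refs is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: remove numeric character references (decimal or
# hex) that point at control characters XML 1.0 forbids. All other
# references are left untouched.
sub strip_forbidden_char_refs {
    my ($xml) = @_;
    # $1 = whole reference, $2 = hex digits (if any), $3 = decimal digits (if any)
    $xml =~ s{(&#(?:x([0-9A-Fa-f]+)|([0-9]+));)}{
        my $code = defined($2) ? hex($2) : $3;
        ( $code < 0x20 && $code != 0x09 && $code != 0x0A && $code != 0x0D )
            ? ''
            : $1;
    }ge;
    return $xml;
}

# Example data (made up): two spellings of form-feed plus a legal reference.
print strip_forbidden_char_refs("one&#12;two&#x0C;three&#10;"), "\n";
```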

- tye        

Re^2: Funny characters in nodes (exactly zero)
by dmitri (Priest) on Jul 08, 2007 at 22:42 UTC
    Most of the characters that caused problems that I looked at can be safely ignored. They are not just linefeeds, however. What I'm afraid of is that they may be multi-byte characters that make sense in another character set (especially since PerlMonks uses Latin-1 and not UTF-8).

      Then do option 2 or 3. Option 3 is pretty simple:

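      # Encode: double each backslash; write each protected character as \XX (two hex digits).
      # "[...]" stands for the character class of the characters you need to protect.
      s/(\\)|([...])/ $1 ? "\\\\" : sprintf "\\%02X", ord $2 /ge;
      my @elements= parseXML();
      # Decode each parsed element: \\ becomes \ and \XX becomes the original character.
      s/\\(\\|..)/ length $1 == 1 ? "\\" : chr hex $1 /ge for @elements;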
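      If you want to check the round trip without a parser in the middle, here is a self-contained sketch of the same escaping idea. The character class (XML 1.0's forbidden C0 controls) and the helper names encode_controls and decode_controls are assumptions made for the example, not anything PerlMonks provides.

```perl
use strict;
use warnings;

# Assumption: the characters to protect are the C0 controls that
# XML 1.0 forbids. Adjust the class to suit your data.
my $forbidden = qr/[\x00-\x08\x0B\x0C\x0E-\x1F]/;

# Hypothetical helper: double each backslash and write each protected
# character as a backslash followed by two uppercase hex digits.
sub encode_controls {
    my ($text) = @_;
    $text =~ s/(\\)|($forbidden)/ $1 ? "\\\\" : sprintf "\\%02X", ord $2 /ge;
    return $text;
}

# Hypothetical helper: reverse encode_controls.
sub decode_controls {
    my ($text) = @_;
    $text =~ s/\\(\\|[0-9A-F]{2})/ $1 eq "\\" ? "\\" : chr hex $1 /ge;
    return $text;
}

# Example data (made up): a literal backslash plus a form-feed.
my $original = "path C:\\tmp\x0Cnext page";
my $encoded  = encode_controls($original);
my $decoded  = decode_controls($encoded);
print $decoded eq $original ? "round trip ok\n" : "round trip FAILED\n";
```

      Doubling the backslash first is what keeps a literal backslash followed by two hex digits in the data from being mistaken for an escape on the way back.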

      - tye        
