PerlMonks  

Re^3: XML::Fling begone? (ctrl, utf-8)

by tye (Sage)
on Dec 19, 2004 at 23:40 UTC ( [id://416082] )


in reply to Re^2: XML::Fling begone? (ctrl, utf-8)
in thread XML::Fling begone?

No, the XML 1.0 spec declares that non-whitespace control characters (with or without the eighth bit set) are illegal in XML and entities for illegal characters are illegal.
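For reference, here is a minimal sketch of that rule in Perl, assuming the Char production of XML 1.0 (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The helper name illegal_xml10_chars is made up for illustration, and the input is assumed to be an already decoded character string. Note that this production, read literally, rejects the C0 controls, surrogates, U+FFFE and U+FFFF; code points 0x80-0x9F pass it.

```perl
use strict;
use warnings;

# Negated character class built from the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything that matches may not appear in an XML 1.0 document.
my $ILLEGAL_XML10 =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: returns the code points of every illegal character
# found in $text, in order of appearance (empty list if the text is clean).
sub illegal_xml10_chars {
    my ($text) = @_;
    return map { ord($_) } $text =~ /($ILLEGAL_XML10)/g;
}
```

Calling illegal_xml10_chars("a\x01b\tc\x0C") should report code points 1 and 12, while the tab is left alone. Writing the offenders as character references such as &#1; does not help, since references to illegal characters are themselves not well-formed.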

I don't know if CDATA removes this restriction. I'd think it would, but after being surprised by the control-character stupidity and seeing many XML near-experts also boggled by it, I won't speculate w/o reading the spec first.

Of course, if you prefer the attribute-heavy style of XML, then CDATA won't be any help (I say w/o verifying this assumption but I'd nearly bet money on it).

I feel PerlMonks' XML should be nearly or completely attribute-free. But that isn't much help since we already have a heavy base of ticker clients that don't handle CDATA.

So when I said that control characters are a problem, I wasn't so XML-naive as to not have considered entities and CDATA.

- tye        

Re^4: XML::Fling begone? (ctrl, utf-8)
by Aristotle (Chancellor) on Dec 20, 2004 at 01:36 UTC

    Ah. No, CDATA apparently doesn't help, and it would indeed be useless with attributes.

    And indeed, Genx refuses to put control characters in the output stream.

    However, I found that an AddText call with an empty string seems to consistently coerce Genx to flush whatever it had in buffer. At that point I can sneak anything I want into the stream before I resume business as usual:

    use XML::Genx;

    my $str = '';
    my $w = XML::Genx->new;
    eval {
        $w->StartDocSender( sub { $str .= shift; } );
        $w->StartElementLiteral( 'foo' );
        $w->AddText( 'bar' );
        $w->AddText( '' );    # empty text seems to make Genx flush its buffer
        $str .= chr 1;        # sneak the raw control character into the stream
        $w->AddText( 'baz' );
        $w->EndElement;
        $w->EndDocument;
    };
    die "Writing XML failed: $@" if $@;

    This works as expected and allows sending control characters in node content, though not in attributes.

    Another alternative whose viability I can't tell is that Genx provides a genxScrubText call which simply brushes out anything illegal. Would that be acceptable? (I do wonder why we need to make it possible to send control characters in the tickers.) However, the problem here is that XML::Genx currently doesn't bind that function.
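Until that function is bound, the same effect can be approximated in plain Perl. The following is only a sketch of the scrubbing idea, not Genx's implementation: scrub_xml10 is a made-up name, it assumes the XML 1.0 Char production as the definition of "legal", and it assumes the input is already a decoded character string (it does nothing about malformed UTF-8 bytes).

```perl
use strict;
use warnings;

# Hypothetical stand-in for the scrubbing step: drop every character that
# the XML 1.0 "Char" production does not allow and keep the rest untouched.
sub scrub_xml10 {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}
```

With this, scrub_xml10("foo\x01bar") yields "foobar", and the result can be handed to AddText without tripping Genx's checks. The obvious cost is that the reader can no longer tell that anything was removed.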

    As for XML style, I agree that attributes should be avoided. I didn't understand this when I first learned of XML, but I've come to appreciate why it is common wisdom among more insightful people. Mixed content is also a pain when you're dealing with structured rather than “document-ish” data.

    Makeshifts last the longest.

      That's a lot of thrash. I don't remember, but was there a problem you were trying to solve?

      I want control characters in XML because there is nothing preventing people from submitting control characters in their HTTP. When they do that, they are likely to not get what they wanted, so it can be helpful to see what they actually submitted instead of the limited subset that XML deigns to support.

      The XML version of nodes (etc.) is supposed to give one the raw data. Having it throw away any of the raw data without a really good reason just leads to it not being trustworthy and other means having to be invented and used and forgetting to use them and yuck.

      - tye        

        There is no outright problem, but better performance for the ticker generation won't hurt.

        Additionally, while you might not think so much of guaranteed compliance, there are parsers around which will complain about control characters as they should. I use one of those: XML::LibXML (which I cannot praise highly enough). I've had trouble with my NN client because of control characters in the ticker once or twice.

        The JavaScript chatterbox client I wrote necessarily relies on the browser's parser, which means breakage on control characters in the stream, at least in the case of Mozilla-based browsers. I don't know how non-compliant MS' parser is in this instance.

        Here are my thoughts.

        Readers of the site are not interested in debugging faulty posts. In the common case, scrubbing the text should be just fine. XML::Genx does not currently bind the appropriate function; I am considering writing a patch.

        For debugging purposes, there might be a textscrub=0 parameter. It still wouldn't produce illegal characters; instead it would wrap them in <char ord="##"/> elements. It's likely a human is going to be looking at the XML source directly in those cases anyway, so interpretation shouldn't be an issue.

        The code to do that efficiently would be something like

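        eval { $w->AddText( $text ) };
        if( $@ ) {
            # AddText rejected the string: fall back to one character at a time
            for( map ord, split //, $text ) {
                eval { $w->AddCharacter( $_ ) };
                if( $@ ) {
                    # illegal code point: emit <char ord="N"/> in its place
                    $w->StartElementLiteral( 'char' );
                    $w->AddAttributeLiteral( '', ord => $_ );
                    $w->EndElement();
                }
            }
        }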

        It's a mouthful, but due to reliance on exceptions will only rarely ever need to fall through to the hard parts.
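To make the intended output concrete, here is a Genx-free sketch of the same transformation on a plain string. Everything in it is an assumption for illustration: the name text_with_char_elements is made up, the ord value is written in decimal (as the loop above would do), "illegal" means outside the XML 1.0 Char production, and the input is assumed to be a decoded character string.

```perl
use strict;
use warnings;

my %ENTITY = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );

# Hypothetical helper: turns raw text into XML 1.0 element content. Markup
# characters are escaped as usual; each illegal character is replaced by an
# empty <char ord="N"/> element, where N is its decimal code point.
sub text_with_char_elements {
    my ($text) = @_;
    $text =~ s{([&<>])|([^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])}
              { defined $1 ? $ENTITY{$1} : sprintf '<char ord="%d"/>', ord $2 }ge;
    return $text;
}
```

For the input "a<b\x01c" this gives a&lt;b<char ord="1"/>c, which is what a client would see in node content with textscrub=0. Because it produces markup rather than text, the result is only usable in element content, never in attribute values.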

        It might also be an option to forgo the textscrub=0 business altogether and do this for everyone, though older clients might be more confused by these newfangled char elements popping up occasionally than they would have been with invalid XML.

        Obviously this scheme is no help for attribute values, but as discussed before, we will eventually be using (nearly) attribute-free markup anyway. In particular, no user data would appear in attribute values. In that case, the consideration is what to do about old clients which do not understand the new markup format; I believe there's good reason to continue supporting them for a while, but I don't think it's a good idea to submit to the boundaries created by old mistakes forever. Deprecating the old-style ticker markup and giving people due notice of maybe six months before discontinuing support should be sufficient. (There's precedent with the private message ticker, too.)

        Makeshifts last the longest.

Re^4: XML::Fling begone? (ctrl, utf-8)
by Aristotle (Chancellor) on Dec 31, 2004 at 15:08 UTC

    FYI, here's Tim Bray's explanation of the reasoning for that:

    The only characters that XML dislikes are ASCII C0 control characters such as form-feed, vertical-tab, and those wonderful things like EOT and DLE and NAK and SYN, which have exactly zero shared semantics from system to system; which is exactly why they're not in XML.

    Update: just to be clear, I am not supporting the argument — nor rejecting it. My only actual experience is limited to systems with very little variation: Unix vs Windows on the same hardware platform. I haven't even worked on MacOS X. So I don't know enough to make any argument here.

    Makeshifts last the longest.

      Gee, some people use form feed for different things. We should be sure to prevent them from sending form feeds to each other. We'll save the world so much confusion. We'll be heroes.

      Bob, that data you sent in XML needs a page break in the middle.

      We use a form feed for that, Jim.

      Gosh, so do we.

      I'm glad the designers of XML 1.1 appear to be a bit more clueful.

      Perhaps if Tim Bray had heard of an obscure thing called "ASCII" he might not produce such whoppers as "exactly zero shared semantics from system to system". And even if such were true, it'd still be a lousy reason to disallow them -- perhaps XML should require all tag names to exist in the Esperanto dictionary since most words have zero shared semantics from language to language. Sheesh.

      - tye        
