Re^3: XML::Fling begone? (ctrl, utf-8)

Re^4: XML::Fling begone? (ctrl, utf-8) by Aristotle (Chancellor) on Dec 20, 2004 at 01:36 UTC
Ah. No, CDATA apparently doesn't help, and it would be useless with attributes anyway. And indeed, Genx refuses to put control characters in the output stream. However, I found that an AddText call with an empty string seems to consistently coerce Genx into flushing whatever it has buffered. At that point I can sneak anything I want into the stream before I resume business as usual: `use XML::Genx; my $str = ''; my $w = XML::Genx->new; eval { $w->StartDocSender( sub { $str .= shift; } ); $w->StartElementLiteral( 'foo' ); $w->AddText( 'bar' ); $w->AddText( '' ); $str .= chr 1; $w->AddText( 'baz' ); $w->EndElement; $w->EndDocument; }; die "Writing XML failed: $@" if $@;` This works as expected and allows sending control characters in node content, though not in attributes. Another alternative, whose viability I can't judge, is that Genx provides a genxScrubText call which simply brushes out anything illegal. Would that be acceptable? (I do wonder why we need to make it possible to send control characters in the tickers.) The problem there is that XML::Genx currently doesn't bind that function. As for XML style, I agree that attributes should be avoided. I didn't understand this when I first learned of XML, but I've come to appreciate why it is common wisdom among more insightful people. Mixed content is also a pain when you're dealing with structured rather than “document-ish” data. Makeshifts last the longest.
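To make the scrubbing option concrete without touching Genx at all, here is a minimal pure-Perl sketch. The `scrub_text` helper is hypothetical, not part of XML::Genx, and it only approximates what a scrubber has to do: it drops every code point outside the XML 1.0 `Char` production. Whether genxScrubText behaves identically in every detail is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::Genx function): remove every code point
# that the XML 1.0 Char production does not allow.
# Allowed: tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
sub scrub_text {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

# The chr(1) between "bar" and "baz" is silently dropped.
print scrub_text("bar\x{01}baz"), "\n";
```

The result could then be handed to AddText as usual, at the cost of losing the raw data.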
Re^5: XML::Fling begone? (ctrl, utf-8) by tye (Sage) on Dec 22, 2004 at 16:40 UTC
That's a lot of thrash. I don't remember, but was there a problem you were trying to solve? I want control characters in XML because nothing prevents people from submitting control characters over HTTP. When they do, they are likely not to get what they wanted, so it can be helpful to see what they actually submitted instead of the limited subset that XML deigns to support. The XML version of nodes (etc.) is supposed to give one the raw data. Having it throw away any of the raw data without a really good reason just leads to it not being trustworthy, to other means having to be invented and used, to people forgetting to use them, and yuck. - tye
Re^6: XML::Fling begone? (ctrl, utf-8) by Aristotle (Chancellor) on Dec 25, 2004 at 10:21 UTC
There is no outright problem, but better performance for the ticker generation won't hurt. Additionally, while you might not think much of guaranteed compliance, there are parsers around which will complain about control characters, as they should. I use one of those: XML::LibXML (which I cannot praise highly enough). I've had trouble with my NN client because of control characters in the ticker once or twice. The JavaScript chatterbox client I wrote necessarily relies on the browser's parser, which means breakage on control characters in the stream, at least in Mozilla-based browsers. I don't know how non-compliant MS' parser is in this instance. Here are my thoughts. Readers of the site are not interested in debugging faulty posts. In the common case, scrubbing the text should be just fine. XML::Genx does not currently bind the appropriate function; I am considering writing a patch. For debugging purposes, there might be a `textscrub=0` parameter. Even then the output wouldn't contain illegal characters; instead, it would wrap them in `<char ord="##"/>` elements. It's likely a human is going to be looking at the XML source directly in those cases anyway, so interpretation shouldn't be an issue. The code to do that efficiently would be something like `eval { $w->AddText( $text ) }; if( $@ ) { for( map ord, split //, $text ) { eval { $w->AddCharacter( $_ ) }; if( $@ ) { $w->StartElementLiteral( 'char' ); $w->AddAttributeLiteral( '', ord => $_ ); $w->EndElement(); } } }` It's a mouthful, but because it relies on exceptions, it will only rarely need to fall through to the hard parts. It might also be an option to forgo the `textscrub=0` business altogether and do this for everyone, though old clients might be more confused by these newfangled `char` elements popping up occasionally than they would have been by invalid XML.
Obviously this scheme is no help for attribute values, but as discussed before, we will eventually be using (nearly) attribute-free markup anyway. In particular, no user data would appear in attribute values. In that case, the consideration is what to do about old clients which do not understand the new markup format; I believe there's good reason to continue supporting them for a while, but I don't think it's a good idea to submit forever to the boundaries created by old mistakes. Deprecating the old-style ticker markup and giving people due notice of maybe six months before discontinuing support should be sufficient. (There's precedent with the private message ticker, too.) Makeshifts last the longest.
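For anyone who wants to see what the `char` element scheme would produce without installing Genx, here is a rough pure-Perl sketch. The `text_with_char_elements` helper is hypothetical and builds element content as a plain string; it is not the Genx-based code above, just an illustration of the intended output under the assumption that `ord` carries the decimal code point.

```perl
use strict;
use warnings;

# Markup characters that must be escaped in element content.
my %ESCAPE = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );

# Hypothetical helper: escape markup characters, then replace each code
# point that XML 1.0 forbids with an empty <char ord="N"/> element,
# where N is the decimal code point.
sub text_with_char_elements {
    my ($text) = @_;
    # Escape first, so the markup inserted below is left alone.
    $text =~ s/([&<>])/$ESCAPE{$1}/g;
    $text =~ s{([^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])}
              {'<char ord="' . ord($1) . '"/>'}ge;
    return $text;
}

print text_with_char_elements("bar\x{01}baz & co"), "\n";
# prints: bar<char ord="1"/>baz &amp; co
```

A client that wants the raw data back only has to map each `char` element to `chr` of its `ord` attribute.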
Re^7: XML::Fling begone? (goals) by tye (Sage) on Jan 01, 2005 at 05:41 UTC
Re^8: XML::Fling begone? (goals) by Aristotle (Chancellor) on Jan 01, 2005 at 12:57 UTC
Re^4: XML::Fling begone? (ctrl, utf-8) by Aristotle (Chancellor) on Dec 31, 2004 at 15:08 UTC
FYI, here's Tim Bray's explanation of the reasoning for that: "The only characters that XML dislikes are ASCII `C0` control characters such as form-feed, vertical-tab, and those wonderful things like `EOT` and `DLE` and `NAK` and `SYN`, which have exactly zero shared semantics from system to system; which is exactly why they're not in XML." Update: just to be clear, I am neither supporting the argument nor rejecting it. My only actual experience is limited to systems with very little variation: Unix vs Windows on the same hardware platform. I haven't even worked on MacOS X. So I don't know enough to make any argument here. Makeshifts last the longest.
Re^5: XML::Fling begone? (shared semantics) by tye (Sage) on Jan 01, 2005 at 05:55 UTC
Gee, some people use form feed for different things. We should be sure to prevent them from sending form feeds to each other. We'll save the world so much confusion. We'll be heroes. "Bob, that data you sent in XML needs a page break in the middle." "We use a form feed for that, Jim." "Gosh, so do we." I'm glad the designers of XML 1.1 appear to be a bit more clueful. Perhaps if Tim Bray had heard of an obscure thing called "ASCII" he might not produce such whoppers as "exactly zero shared semantics from system to system". And even if such were true, it'd still be a lousy reason to disallow them -- perhaps XML should require all tag names to exist in the Esperanto dictionary since most words have zero shared semantics from language to language. Sheesh. - tye
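For reference on the XML 1.1 point: that version allows the C0 and C1 control characters (U+0000 excepted) in a document, but only as numeric character references, and the document has to declare `version="1.1"`. Below is a minimal sketch of that escaping; the `escape_restricted_xml11` helper name is made up for illustration, and whether any given ticker client's parser accepts XML 1.1 is a separate question.

```perl
use strict;
use warnings;

# Hypothetical helper: turn each XML 1.1 RestrictedChar
# (U+0001-U+0008, U+000B, U+000C, U+000E-U+001F, U+007F-U+0084,
# U+0086-U+009F) into a hexadecimal character reference.
# U+0000 is not handled here; XML 1.1 still forbids it outright.
sub escape_restricted_xml11 {
    my ($text) = @_;
    $text =~ s/([\x{1}-\x{8}\x{B}\x{C}\x{E}-\x{1F}\x{7F}-\x{84}\x{86}-\x{9F}])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

print escape_restricted_xml11("page one\x{0C}page two"), "\n";
# prints: page one&#xC;page two
```

So under 1.1 Bob's form feed can reach Jim intact, as long as both ends speak 1.1.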

