There is no outright problem, but better performance for the ticker generation won't hurt.
Additionally, while you might not think so much of guaranteed compliance, there are parsers around which will complain about control characters as they should. I use one of those: XML::LibXML (which I cannot praise highly enough). I've had trouble with my NN client because of control characters in the ticker once or twice.
The JavaScript chatterbox client I wrote necessarily relies on the browser's parser, which means it breaks on control characters in the stream, at least in the case of Mozilla-based browsers. I don't know how non-compliant MS' parser is in this instance.
Here are my thoughts.
Readers of the site are not interested in debugging faulty posts. In the common case, scrubbing the text should be just fine. XML::Genx does not currently bind the appropriate function; I am considering writing a patch.
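Until such a binding exists, the scrubbing itself could be approximated in plain Perl. This is only a sketch: scrub_text is a hypothetical helper name, not anything from XML::Genx, and it assumes $text holds decoded characters rather than raw bytes. The character class is the Char production from the XML 1.0 spec.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Genx): drop every character
# that the XML 1.0 Char production does not allow.
# Assumes $text is a decoded character string, not raw bytes.
sub scrub_text {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $text;
}

print scrub_text("foo\x{01}bar\x{0B}baz\n");    # prints "foobarbaz"
```

Tab, newline and carriage return survive; every other C0 control character is silently removed.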
For debugging purposes, there might be a textscrub=0 parameter. It still wouldn't produce illegal characters; instead, it would wrap them in <char ord="##"/> elements. It's likely a human is going to be looking at the XML source directly in those cases anyway, so interpretation shouldn't be an issue.
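The code to do that efficiently would be something like:
    # Fast path: try to emit the whole string in one go.
    eval { $w->AddText( $text ) };
    if( $@ ) {
        # Slow path: the text contains something Genx rejects,
        # so feed it one code point at a time.
        for( map ord, split //, $text ) {
            eval { $w->AddCharacter( $_ ) };
            if( $@ ) {
                # Illegal character: emit <char ord="N"/> in its place.
                $w->StartElementLiteral( 'char' );
                $w->AddAttributeLiteral( '', ord => $_ );
                $w->EndElement();
            }
        }
    }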
It's a mouthful, but because it relies on exceptions, it will only rarely need to fall through to the hard parts.
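To make the intended output concrete, here is a rough sketch of the same idea without XML::Genx, building a string by hand. The function name text_to_xml and the hand-rolled escaping are my own assumptions for illustration; the real ticker code would go through the writer object as above.

```perl
use strict;
use warnings;

# Hypothetical illustration (no XML::Genx involved): turn a decoded
# character string into XML text content, replacing each character
# that XML 1.0 forbids with a <char ord="N"/> element.
sub text_to_xml {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
    my $legal  = qr/[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;
    my $out    = '';
    for my $char ( split //, $text ) {
        if    ( $char !~ $legal )         { $out .= sprintf '<char ord="%d"/>', ord $char }
        elsif ( exists $entity{$char} )   { $out .= $entity{$char} }
        else                              { $out .= $char }
    }
    return $out;
}

print text_to_xml("a\x{01}b & c"), "\n";    # prints: a<char ord="1"/>b &amp; c
```

Unlike the exception-based version, this one inspects every character every time, so it shows the output format rather than the efficient way to produce it.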
It might also be an option to forgo the textscrub=0 business altogether and do this for everyone, though old clients might be more confused by these newfangled char elements popping up occasionally than they would have been by invalid XML.
Obviously this scheme is no help for attribute values, but as discussed before, we will eventually be using (nearly) attribute-free markup anyway. In particular, no user data would appear in attribute values. In that case, the consideration is what to do about old clients which do not understand the new markup format; I believe there's good reason to continue supporting them for a while, but I don't think it's a good idea to submit to the boundaries created by old mistakes forever. Deprecating the old-style ticker markup and giving people due notice of maybe six months before discontinuing support should be sufficient. (There's precedent with the private message ticker, too.)
Makeshifts last the longest.