PerlMonks  

Re^5: XML::Fling begone? (ctrl, utf-8)

by tye (Sage)
on Dec 22, 2004 at 16:40 UTC ( [id://416817]=note )


in reply to Re^4: XML::Fling begone? (ctrl, utf-8)
in thread XML::Fling begone?

That's a lot of thrash. I don't remember, but was there a problem you were trying to solve?

I want control characters in XML because nothing prevents people from submitting control characters in their HTTP requests. When they do, they are unlikely to get what they wanted, so it can be helpful to see what they actually submitted instead of the limited subset that XML deigns to support.

The XML version of nodes (etc.) is supposed to give one the raw data. Having it throw away any of the raw data without a really good reason just leads to it not being trustworthy and other means having to be invented and used and forgetting to use them and yuck.

- tye        

Replies are listed 'Best First'.
Re^6: XML::Fling begone? (ctrl, utf-8)
by Aristotle (Chancellor) on Dec 25, 2004 at 10:21 UTC

    There is no outright problem, but better performance for the ticker generation won't hurt.

    Additionally, while you might not think so much of guaranteed compliance, there are parsers around which will complain about control characters as they should. I use one of those: XML::LibXML (which I cannot praise highly enough). I've had trouble with my NN client because of control characters in the ticker once or twice.

    The JavaScript chatterbox client I wrote necessarily relies on the browser's parser, which means breakage on control characters in the stream, at least in Mozilla-based browsers. I don't know how non-compliant MS' parser is in this instance.

    Here are my thoughts.

    Readers of the site are not interested in debugging faulty posts. In the common case, scrubbing the text should be just fine. XML::Genx does not currently bind the appropriate function; I am considering writing a patch.

    For debugging purposes, there might be a textscrub=0 parameter. It still wouldn't produce illegal characters; instead it would wrap them in <char ord="##"/> elements. It's likely a human is going to be looking at the XML source directly in those cases anyway, so interpretation shouldn't be an issue.

    The code to do that efficiently would be something like

    eval { $w->AddText( $text ) };
    if( $@ ) {
        for( map ord, split //, $text ) {
            eval { $w->AddCharacter( $_ ) };
            if( $@ ) {
                $w->StartElementLiteral( 'char' );
                $w->AddAttributeLiteral( '', ord => $_ );
                $w->EndElement();
            }
        }
    }

    It's a mouthful, but because it relies on exceptions, it will only rarely need to fall through to the hard parts.
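
To show what the proposed markup would look like on the wire, here is a minimal sketch that builds the same output as a plain string, without any XML generator. The helper name wrap_controls is hypothetical and not part of XML::Genx or XML::Fling; it only covers the C0 control characters that XML 1.0 forbids.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of XML::Genx or XML::Fling.
# Escapes markup characters, then replaces each control character that
# XML 1.0 forbids with an empty <char ord="N"/> element.
sub wrap_controls {
    my ($text) = @_;
    my %esc = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
    $text =~ s{([&<>])}{$esc{$1}}g;
    $text =~ s{([\x00-\x08\x0B\x0C\x0E-\x1F])}{ sprintf '<char ord="%d"/>', ord $1 }ge;
    return $text;
}

print wrap_controls("bell\x07 & more"), "\n";
# prints: bell<char ord="7"/> &amp; more
```

Note that a single text field now comes back from the parser as a mix of text nodes and elements, which a client has to stitch together itself.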

    It might also be an option to forgo the textscrub=0 business altogether and do this for everyone, though older clients might be more confused by these newfangled char elements popping up occasionally than they would have been with invalid XML.

    Obviously this scheme is no help for attribute values, but as discussed before, we will eventually be using (nearly) attribute-free markup anyway. In particular, no user data would appear in attribute values. In that case, the consideration is what to do about old clients which do not understand the new markup format; I believe there's good reason to continue supporting them for a while, but I don't think it's a good idea to submit to the boundaries created by old mistakes forever. Deprecating the old-style ticker markup and giving people due notice of maybe six months before discontinuing support should be sufficient. (There's precedent with the private message ticker, too.)

    Makeshifts last the longest.

      Additionally, while you might not think so much of guaranteed compliance,

      I'll guess that the "guarantee" is that no matter how stupidly you try to use Genx, it refuses to produce invalid XML (in so far as the code is bug free in specification as well as implementation).

      Other than control characters, in what ways are we currently non-compliant? I'm not aware of any gain to be had there (I'll get to control characters shortly).

      So if someone decides to do something really stupid, then Genx will guarantee the output will be either empty or compliant, really stupid XML.

      Standards are great because of the benefits they provide. So standards compliance is a secondary goal, one that you shoot for because it facilitates many primary goals (mostly flavors of interoperability). Getting most or all of the benefits that are supposed to come with compliance is the primary goal. Putting a secondary goal ahead of your primary goals is a common mistake I see and that I try to avoid.

      there are parsers around which will complain about control characters as they should.

      I said as much. The last time this came up I proposed that we default to stripping control characters and have an option to request XML 1.1 which would preserve control characters.
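
A minimal sketch of that proposal, assuming a caller-supplied version switch: strip control characters by default, and keep them as character references only when XML 1.1 is requested. The function name encode_text and its $version parameter are hypothetical. The sketch relies on XML 1.1 allowing U+0001 through U+001F only as character references and on NUL being forbidden in both versions; the document would also need to declare version="1.1" itself.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; name and $version parameter are assumptions.
sub encode_text {
    my ($text, $version) = @_;
    $version = '1.0' unless defined $version;

    my %esc = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
    $text =~ s{([&<>])}{$esc{$1}}g;

    # NUL cannot be represented in XML 1.0 or XML 1.1, so it is always dropped.
    $text =~ tr/\x00//d;

    if ($version eq '1.1') {
        # XML 1.1: these characters are legal only as character references.
        $text =~ s{([\x01-\x08\x0B\x0C\x0E-\x1F])}{ '&#' . ord($1) . ';' }ge;
    }
    else {
        # XML 1.0 default: strip them.
        $text =~ tr/\x01-\x08\x0B\x0C\x0E-\x1F//d;
    }
    return $text;
}

print encode_text("a\x07b"), "\n";          # prints: ab
print encode_text("a\x07b", '1.1'), "\n";   # prints: a&#7;b
```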

      Part of the reason that I think that the default should strip control characters is because being XML compliant is important. I do consider compliance important, I just don't blindly put it ahead of reaping real benefits.

      Well, XML 1.1 probably hasn't received final approval yet so Genx surely refuses to produce it.

      <char ord="##"/>

      Genx will probably allow such to be output since it is a good example of compliant, stupid XML. (: I'm certainly not aware of any XML parsers that will translate that back into the proper characters. You've broken single fields into multiple pieces such that they are a pain to put back together. You lost the primary goal by concentrating on a secondary goal.

      So I'm still not sure what goals you have here. In some ways, your goals appear to be "use Genx" and "ditch XML::Fling". Useless goals. If one of your goals is compliance, then patch things to strip control characters by default (if that still isn't the case).

      If your goal is performance gain then please at least demonstrate one instead of guessing that there is one. But currently I doubt that would be enough to overcome the drawbacks of Genx feature-wise.

      I'd like to offer UTF-8 XML but Latin-1 has advantages in some cases so I don't want to stop offering it, especially since it is what we've produced for so long.

      If a future version of Genx supports Latin-1 and XML 1.1, then it might make a good replacement.

      As things stand, if you have some strong desire to use Genx, then you'd need to make Genx just an option, while still supporting our current methods.

      - tye        

        I'll guess that the "guarantee" is that no matter how stupidly you try to use Genx, it refuses to produce invalid XML

        Indeed.

        Okay, let's go back to the beginning and list my reasons for all this noise.

        We use XML::Fling, specifically written for PM (and E2?), because the less dumb XML generators are all too slow.

        We currently do not produce broken XML other than potentially inserting invalid characters, but there is no guarantee that the output remains well-formed when tickers are patched or newly created. I'll make an analogy with strict here.

        My thinking is: if something is faster yet than Fling, and additionally guarantees compliance, then it seems like a worthwhile option to pursue.

        You find fault, and I can see why, with the fact that so far I'm just talking about a performance improvement without investigating it. The problem with that is that we have not reached a consensus about control characters, and if Genx wouldn't even be considered due to these issues, it doesn't seem to make sense to invest actual effort (particularly as that would likely involve patching XML::Genx; the bindings are still very young).

        You've broken single fields into multiple pieces such that they are a pain to put back together.

        Yes, I know, and I was thinking about that even as I wrote it. I don't know if that means I lost sight of the primary goal. I'm trying to satisfy both primary and secondary goals at once. I just didn't see a better way at the time.

        But how about double-encoding, i.e. passing a NUL through as &amp;#0;? As a bonus, this would work for attribute values as well. And I think it's actually a pretty good idea since decoding it is simple (just put the string through an entity decoder a second time after you get it from the XML parser) and it doesn't require double-encoding any non-control characters other than the & itself. The scheme basically uses XML against itself to attain validity. :-)

        How does that sound?
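
The double-encoding idea can be sketched in plain Perl as a round trip. All three function names are hypothetical, and decode_once is only a stand-in for what an XML parser (first pass) and a client's entity decoder (second pass) would do; it handles just the references this sketch produces.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Layer 1: escape literal '&' first, then turn forbidden control characters
# into numeric references so they survive as ordinary text.
sub protect_controls {
    my ($text) = @_;
    $text =~ s{&}{&amp;}g;
    $text =~ s{([\x00-\x08\x0B\x0C\x0E-\x1F])}{ '&#' . ord($1) . ';' }ge;
    return $text;
}

# Layer 2: the usual XML text escaping that any generator applies.
sub xml_escape {
    my ($text) = @_;
    my %esc = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );
    $text =~ s{([&<>])}{$esc{$1}}g;
    return $text;
}

# Undoes exactly one layer of encoding.
my %named = ( amp => '&', lt => '<', gt => '>' );
sub decode_once {
    my ($text) = @_;
    $text =~ s{&(?:#(\d+)|(amp|lt|gt));}{ defined $1 ? chr($1) : $named{$2} }ge;
    return $text;
}

my $raw  = "a\x00b & c";
my $wire = xml_escape( protect_controls($raw) );
print "$wire\n";    # prints: a&amp;#0;b &amp;amp; c

my $back = decode_once( decode_once($wire) );
print $back eq $raw ? "round trip ok\n" : "round trip failed\n";
```

Clients that decode only once still receive well-formed XML; they simply see the reference text (such as &#0;) where the control character was.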

        Makeshifts last the longest.
