Re^2: XML::Fling begone? (ctrl, utf-8)

by Aristotle (Chancellor)
on Dec 19, 2004 at 21:30 UTC ( [id://416060]=note )


in reply to Re: XML::Fling begone? (ctrl, utf-8)
in thread XML::Fling begone?

Please elaborate on control characters. I have a vague recollection of hearing something like that before, but I can't pull out the specifics. And, handwaving the issue before I actually know what it is: is this something CDATA sections or entitification cannot fix in a generally compatible fashion?

Makeshifts last the longest.

Re^3: XML::Fling begone? (ctrl, utf-8)
by tye (Sage) on Dec 19, 2004 at 23:40 UTC

    No, the XML 1.0 spec declares that non-whitespace control characters (with or without the eighth bit set) are illegal in XML and entities for illegal characters are illegal.
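    To make the restriction concrete, here is a minimal sketch of a check against the Char production of XML 1.0 as I read it. The helper name has_illegal_xml10_chars is made up for illustration, and the sketch assumes the text has already been decoded to Perl characters:

```perl
use strict;
use warnings;

# Characters permitted by the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything matching this negated class is not allowed in a document.
my $XML10_ILLEGAL =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: returns 1 if $text contains a character
# that the Char production does not allow, 0 otherwise.
sub has_illegal_xml10_chars {
    my ($text) = @_;
    return $text =~ $XML10_ILLEGAL ? 1 : 0;
}

# chr(1) is a C0 control character; tab and newline are the
# whitespace exceptions.
my $bad  = has_illegal_xml10_chars( 'bar' . chr(1) . 'baz' );
my $fine = has_illegal_xml10_chars( "bar\tbaz\n" );
```

    Since a character reference to a forbidden character is forbidden too, a writer that finds such a character can only drop it, replace it, or refuse the text.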

    I don't know if CDATA removes this restriction. I'd think it would, but after being surprised by the control-character stupidity and seeing many XML near-experts also boggled by it, I won't speculate w/o reading the spec first.

    Of course, if you prefer the attribute-heavy style of XML, then CDATA won't be any help (I say w/o verifying this assumption but I'd nearly bet money on it).

    I feel PerlMonks' XML should be nearly or completely attribute-free. But that isn't much help since we already have a heavy base of ticker clients that don't handle CDATA.

    So when I said that control characters are a problem, I wasn't so XML-naive as to not have considered entities and CDATA.

    - tye        

      Ah. No, CDATA apparently doesn't help, and it would indeed be useless with attributes.

      And indeed, Genx refuses to put control characters in the output stream.

      However, I found that an AddText call with an empty string seems to consistently coerce Genx into flushing whatever it has in its buffer. At that point I can sneak anything I want into the stream before I resume business as usual:

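      use XML::Genx;

      my $str = '';
      my $w = XML::Genx->new;
      eval {
          # Collect everything Genx emits into $str.
          $w->StartDocSender( sub { $str .= shift; } );
          $w->StartElementLiteral( 'foo' );
          $w->AddText( 'bar' );
          # The empty AddText seems to make Genx flush its buffer ...
          $w->AddText( '' );
          # ... so the raw control character lands right after 'bar'.
          $str .= chr 1;
          $w->AddText( 'baz' );
          $w->EndElement;
          $w->EndDocument;
      };
      die "Writing XML failed: $@" if $@;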

      This works as expected and allows sending control characters in node content, though not in attributes.

      Another alternative, whose viability I can't judge, is that Genx provides a genxScrubText call which simply brushes out anything illegal. Would that be acceptable? (I do wonder why we need to make it possible to send control characters in the tickers.) However, the problem here is that XML::Genx currently doesn't bind that function.
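      Lacking a binding, the scrubbing idea can be approximated in plain Perl. This is only a sketch of the concept, not the Genx function itself: scrub_xml10_text is a made-up name, it works on decoded Perl characters rather than on UTF-8 bytes, and it silently drops whatever the XML 1.0 Char production doesn't allow:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the scrubbing idea, NOT a binding to
# genxScrubText: removes every character outside the XML 1.0 Char
# production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
# [#x10000-#x10FFFF]) and returns the cleaned copy.
sub scrub_xml10_text {
    my ($text) = @_;
    ( my $clean = $text )
        =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $clean;
}

# The control character between 'bar' and 'baz' is dropped.
my $scrubbed = scrub_xml10_text( 'bar' . chr(1) . 'baz' );   # 'barbaz'
```

      The cost is that the output no longer tells the reader anything was removed.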

      As for XML style, I agree that attributes should be avoided. I didn't understand this when I first learned of XML, but I've come to appreciate why it is common wisdom among more insightful people. Mixed content is also a pain when you're dealing with structured rather than “document-ish” data.

      Makeshifts last the longest.

        That's a lot of thrash. I don't remember, but was there a problem you were trying to solve?

        I want control characters in XML because there is nothing preventing people from submitting control characters in their HTTP, and when they do that they are likely to not get what they wanted, so it can be helpful to see what they actually submitted instead of the limited subset that XML deigns to support.

        The XML version of nodes (etc.) is supposed to give one the raw data. Having it throw away any of the raw data without a really good reason just leads to it not being trustworthy and other means having to be invented and used and forgetting to use them and yuck.

        - tye        

      FYI, here's Tim Bray's explanation of the reasoning for that:

      The only characters that XML dislikes are ASCII C0 control characters such as form-feed, vertical-tab, and those wonderful things like EOT and DLE and NAK and SYN, which have exactly zero shared semantics from system to system; which is exactly why they're not in XML.

      Update: just to be clear, I am not supporting the argument — nor rejecting it. My only actual experience is limited to systems with very little variation: Unix vs Windows on the same hardware platform. I haven't even worked on MacOS X. So I don't know enough to make any argument here.

      Makeshifts last the longest.

        Gee, some people use form feed for different things. We should be sure to prevent them from sending form feeds to each other. We'll save the world so much confusion. We'll be heroes.

        Bob, that data you sent in XML needs a page break in the middle.

        We use a form feed for that, Jim.

        Gosh, so do we.

        I'm glad the designers of XML 1.1 appear to be a bit more clueful.

        Perhaps if Tim Bray had heard of an obscure thing called "ASCII" he might not produce such whoppers as "exactly zero shared semantics from system to system". And even if such were true, it'd still be a lousy reason to disallow them -- perhaps XML should require all tag names to exist in the Esperanto dictionary since most words have zero shared semantics from language to language. Sheesh.

        - tye        
