
Re: binary data in XML

by Fletch (Bishop)
on Feb 28, 2008 at 18:02 UTC

in reply to binary data in XML

If you're getting that error then you don't really have XML, you've got something masquerading as XML and the usual tools aren't going to handle it (well-formed-ness is sort of a bare minimum "You must be this tall to ride" requirement for data to be XML). The normal way to embed data that might contain characters that would be mistaken for markup is to either embed it in <![CDATA[...]]> (presuming your data doesn't contain any embedded ]]>'s of course :), or to encode it using something like base64 so that there's nothing but vanilla characters (but then it's incumbent upon the receiver to reconstitute the original data).
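
A minimal sketch of the base64 route, assuming Python's standard `base64` and `xml.etree.ElementTree` modules. The element name `blob`, the `encoding` attribute, and the helper names are hypothetical, chosen only for illustration. The emitter encodes the octets before they go into the document, and the receiver decodes them after parsing:

```python
import base64
import xml.etree.ElementTree as ET


def embed_binary(payload: bytes) -> str:
    """Return a small XML document carrying `payload` as base64 text."""
    # "blob" and the "encoding" attribute are hypothetical names.
    root = ET.Element("blob", encoding="base64")
    root.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def extract_binary(xml_text: str) -> bytes:
    """Parse the document and reconstitute the original octets."""
    root = ET.fromstring(xml_text)
    # An empty payload round-trips as an element with no text at all.
    return base64.b64decode(root.text or "")


# Every possible byte value, plus sequences that would upset a parser if raw.
payload = bytes(range(256)) + b"]]><&"
doc = embed_binary(payload)
assert extract_binary(doc) == payload
```

The parser only ever sees letters, digits, `+`, `/` and `=`, so nothing in the payload can be mistaken for markup or fall outside XML's allowed character range.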

Unfortunately both of those options are going to require you to get whatever's emitting the nauga-XML to fix their output (then again it's their problem for creating output that's not well formed to begin with . . .).

Update: And if you read the XML spec, not being well-formed is a fatal error, and processors conforming to the spec are supposed to toss up their hands and halt normal processing.
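
To see that halting behaviour in practice, here is a small sketch using Python's `xml.etree.ElementTree` (built on expat). The helper name `is_well_formed` and the sample documents are made up for illustration; the original poster's actual data and error message were not shown.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the document, False if it gives up."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser stops at the first well-formedness violation;
        # it does not try to recover and carry on.
        return False
    return True


good = "<data>plain text</data>"
raw_control = "<data>raw \x01 byte</data>"  # U+0001 is not a legal XML 1.0 character
mismatched = "<a><b></a>"                   # improperly nested tags

for sample in (good, raw_control, mismatched):
    print(repr(sample), is_well_formed(sample))
```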

The cake is a lie.

Replies are listed 'Best First'.
Re^2: binary data in XML
by sailortailorson (Scribe) on Feb 28, 2008 at 18:19 UTC
    There is a vanishingly small chance that the data contain ']]>'. In fact, this could be an issue, but I have only been working here for two days and I am sure that the first answer I get will be that it is so unlikely that it is considered impossible. But this data has a high sorrow factor if it is not handled correctly, so I will eventually bring that up.

    I think that base64 encoding is out of the question as it would be too slow.

    In practical terms, getting the binary data wrapped in '<![CDATA[...]]>' is probably the best option. Maybe for now, I can do that myself in preprocessing and get some traction on parsing these that way.

    Thank you.

      CDATA doesn't, in fact, make the least bit of difference as to what characters you can include in the data. You can't put (unencoded) binary data into XML using CDATA. The only difference between non-CDATA and CDATA is that one requires you to encode some single-character items while the other requires you to encode one 3-character sequence. This makes CDATA quite silly, IMHO.

      And, no, you can't even use &#12; to get "binary" characters into XML.

      - tye        
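
A quick way to check both claims, sketched with Python's `xml.etree.ElementTree`. The helper name `parses` and the sample strings are illustrative assumptions. A control character is rejected whether it appears raw, inside CDATA, or as a numeric character reference, while CDATA does let ordinary markup characters through:

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "raw control char":      "<d>\x01</d>",
    "control char in CDATA": "<d><![CDATA[\x01]]></d>",
    "form feed reference":   "<d>&#12;</d>",
    "markup chars in CDATA": "<d><![CDATA[a < b && c > d]]></d>",
    "newline reference":     "<d>&#10;</d>",
}

for label, doc in samples.items():
    print(f"{label:24} {'ok' if parses(doc) else 'rejected'}")
```

The only thing CDATA changes is which characters need escaping, not which characters are allowed in the first place.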

        Perhaps I was a little unclear, so for the benefit of the slow, I was offering two alternatives:

        • Use CDATA if you've got character data that otherwise would be interpreted as markup and don't mind the onerous task of encoding the commonly appearing sequence "]]>" instead of encoding every other offending single character individually (say you had Perl code which was otherwise free of verboten characters but chock full of > and < etc)
        • Use an out-of-band encoding such as base64 which is handled at the application layer rather than by the XML parser itself for what would otherwise be invalid character data (a sequence of octets which would be outside XML's allowed range when interpreted)

        Given that there was no example of what exactly the offending "binary data" was I thought it best to offer both options: the simple for more vanilla ASCII-y data and encoding for arbitrary octet streams.
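
For the first alternative, one common way to cope with an embedded "]]>" is to end the CDATA section in the middle of the sequence and immediately open a new one. A minimal sketch in Python, where the helper name `wrap_cdata` and the sample snippet are hypothetical; it assumes the text already consists only of characters XML allows:

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap `text` in CDATA, splitting any "]]>" across two sections."""
    # "]]>" becomes "]]" (end of one section) + ">" (start of the next).
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Code-like text full of < > & and containing "]]>" once.
snippet = 'if ($a < $b && $c[$d[0]]> 1) { print "ok" }'
doc = "<code>" + wrap_cdata(snippet) + "</code>"

# The parser stitches the adjacent sections back into one text node.
assert ET.fromstring(doc).text == snippet
```

This does nothing for arbitrary octet streams; those still need the out-of-band encoding from the second alternative.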

        Update: Further clarified what types of data suggest which alternative.

        The cake is a lie.
