sailortailorson has asked for the wisdom of the Perl Monks concerning the following question:

I want to use XML::Twig to parse XML that holds binary data guaranteed not to be in any particular format.

Without any input filtering, it incurs a 'not well-formed (invalid token)' warning as soon as it hits the binary data (of course).

When I read about/try the input filter options, they seem like they all expect some particular character encoding.

Is there an accepted way to convert or protect arbitrary binary data that is inline in the XML when parsing it?

The XML itself is not encrypted, just the tagged data.

Replies are listed 'Best First'.
Re: binary data in XML
by Fletch (Bishop) on Feb 28, 2008 at 18:02 UTC

    If you're getting that error then you don't really have XML; you've got something masquerading as XML, and the usual tools aren't going to handle it (well-formedness is sort of a bare-minimum "you must be this tall to ride" requirement for data to be XML). The normal way to embed data that might contain characters that would be mistaken for markup is either to wrap it in <![CDATA[...]]> (presuming your data doesn't contain any embedded ]]> sequences, of course :), or to encode it using something like base64 so that there's nothing but vanilla characters (but then it's incumbent upon the receiver to reconstitute the original data).
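    For the base64 route, a minimal sketch with the core MIME::Base64 module (the <blob> element name and the regex extraction here are made up for illustration; a real consumer would use a proper parser):

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Arbitrary binary payload, including bytes that would break XML.
my $binary = "\x00\x01\xFF<not>markup</not>\x0C";

# Producer side: base64 turns any bytes into XML-safe ASCII.
my $b64 = encode_base64($binary, '');    # '' => no embedded newlines
my $xml = qq{<blob enc="base64">$b64</blob>};

# Consumer side: extract the text content and reconstitute the bytes.
my ($payload) = $xml =~ m{<blob enc="base64">([^<]*)</blob>};
my $restored  = decode_base64($payload);

print $restored eq $binary ? "round trip ok\n" : "round trip FAILED\n";
```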

    Unfortunately both of those options are going to require you to get whatever's emitting the nauga-XML to fix their output (then again it's their problem for creating output that's not well formed to begin with . . .).

    Update: And if you read the XML spec not being well-formed is a fatal error, and processors conforming to the spec are supposed to toss up their hands and halt normal processing.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      There is a vanishingly small chance that the data contain ']]>'. In fact, this could be an issue, but I have only been working here for two days and I am sure that the first answer I get will be that it is so unlikely that it is considered impossible. But this data has a high sorrow factor if it is not handled correctly, so I will eventually bring that up.

      I think that base64 encoding is out of the question, as it would be too slow.

      In practical terms, getting the binary data wrapped in '<![CDATA[...]]>' is probably the best option. Maybe for now, I can do that myself in preprocessing and get some traction on parsing these that way.

      Thank you.

        CDATA doesn't, in fact, make the least bit of difference as to what characters you can include in the data. You can't put (unencoded) binary data into XML using CDATA. The only difference between non-CDATA and CDATA is that one requires you to encode some single-character items while the other requires you to encode one 3-character sequence. This makes CDATA quite silly, IMHO.

        And, no, you can't even use &#12; to get "binary" characters into XML.

        - tye        

Re: binary data in XML (semantics)
by tye (Sage) on Feb 28, 2008 at 18:33 UTC

    The designers of XML have saved you from yourself. You aren't even allowed to send formfeed in XML so you are just crazy thinking XML would allow something so insane as sending binary data! Be glad the XML designers had your best interests in mind! If not for their keen insight and concern, you'd be sending binary data already and boy would you soon regret it!

    As I note in Re: Funny characters in nodes (exactly zero), Tim Bray declared "XML dislikes [...] form-feed[s] [etc.] which have exactly zero shared semantics from system to system". Yes you'll never find two systems in the world that both use "form feed" to represent a page break.

    So you need to either invent your own, proprietary encoding for the binary data and encode the binary data into XML-approved characters (to ensure "shared semantics", oh the irony) and then teach every party involved this new proprietary encoding. Or, you could just find one of the many "XML parsers" (the scare quotes are required by the XML standard) that have the good sense to at least optionally ignore the requirements that they complain about characters that Tim Bray dislikes (something that XML 1.1 will also likely mostly do).

    If you can't find such an "XML parser", then you could also just use a simplistic scheme to transform the "not well-formed 'XML'" into XML and then transform all parsed-out values to recover the original binary data. For example, replace any control characters (or other XML-hated characters) and any backslashes with \xx where "xx" is the hex value of the byte (I don't think there are any Unicode characters that XML hates that won't fit in one byte) and then perform the reverse translation on the extracted values.
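    A sketch of that escape/unescape pair in Perl (exactly which characters to escape is a judgment call; here, backslash plus all the control characters, erring on the side of escaping more than XML strictly rejects):

```perl
use strict;
use warnings;

# Escape: replace backslash and control characters with \xx (two hex digits).
sub escape_binary {
    my ($s) = @_;
    $s =~ s/([\\\x00-\x1F\x7F])/sprintf '\\%02x', ord $1/ge;
    return $s;
}

# Unescape: the reverse translation, applied to values pulled out of the parser.
sub unescape_binary {
    my ($s) = @_;
    $s =~ s/\\([0-9a-f]{2})/chr hex $1/ge;
    return $s;
}

my $raw  = "page1\x0Cpage2\\literal\x00tail";
my $safe = escape_binary($raw);
print unescape_binary($safe) eq $raw ? "round trip ok\n" : "round trip FAILED\n";
```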

    - tye        

      Perhaps I am crazy, but I am merely asking about a practice that is already in place where I have just started a new job.

      Perhaps you mean "they" are crazy. Actually, I might be able to agree with that statement, but it would be out of the scope of any engineering approach to answering this question. Perhaps, by 'you are ... crazy', you simply mean 'not me' (from your point of view, of course). If, by 'you' you simply mean "someone other than myself", then OK, that's, well, odd, and random, but we can go on with part of this discussion that actually addresses the real problem.

      Anyway, will I still be crazy (by your definition) if I gather the following from what you say?

      To wit:

      That the practice in place here of using binary data inline in XML is unusual and deprecated or at least that my new coworkers have probably implemented a system that offends Tim Bray.

      That it is not likely remedied by use of <![CDATA[...]]>.

      That in effect, hex encoding is the only way to use XML::Twig.


      That maybe I ought to use HTML::TreeBuilder::XPath, or some other such library that does not kill itself when it sees XML that does not strictly conform to the standard?

        HTML::TreeBuilder::XPath might be what you're looking for. Be aware of two things, though: it loads the entire document in memory, and its XML export method (as_XML, inherited from HTML::TreeBuilder) does not care about encoding at all, so it might very well produce non-well-formed XML. Which is probably what you want, come to think of it.
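        A minimal sketch of that approach (note the ignore_unknown(0) call: by default HTML::TreeBuilder drops tags it does not recognize as HTML, which would include your XML element names; whether your particular binary bytes survive the HTML parse intact is something to verify against real data):

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Not well-formed XML: the value contains a raw control character (\x01).
my $not_quite_xml = "<record><value>abc\x01def</value></record>";

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);    # keep non-HTML tags like <record> and <value>
$tree->parse($not_quite_xml);
$tree->eof;

# XPath queries still work, where an XML parser would have died already.
my $value = $tree->findvalue('//value');
printf "extracted %d characters\n", length $value;

$tree->delete;    # HTML::TreeBuilder trees must be freed explicitly
```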

Re: binary data in XML
by sgt (Deacon) on Feb 29, 2008 at 00:08 UTC

    CDATA sections originate from a related SGML concept and are used for verbatim text (like code).

    They are simply a way to stop any special entity-like processing and are certainly *not* meant for binary data. There are many schemes for binary XML.

    If the document is valid UTF-8, then base64-encoding the values is useful and needs no escaping, since quotes are not part of the base64 output alphabet. Obviously you could also put the encoded data directly in a CDATA section, but it seems less useful than:

    <data local_enc="base64"> <value>c3RlcGhhbgo=</value> </data>
    The whole arsenal of MIME/PEM conversions can be used. But be careful, there is one pitfall: the scheme *breaks* if the XML document uses a multi-byte encoding like UTF-16, as a base64 sequence of bytes is not valid UTF-16. The solution is to use an extra conversion like  iconv -f ISO-8859-1 -t UTF-16 and its inverse.
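    The pitfall lives at the document level, not the value level: the base64 text itself is plain ASCII, so the whole document, base64 included, has to go through the same encoding conversion. A sketch of that round trip with the core Encode module instead of iconv:

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use MIME::Base64 qw(encode_base64 decode_base64);

my $b64 = encode_base64("stephan\n", '');    # 'c3RlcGhhbgo='
my $doc = qq{<data local_enc="base64"><value>$b64</value></data>};

# Writing a UTF-16 document: convert the *whole* document, base64 and all.
my $utf16_bytes = encode('UTF-16', $doc);

# Reading it back: decode the document first, then decode the value.
my $round = decode('UTF-16', $utf16_bytes);
my ($val) = $round =~ m{<value>([^<]*)</value>};
print decode_base64($val);    # prints "stephan" (and a newline)
```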

    Finally, one way to encode  ]]>  would be to close the CDATA section after outputting ]] and then open another one starting with  >.
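    In Perl that split is a single substitution on the payload before wrapping it (a sketch; a conforming parser concatenates the two adjacent CDATA sections back into the original text):

```perl
use strict;
use warnings;

# A payload that happens to contain the CDATA terminator.
my $payload = "binary ]]> more data";

# Close the section after "]]", then reopen a new one starting with ">".
(my $safe = $payload) =~ s/\]\]>/]]]]><![CDATA[>/g;

my $cdata = "<![CDATA[$safe]]>";
print "$cdata\n";    # <![CDATA[binary ]]]]><![CDATA[> more data]]>
```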

    cheers --stephan