in reply to Encoding is a pain.

Personally I view this problem a little bit differently: Why doesn't XML have a way to handle arbitrary binary data? It seems like there is no way to use XML to carry generic binary data. A good example are the XML tickers here, there are characters possible in a node and other places that cannot be validly embedded in XML. This means that unless we encode all node content as hex or something like it, we cannot be sure that we will return valid XML. Since we don't want to do this, we have the problem that it's relatively easy to embed characters in a node that will break many of the XML parsers that consume data from our tickers. I see your language encoding issue as just a variant of this problem. Maybe that's wrong, but that's the way it feels to me.


---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi

    Flux8


Re^2: Encoding is a pain.
by grantm (Parson) on Sep 20, 2004 at 22:01 UTC
    Why doesn't XML have a way to handle arbitrary binary data? ... A good example are the XML tickers here, there are characters possible in a node and other places that cannot be validly embedded in XML.

    Make up your mind, are they characters or binary data? :-)

    Certainly any character which can be represented in HTML should be representable in XML. For example, in HTML you could use &eacute; for 'é'. In XML, you don't have the handy mnemonic name unless you use a DTD, but you can still represent the character as &#233; - 'é'. The HTML::Entities module can help with the conversion.
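As a rough illustration of the named-to-numeric conversion described above, here is a minimal Python sketch. It is not the HTML::Entities module mentioned in the reply. It uses the standard library's `html.entities.name2codepoint` table instead, and the function name `named_to_numeric` is made up for this example. It assumes the input uses only simple `&name;` references.

```python
import re
from html.entities import name2codepoint

# The five entity names that XML itself predefines; these can stay as they are.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

def named_to_numeric(text):
    """Replace HTML named entities with numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML-native names and unknown names untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

print(named_to_numeric("caf&eacute; &amp; cr&egrave;me"))
# prints: caf&#233; &amp; cr&#232;me
```

The numeric form needs no DTD, so the result can be dropped into any XML document regardless of which entity names the consumer knows about.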

      The character versus data distinction is important. XML does have a way to express non-ASCII characters using the DTD as noted. For true binary data CDATA tags almost do it, but they're not foolproof since the binary data could contain sequences that would make the tag look like it ended before it really did. But you could encode using an agreed upon scheme, such as uuencode or base64 encoding and put that in CDATA tags. Ugly, but possible.
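The "encode with an agreed-upon scheme" idea can be sketched in a few lines of Python. This is only an illustration: the element name `payload` and the `encoding` attribute are hypothetical conventions that sender and receiver would have to agree on, not anything defined by XML itself. Since the base64 alphabet contains no markup characters, the sketch stores the encoded text as ordinary element content; wrapping it in a CDATA section would work as well.

```python
import base64
import xml.etree.ElementTree as ET

def wrap_binary(data):
    """Return an XML string carrying the bytes in `data` as base64 text."""
    # "payload" and the "encoding" attribute are made-up names for this sketch.
    elem = ET.Element("payload", encoding="base64")
    elem.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(elem, encoding="unicode")

def unwrap_binary(xml_text):
    """Parse the XML produced by wrap_binary and return the original bytes."""
    elem = ET.fromstring(xml_text)
    return base64.b64decode(elem.text)

# Every byte value, plus a sequence that would prematurely end a raw CDATA section.
raw = bytes(range(256)) + b"]]></payload>"
doc = wrap_binary(raw)
assert unwrap_binary(doc) == raw
```

The cost is size (base64 grows the data by roughly a third) and the fact that every consumer has to know to decode the field.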

        Actually, in XML 1.0 CDATA sections are no good for binary data even without the delimiter issue. A CDATA section is defined to contain Chars, which in turn are defined as:

        Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

        So for example control characters in the range 0x00 - 0x08 are not allowed. There are also encoding issues which would prevent you putting binary bytes in CDATA.
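To make the Char production quoted above concrete, here is a small Python sketch that reports any character in a string that falls outside it. The function name `invalid_xml_chars` is invented for this example, and the check applies to already-decoded text, so it says nothing about the separate byte-encoding issues mentioned above.

```python
import re

# Matches any single character NOT allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def invalid_xml_chars(text):
    """Return (offset, hex code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group())))
            for m in INVALID_XML_CHAR.finditer(text)]

print(invalid_xml_chars("tab\tand newline\nare fine"))  # prints: []
print(invalid_xml_chars("a\x00b\x08"))  # prints: [(1, '0x0'), (3, '0x8')]
```

A ticker could run a check like this on node content before emitting it, and then decide whether to drop, replace, or encode the offending characters.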

Re^2: Encoding is a pain.
by graff (Chancellor) on Sep 21, 2004 at 05:27 UTC
    Why doesn't XML have a way to handle arbitrary binary data?

    Well, most likely because some heavy-weight extra stuff would be needed to take care of the non-null probability that "arbitrary binary data" might, just by coincidence, contain a byte sequence that starts with 0x3c "<", ends with 0x3e ">" and has just alphanumerics (and perhaps an unfortunately well-placed slash character) in between.

    Sure, there are bound to be ways to do this, but I think the vast majority of XML users really don't want to go there (not least of all because of what it might do when passed through various network transfer protocols). (update: e.g. how would you "fix" the ubiquitous "crlf/dos-text-mode" transfer methods to handle "arbitrary binary content in XML"? This is tricky enough already just with UTF-16.)