Wednesday, February 27, 2013

The Character Encoding Circus of XML-over-HTTP

Welcome to the Circus

Transmitting data in the form of XML over the HTTP protocol in an ad hoc manner is simple. Just put the XML data into the body of an HTTP request, send the request, and grab the returned XML data from the body of the returned HTTP response. Nothing could be simpler - or so it seems. In fact the process is somewhat elaborate and contains intricate details which only a few programmers get right.

When transmitting data over HTTP it is a good idea to state the nature of the content. This is done by setting the content-type header, which contains a MIME type. The header may be something like -

Content-Type: text/plain; charset="UTF-8"

This header may also define the character encoding. When the character encoding is not set, most textual values within the context of HTTP default to or must be ISO-8859-1. No matter which character encoding the content-type header specifies, some things must always be encoded using ISO-8859-1. This is the case for HTTP header values in general. It is even the case for the Base64 encoded content "<username>:<password>" found within HTTP Basic and set by clients in the Authorization header - this is not dependent upon any encoding set in the content-type header, contrary to what many programmers believe. This is due to the origin and legacy of the HTTP protocol. If things were different and a choice was present, the encoding ISO-8859-1 would not always be preferable over other encodings.

When the content-type does not specify a character encoding, different rules apply depending upon the MIME type used. In one case the XML content must be interpreted as US-ASCII, while in another case it must be interpreted as binary - reading the optional byte order mark (BOM) and processing the encoding of the initial processing instruction -
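The point about HTTP Basic can be made concrete with a minimal sketch in Python. The function name basic_auth_header is made up for illustration, and the choice of ISO-8859-1 simply follows the statement above; it is not taken from any particular client library.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an Authorization header for HTTP Basic.

    The credentials are encoded as ISO-8859-1 before the Base64 step,
    regardless of any charset named in a content-type header.
    """
    credentials = f"{username}:{password}".encode("iso-8859-1")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("Aladdin", "open sesame"))
# prints: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

A user name or password with a character outside ISO-8859-1 makes the encode step raise UnicodeEncodeError, which is a fair reflection of the limitation.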

<?xml version="1.0" encoding="UTF-16"?>

This implies that, depending on the case, the "encoding" declaration in the processing instruction may be unnecessary and ignored.

Should it happen that some character encoding is not set right, it may not pose a problem as long as both the client and the server agree upon how to do things. Often, however, clients and servers are not coded by the same programmers, or they have to be compatible with existing software, and in this case things must be done right. When the transmission of XML data over HTTP is not done right the character encoding circus is open. This may imply expensive debugging sessions or result in middleware which does not quite cut it and which does not work in all intended use cases.

Specifications

When transmitting XML most professional programmers use the MIME type application/xml. There is another MIME type in common use - the type text/xml. While both types are intended for the transmission of XML, they do have different interpretations.

Specifications do exist. RFC 3023 specifies how five different media types for XML are supposed to be used. It is not easy to read but does contain strict answers.

Regarding the basic types text/xml and application/xml, RFC 3023 contains a section "3.6 Summary" with some clear statements. What applies to text/xml - among other things - is this:

  • Specifying the character encoding "charset" as part of the content-type is strongly recommended.
  • If the "charset" parameter of the content-type is not specified, then the default is "US-ASCII". The default of "ISO-8859-1" in HTTP is explicitly overridden.
  • An "encoding" declaration in the initial processing instruction of the XML, if present, is irrelevant.

It may well come as a surprise that the character encoding US-ASCII - and not the implicit encoding ISO-8859-1 of HTTP - is the default.

The summary also contains statements about the type application/xml:

  • Specifying the character encoding "charset" as part of the content-type is strongly recommended, and if present, it takes precedence.
  • If the "charset" parameter of the content-type is not specified, conforming XML processors must follow the requirements in section 4.3.3 of the Extensible Markup Language (XML) 1.0 specification ([XML]).

What this explains is that the content is either 1) to be read and written as text using the "charset" of the content-type header, or 2) to be read and written as binary using the "encoding" declaration in the initial processing instruction of the XML.
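The rules summarized for the two types can be sketched as a small decision function. This is only an illustration under assumptions: the name xml_charset is hypothetical, the header is parsed with Python's standard email.message module, and only text/xml and application/xml are covered, not the other media types of the RFC.

```python
from email.message import Message
from typing import Optional


def xml_charset(content_type: str) -> Optional[str]:
    """Return the charset to decode an XML body with.

    None means: treat the body as binary and detect the encoding from
    the BOM and the "encoding" declaration of the XML itself.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()   # lower-cased, e.g. "text/xml"
    charset = msg.get_param("charset")    # surrounding quotes are removed

    if charset is not None:
        # An explicit charset takes precedence for both types.
        return str(charset).lower()
    if media_type == "text/xml":
        # Default is US-ASCII, overriding the ISO-8859-1 default of HTTP.
        return "us-ascii"
    if media_type == "application/xml":
        # Fall back to section 4.3.3 of the XML specification.
        return None
    raise ValueError(f"not handled in this sketch: {media_type}")


print(xml_charset("text/xml"))                          # us-ascii
print(xml_charset('application/xml; charset="UTF-8"'))  # utf-8
print(xml_charset("application/xml"))                   # None
```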

It is recommended to handle the XML textually by using the form -

Content-Type: application/xml; charset="UTF-8"

- containing a character encoding like "UTF-8" and in this case the "encoding" declaration in the initial processing instruction of the XML like -

<?xml version="1.0" encoding="ISO-8859-1"?>

- is to be ignored. To avoid confusion it would in fact look much better with -

<?xml version="1.0"?>

- since UTF-8 and ISO-8859-1 are not compatible or interchangeable.
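As a small illustration of the recommended form, the following Python sketch prepares (but does not send) a request with urllib from the standard library. The URL and the sample document are invented for the example. The last lines show what happens when a receiver wrongly trusts an ISO-8859-1 declaration for a body that is really UTF-8.

```python
import urllib.request

# XML without an "encoding" declaration; the charset lives in the header.
xml_text = '<?xml version="1.0"?><name>J\u00f8rgen</name>'
body = xml_text.encode("utf-8")

request = urllib.request.Request(
    "http://example.invalid/service",  # placeholder URL, an assumption
    data=body,
    headers={"Content-Type": 'application/xml; charset="UTF-8"'},
    method="POST",
)

# A receiver decoding with the header's charset gets the text back intact.
assert body.decode("utf-8") == xml_text

# A receiver decoding the same bytes as ISO-8859-1 gets garbled text:
# the two bytes of the UTF-8 encoded letter become two separate characters.
misread = body.decode("iso-8859-1")
print(misread != xml_text)  # True
```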

Section 4.3.3 of [XML] states how to read XML content from a binary stream, just as is done from a file. First the BOM is read together with the initial XML processing instruction and its "encoding" declaration, and then the reader interprets the binary stream as text according to the byte order and the character encoding. Some of the details of how to do this can be read in another, quite interesting section of [XML]: "F Autodetection of Character Encodings (Non-Normative)".
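The binary reading can be approximated with a simplified Python sketch. It is far from the full autodetection of [XML]: it only knows the BOMs of UTF-8 and UTF-16, it only finds an "encoding" declaration written in an ASCII-compatible encoding, and it assumes UTF-8 when neither is present. UTF-32, EBCDIC and UTF-16 without a BOM are not handled. The function names are made up for the example.

```python
import codecs
import re

# Byte order marks this sketch recognizes, and the codec each one implies.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# The "encoding" declaration of the initial processing instruction.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of XML bytes: BOM first, then the declaration."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # default when there is no BOM and no declaration


def decode_xml(data: bytes) -> str:
    """Decode XML bytes to text, dropping a leading BOM if there is one."""
    encoding = sniff_xml_encoding(data)
    for bom, _ in _BOMS:
        if data.startswith(bom):
            data = data[len(bom):]
            break
    return data.decode(encoding)


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
# prints: iso-8859-1
```

In real code it is usually better to hand the raw bytes to an XML parser, which performs this detection itself.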

Data in the form of XML is to be handled differently when contained as text as opposed to a binary stream of octets.

Strange Behaviour and Incompatibility

Even though the specification of text/xml and application/xml is more than a decade old - it is from the year 2001, the infancy of the world adopting SGML in its new incarnation of XML - the differences between the types are hardly common knowledge. These MIME types are far from the only ones used for XML, but both are in common use for many not-always-ingenious and proprietary ways to communicate XML.

Complex matters tend to become a circus of strange behaviour and incompatibility only when existing specifications are not adhered to.

Once the circus is open it includes some of the most unbelievable, unanticipated, hard-to-address and restrictive effects, and the extravaganza of examples present in the real world of application programming is not even touched upon here.

Should you ever hear statements about text/xml and application/xml being equal - or not in common use - then you have acquired one of many tickets to the circus.
