What use is the 'encoding' in the XML header?

As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.

However, there is a heuristic that easily gets you close enough to the "real" encoding to read the encoding attribute. This works because the <?xml part by definition can contain only characters in the ASCII range (however they are encoded).

The XML standard even describes the exact process used to find out the encoding.

And the encoding label isn't redundant either. For example, if you use the algorithm in the XML spec and find that some ASCII-based (or ASCII-compatible) encoding is used, you still need to read the encoding attribute to find out which one is actually used (valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others). For the <?xml part itself it makes no difference which one it is, but for the rest of the document it can make a huge difference.
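To make the two steps concrete, here is a minimal Python sketch, assuming a document without a byte order mark. The byte patterns follow the table in Appendix F of the XML 1.0 spec; the names guess_family and declared_encoding, the 1024-byte window and the simplified regular expression are my own assumptions, not part of any library or of the spec.

```python
import re

# First-four-byte patterns for documents WITHOUT a byte order mark,
# following the table in Appendix F of the XML 1.0 specification.
_FAMILIES = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii"),  # "<?xm" in any ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "cp037"),  # "<?xm" in EBCDIC (cp037 chosen as a stand-in)
]

# Simplified pattern for the encoding pseudo-attribute (an assumption,
# not the full grammar from the spec).
_DECL = re.compile(r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def guess_family(data: bytes) -> str:
    """Step 1: guess an encoding that is good enough to read the declaration."""
    for prefix, codec in _FAMILIES:
        if data.startswith(prefix):
            return codec
    return "utf-8"  # no XML declaration at all: the default is UTF-8


def declared_encoding(data: bytes):
    """Step 2: decode the start of the file with the guess and read the label."""
    head = data[:1024].decode(guess_family(data), errors="replace")
    match = _DECL.match(head)
    return match.group(1) if match else None
```

For a KOI8-R file, guess_family only reports "ascii", which is enough to read the declaration; declared_encoding then returns the label ("KOI8-R") that tells you how to decode the rest of the document.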

Regarding mislabeled XML files: yes, they're easy to produce. However, the XML spec clearly specifies that such files are malformed and therefore not correct XML. Incorrect encodings must be reported as an error (as long as they can be detected!). So it's the problem of whoever is producing the XML.
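As a rough illustration of "as long as they can be detected", here is what Python's standard xml.etree.ElementTree does with two mislabeled documents. The sample documents and variable names are made up for this example; other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET

# Detectable: 0xE9 is "é" in ISO-8859-1, but on its own it is not valid UTF-8,
# so a parser that trusts the UTF-8 label has to reject the document.
mislabeled = b'<?xml version="1.0" encoding="UTF-8"?><a>\xe9</a>'
try:
    ET.fromstring(mislabeled)
    detected = False
except ET.ParseError:
    detected = True

# Not detectable: UTF-8 bytes labeled as ISO-8859-1. Every byte is a legal
# ISO-8859-1 character, so the parser cannot notice and returns wrong text.
undetectable = '<?xml version="1.0" encoding="ISO-8859-1"?><a>é</a>'.encode("utf-8")
text = ET.fromstring(undetectable).text

print(detected)  # True
print(text)      # two characters (mojibake) instead of one "é"
```

The second case is why the producer carries the responsibility: the consumer has no way to tell that the label is wrong.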

You're quite right that it looks like an odd design. It only works because the XML declaration uses only ASCII characters, and nearly all encodings are supersets of ASCII. If you're prepared to accept something that isn't, for example EBCDIC, you can check whether the file starts with whatever the EBCDIC representation of "<?xml" is. That means you're relying on the general level of redundancy in the header of the file, rather than purely on the encoding attribute itself. Like many things in XML, it's pragmatic and works, but isn't particularly elegant.
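The EBCDIC check might look like this in Python. This is only a sketch: cp037 is one common EBCDIC code page picked as an assumption, and looks_like_ebcdic_xml is a hypothetical helper name.

```python
# "<?xml" as it appears at the start of an EBCDIC-encoded file.
# cp037 is assumed here; "<?xm" comes out as 4C 6F A7 94, the pattern
# the XML spec lists for EBCDIC.
EBCDIC_PROLOG = "<?xml".encode("cp037")


def looks_like_ebcdic_xml(data: bytes) -> bool:
    """True if the file starts with the EBCDIC bytes for "<?xml"."""
    return data.startswith(EBCDIC_PROLOG)


sample = '<?xml version="1.0"?><a/>'
print(looks_like_ebcdic_xml(sample.encode("cp037")))  # True
print(looks_like_ebcdic_xml(sample.encode("ascii")))  # False
```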

XML parsers are required to support only UTF-8 and UTF-16; anything else is optional. The parser starts by picking an encoding based on the Byte Order Mark (BOM), if present (for UTF-16, UTF-32 and even UTF-8, where the BOM carries no byte-order information). If no BOM is found, the parser tries UTF-32, UTF-16, UTF-8, ASCII and other ASCII-compatible single-byte encodings. Only then does it read the encoding attribute, and it restarts parsing if necessary.
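The BOM step could be sketched like this in Python, using the BOM constants from the standard codecs module. The function name sniff_bom and its return shape are assumptions for illustration, not how any particular parser does it.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (codec, bom_length), or (None, 0) when no BOM is present."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec, len(bom)
    return None, 0
```

When sniff_bom returns (None, 0), the parser falls back to guessing from the bytes of "<?xml" itself and then to the encoding attribute, as described above.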
