Xml namespace weirdness in MSXML4
This post has been a while brewing. It began a few months ago when I was working on a project with my colleague Rob Wittenbols. He was building an XML document using the MSXML4 DOM API, and he wanted the same degree of control over where namespace prefixes were declared as he would have had if he were deserializing the document from a text file. We discussed the subject briefly and then went off to do other things. Later in the day it turned out we'd both been pondering the issue, and had come up with different answers.
I had looked at the API and discovered that it was impossible to do what he wanted. The API didn't support it. You could create a node, and put it in a namespace, and even give that node a prefix, but you couldn't declare another namespace prefix at the same time. That made perfect sense, I said, because namespace prefixes were purely to do with the serialized view of the document, and where they were declared was generally irrelevant when working in the DOM. As long as you could get each element into the correct namespace, you'd be fine.
Rob had taken an almost opposite view, and come up with a bit of practical hackery that blew my theoretical argument out of the water. He'd just escaped the final quote of his namespace value and carried on to add another one. Something like this:
dom.createNode(1, "xxx:child", "xxx"" xmlns:y=""yyy")
(Surprisingly to me) this worked, and allowed him to achieve what he wanted, and I couldn't argue with him because his program was intended to output serialized XML and he wanted it in a particular way. To be fair, a good DOM implementation would provide a clean way to do this. (OK - this is an edge case, and in many other respects MSXML4 is an excellent implementation.)
Getting back to theoretical niceties, I went and looked up the recommendations for XML namespaces. A namespace name must be a URI as defined in RFC 2396, which excludes double quotes from URIs:
The angle-bracket "<" and ">" and double-quote (") characters are excluded because they are often used as the delimiters around URI in text documents and protocol fields.
So strictly, Rob's hack is cheating :-) , but it got the job done for him. (Note that in this context we don't have to worry about the single-quote character (') because when MSXML4 serializes the DOM, it uses double-quotes to delimit the namespace name.)
The current recommendation, XML Namespaces 1.1, specifies that a namespace name should be an Internationalized Resource Identifier, and the standard for those (currently a draft) doesn't mention double-quotes at all.
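For what it's worth, a guard based on the RFC 2396 sentence quoted above would be tiny. This is only a sketch, written in Python so that it runs without MSXML; `has_excluded_delimiter` is a name invented here, not part of any DOM API, and it checks only the three characters named in that sentence.

```python
# The three characters named in the RFC 2396 sentence quoted above.
EXCLUDED_DELIMITERS = '<>"'


def has_excluded_delimiter(namespace_name):
    """Return True if the name contains '<', '>' or a double quote."""
    return any(ch in EXCLUDED_DELIMITERS for ch in namespace_name)


# A plain namespace name passes; the name produced by Rob's hack does not.
print(has_excluded_delimiter("xxx"))
print(has_excluded_delimiter('xxx" xmlns:y="yyy'))
```

A DOM that ran a check like this before accepting a namespace name would reject the hack outright, which is the trade-off the rest of this post is about.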
Not long after this discussion, I almost got as far as writing about it when Don Box made a posting on a related topic, asking how namespaces ought to be compared for equivalence, as the Microsoft implementations do not %-encode or decode namespace names before comparing them. The responses were fairly clear. The Microsoft implementation is correct in this respect: namespaces are compared as strings.
To quote from XML Namespaces 1.1:
IRI references identifying namespaces are compared when determining whether a name belongs to a given namespace,
and whether two names belong to the same namespace. [Definition: The two IRIs are treated as strings, and they are
identical if and only if the strings are identical, that is, if they are the same sequence of characters.]
The comparison is case-sensitive, and no %-escaping is done or undone.
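The rule in that passage amounts to plain string equality. As a minimal sketch (Python rather than VBScript, so it can be run anywhere; `same_namespace` and the example.com names are made up for illustration):

```python
def same_namespace(a, b):
    """Namespace names match only if they are the same sequence of characters."""
    return a == b


# Identical strings: the same namespace.
print(same_namespace("http://example.com/ns", "http://example.com/ns"))

# Differing only in case: different namespaces.
print(same_namespace("http://example.com/ns", "http://EXAMPLE.com/ns"))

# %-escaping is neither done nor undone, so these differ too.
print(same_namespace("http://example.com/a%20b", "http://example.com/a b"))

# And 'xxx' is not the name that Rob's hack actually stored.
print(same_namespace("xxx", 'xxx" xmlns:y="yyy'))
```

Only the first comparison is true. The last one is the same mismatch that an XPath query for namespace 'xxx' runs into when the element really lives in the longer, hacked name.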
So that's pretty straightforward. When comparing namespaces for equivalence, you don't do anything with the string. Having said that, I think the implementation of MSXML4 is broken. It should check for the double-quote character. Here's a snippet of VBScript that shows why:
Set dom = CreateObject("Msxml2.DOMDocument.4.0")
dom.async = False
dom.loadXML "<?xml version='1.0'?><root/>"

Dim child
' VBScript escapes double-quote as double double-quote
Set child = dom.createNode(1, "xxx:child", "xxx"" xmlns:y=""yyy")
dom.documentElement.appendChild child

dom.setProperty "SelectionNamespaces", "xmlns:a='xxx' xmlns:b='yyy'"

' This XPath will fail, as 'xxx' is not the same as 'xxx"" xmlns:y=""yyy"'
If dom.selectSingleNode("//a:child") Is Nothing Then
    WScript.Echo "Can't xpath to xxx:child"
Else
    WScript.Echo "Can xpath to xxx:child"
End If

dom.save "c:\temp\tempdom.xml"
dom.load "c:\temp\tempdom.xml"

' The rather bizarre namespace name doesn't survive the round-trip to disk, and this XPath succeeds
If dom.selectSingleNode("//a:child") Is Nothing Then
    WScript.Echo "Can't xpath to xxx:child"
Else
    WScript.Echo "Can xpath to xxx:child"
End If
It's this inconsistency that makes me say it's broken. The document represented by your DOM should always survive a round-trip via the serialization mechanism. Either the serialization mechanism has to find some way to represent the namespace name with the double-quote in it (is there any reason why you couldn't say <xxx:child xmlns:xxx="xxx&quot; xmlns:y=&quot;yyy"/>?), or you have to regard the double-quote as illegal in a namespace name.
That's just the purist in me speaking. On the other hand, I can't think of a good use case where a mended implementation would be better than the currently broken one. At least the broken version lets Rob serialize how he wants.
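On the serialization question above: escaping the quote is what at least one other toolkit appears to do. The sketch below uses Python's standard `xml.etree.ElementTree`, not MSXML4, so it says nothing about how MSXML behaves; `round_trip` is a helper invented for this example. It puts an element in the odd namespace, serializes it, parses the text back, and reports what came out.

```python
import xml.etree.ElementTree as ET

# The namespace name that Rob's hack actually hands to the DOM.
ODD_NAMESPACE = 'xxx" xmlns:y="yyy'


def round_trip(namespace_name):
    """Serialize an element in the given namespace and parse it back.

    Returns the serialized text and the tag of the reparsed element.
    """
    child = ET.Element("{%s}child" % namespace_name)
    xml_text = ET.tostring(child, encoding="unicode")
    reparsed = ET.fromstring(xml_text)
    return xml_text, reparsed.tag


xml_text, tag = round_trip(ODD_NAMESPACE)
print(xml_text)
# True if the namespace name survived serialization unchanged.
print(tag == "{%s}child" % ODD_NAMESPACE)
```

If the serializer writes the embedded quotes as `&quot;`, the namespace name comes back as the same string, which is the consistency I was asking for. The price is that the output contains a single, strange namespace declaration rather than the two declarations Rob was after.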