Writing history with Microsoft's Office lock-in
No XML please, we're arbitrary
Sometimes, very small decisions can have a very big impact on how people work in the future. So join us, on a journey into the future: a story that begins with a little fudge.
In a little-noticed move, Microsoft has slipped on its commitment to produce open standard file formats for its Office products.
By maintaining a proprietary binary format that frequently changes, Microsoft has kept the exit costs high for potential defectors. However, Microsoft has for a long time touted its investment in XML as a sign of its commitment to openness.
You must remember that XML has always had a "feature" which distinguishes it from SGML, its much more complicated publishing predecessor. SGML insisted on leaving nothing to chance, but an XML parser can, without consulting a DTD (Document Type Definition) file, happily munch its way through a "well formed" XML document, leaving alone the many elements which have not been defined.
"Well formed" means that the document will parse without errors - it doesn't mean that the document will make any sense.
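To make the distinction concrete, here is a minimal sketch using the parser in Python's standard library, which does not validate against a DTD. The two sample documents and the `is_well_formed` helper are our own inventions for illustration, not anything Microsoft ships.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the element names are made up for this sketch.
WELL_FORMED = "<memo><VULTURE>top secret</VULTURE></memo>"
BROKEN = "<memo><VULTURE>top secret</memo>"  # VULTURE is never closed


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise.

    No DTD or schema is consulted, so an element nobody has defined
    (such as VULTURE) is accepted as long as the tags nest correctly.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED))
print(is_well_formed(BROKEN))
```

The first document passes even though nothing anywhere says what a VULTURE is. The second fails only because its tags do not nest, not because of what they mean.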
Some of our schemas are missing
Microsoft has made a curious choice. It has backed away from implementing an OASIS-defined industry standard by flying a populist flag. Microsoft will offer "freedom" to its users by letting them roll their own schemas.
Microsoft has done so by playing a six-cup shell game. There will be six versions of Microsoft Office 2003, but only two will support user-defined schemas. Can you guess under which two cups the schemas are hiding?
We'll tell you: Office Enterprise and Office Professional. As Joe Wilcox notes in this article, it's the first time such important functionality has been confined to particular variants of the suite.
For the rest of the time, you will be using Microsoft's own schema, WordML. But this is only open in the sense that XML is open.
So when you read a statement from Redmond (via Joe) that, "...when you are using Word in Office XP or the Standard version of Office 2003, the WordML--Microsoft's XML schema, which is 100 percent compliant with industry standards for XML--is saving the formatting of the Word doc," you can hear the sound of a wooden nose growing [*].
A splendid summary of the state of affairs can be found at XML Deviant, a column penned by Kendall Grant Clark.
Clark cites Mike Champion, who asks, "what is the point of storing data in XML if the schema [WordML] is so hideous and proprietary that no one can use it without proprietary API support?"
So in the future, you may be faced with two flavors of nonsense: XML Word documents that have been mangled by Microsoft's XML-creation tools, and XML Word documents that have been mangled by users who add their own non-standard elements (such as our Top Secret "VULTURE" tag).
Put your hands where we can see them
Now then. Microsoft argues, with some justification, that its binary Office format is superior technology to "open" and interoperable Unix file systems. The Unix people have barely begun to discuss a Peace Process for Metadata. Microsoft offers a richer format: it supports multiple data streams, and allows all kinds of interesting compound documents to be created.
But if Microsoft had taken note of the responsibilities that go with the power it wields, it would have documented the format and submitted it to a recognized standards body. It could then compete on its own skills as the best implementer of its home-grown format.
No XML please, we're arbitrary
(Kendall's must-read column goes on to other areas, such as the quality of WordML, and the market power that Microsoft, as a producer of XML content, will have over the language, which is an interesting discussion in itself.)
The user-defined schemas come with a very curious choice of name.
Forgive us for taking part in what looks like a semantic Jihad in recent weeks - yes, there are other useful ways of looking at the world - but sometimes the choice of language tells us a lot.
Microsoft calls these user-defined schemas "arbitrary schemas".
Remember me not
A very telling quote in Joe's piece comes from Jean Paoli, XML tribal elder and Microsoft's man in XML-land.
Paoli appears to have given up the pretense of Microsoft using XML as a document format at all.
"I'm out of the business of creating formats. Our focus on Office is on data exchange."
Data exchange. There's a good subject.
Let's add the factor "time" into the context. It's already quite hard for you to read EBCDIC documents, unless you have terminal access to an IBM mainframe - or the right IBM mainframe - as there were several EBCDICs and not all were compatible with each other. (Sound familiar?)
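For a taste of the problem, here is a small sketch using two of the EBCDIC code pages that ship with Python, cp037 and cp500. The choice of these two pages and the `decode_both` helper are ours, picked only because both are in the standard library.

```python
def decode_both(raw):
    """Decode the same bytes under two EBCDIC code pages."""
    return raw.decode("cp037"), raw.decode("cp500")


# Plain capital letters sit at the same byte values in both pages...
letters = "HELLO".encode("cp037")
print(decode_both(letters))

# ...but a square bracket written under cp037 comes back as a
# different character when the reader assumes cp500.
bracket = "[".encode("cp037")
print(decode_both(bracket))
```

The letters survive the round trip and the punctuation does not, which is the sort of quiet corruption that goes unnoticed until someone needs the document decades later.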
Simon Phipps, who works for Sun but here is speaking for himself, makes an important point:
"We continue to live in a world where all our know-how is locked into binary files in an unknown format. If our documents are our corporate memory, Microsoft still has us all condemned to Alzheimer's."
He has identified that if we want our data to live on, we need Microsoft to live on too, to help us read it.
So regarding data exchange, who is exchanging what with whom here?
We need our history and our historians. And by ensuring data formats are vendor specific, we're already defining the constraints under which future historians will operate. ®
[*] Creative readers are encouraged to submit entries for what this may sound like, please - no files larger than 35kb.