January 8, 2007
I’ve been reading research papers about XQuery recently, and I am impressed. I’ve always had the impression that the number of papers increased significantly during the XPath 2.0/XQuery 1.0 “era”, but my conviction that the organic nature of XML is hopeless to query and store efficiently had withstood until now; overturning it is one of the few interesting discoveries I’ve made while scanning papers.
But that’s the positive side of it.
The major problem is that scientists seem to live in some kind of denial.
For example, XMill: an Efficient Compressor for XML Data discusses various techniques for efficiently compressing XML documents. It compares a log of HTTP traffic in a custom text format, compressed and uncompressed, to the same data stored in an XML format, and discusses the effectiveness of different compression techniques applied to the custom format and to the XML document. Their compression applied to the XML document (which is a lot more verbose than the custom text format) yields a smaller file than the compressed custom format, which is impressive: nice numbers.
The problem is that in order to reach those compression ratios, one has to use custom compression algorithms and, in addition, feed the compressor a configuration file that teaches it about the XML format being compressed. In other words, it turns the XML document into a custom, binary format, which defeats the very reason XML was chosen in the first place. So no matter how good those compression ratios are, they can’t be deployed except within a closed, proprietary network, and hence they will most likely stay on that paper.
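To make the container idea concrete, here is a minimal sketch in Python. It is my own illustration of the general principle, not XMill’s actual algorithm, file format or configuration language, and the log data is made up: splitting a document into a structure stream and one text container per element name lets even a generic compressor like zlib find redundancy that is diluted when everything is interleaved.

    # My own sketch of container-based XML compression, not XMill itself:
    # separate structure from content, group similar values, compress apart.
    import zlib
    import xml.etree.ElementTree as ET

    # Made-up, repetitive HTTP-log-like data standing in for the paper's corpus.
    doc = "<log>" + "".join(
        f"<entry><host>10.0.0.{i % 8}</host><status>200</status></entry>"
        for i in range(1000)
    ) + "</log>"

    # Baseline: compress the interleaved document as one opaque stream.
    whole = len(zlib.compress(doc.encode()))

    structure = []   # stream of structural tokens (element names)
    containers = {}  # element name -> list of its text values
    for elem in ET.fromstring(doc).iter():
        structure.append(elem.tag)
        if elem.text and elem.text.strip():
            containers.setdefault(elem.tag, []).append(elem.text)

    # Compress the structure stream and each container separately, sum the sizes.
    split = len(zlib.compress(" ".join(structure).encode()))
    split += sum(len(zlib.compress("\n".join(v).encode()))
                 for v in containers.values())

    print(f"whole document: {whole} bytes; structure + containers: {split} bytes")

XMill goes much further by letting the configuration file assign a specialized compressor to each container, and that is exactly the part that turns the output into a format only that configuration can decode.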
Another problem is presentation. Many papers spend a good portion of their text translating terms into formalism and into what are simply other names for the same things. Sometimes this formalism is necessary, but it always means quite some effort is required to decipher a single paper’s home-made symbols, conventions and terms. If all I did was read papers, I presume it wouldn’t be a problem, but until that happens, it is.
W3C XML Schema is perhaps what gets criticized the most for being an academic exercise. That’s one reason I (relatively) enjoyed reading XProc: An XML Pipeline Language. Even though its subject is technical, it doesn’t try to sound technical or academic. It adapts to me as a reader.
A related topic is how good papers are at explaining their theories. Some papers I grasp after the first read, while others I am still struggling with. The same applies to Wikipedia entries. Does my brain have a latent glitch? Can it all be explained by some theories being inherently more difficult to express? I’d say that a scientist’s success depends not only on whom he or she quotes or what factual data he or she has managed to consume; it also depends on that person’s pedagogical skills.
I have lost count of how many papers I’ve read on algorithms for axes and on storing XML that somewhere sneak into their discussion that processing instructions, comments, namespace declarations, text nodes or something else are ignored. Hello? I’m sure that’s a very comfortable way to do things, but it’s quite difficult to implement that super nifty storage scheme or that super cool axis algorithm for XQuery if the cost is that… you don’t implement XQuery.
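Those node types are not exotic, either; any conforming parser reports them, as a quick illustration with Python’s expat bindings shows (my own example document, not taken from any of the papers):

    # Every event below corresponds to something a complete XQuery store
    # must preserve, not just the elements.
    import xml.parsers.expat

    doc = b"""<?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="view.xsl"?>
    <doc xmlns:m="http://example.org/m">
      <!-- a comment the storage scheme must not drop -->
      <m:item>a text node</m:item>
    </doc>"""

    events = []
    p = xml.parsers.expat.ParserCreate()
    p.ProcessingInstructionHandler = lambda target, data: events.append(("pi", target))
    p.CommentHandler = lambda data: events.append(("comment", data.strip()))
    p.StartElementHandler = lambda name, attrs: events.append(("element", name))

    def character_data(data):
        if data.strip():
            events.append(("text", data.strip()))
    p.CharacterDataHandler = character_data

    p.Parse(doc, True)
    print(events)
    # [('pi', 'xml-stylesheet'), ('element', 'doc'),
    #  ('comment', 'a comment the storage scheme must not drop'),
    #  ('element', 'm:item'), ('text', 'a text node')]

A storage scheme that silently drops the first, third or fifth of those events is not storing the document; it is storing an approximation of it.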
The same can be seen with some XML parser “optimizations”. When it turns out that an optimization comes at the cost of not conforming to the XML specification, it’s not impressive any longer. It’s just wrong. It’s not cool to write an algorithm or implement an optimization that ignores reality. It’s cool to play by the rules and, on top of that, bring actual improvements, instead of bending the world towards what is practical for the researcher.