January 11, 2007
Patternist, the XQuery/XPath/XSL-T framework, is abstracted so that it can use different tree implementations, conceptually like Saxon. Up until now, Patternist has been using one that wrapped Qt’s QDom. When I wrote that very first tree backend, the purpose was to bootstrap the rest of the code: a temporary solution that got the job done until the production-grade one arrived. QDom’s massive memory usage — my measurements say roughly 18 times the document size — is people’s usual complaint. The reason I stalled was that the XPath Data Model simply couldn’t be implemented with QDom, let alone efficiently. So what now?
This blog entry tinkers — although without accompanying code — with how to represent XML.
For the purpose of writing a pleasing backend, I started reading up on the research. There is a lot of work in this area, and I still have no trouble finding relevant papers. The focus of the papers I’ve read so far (and, I suspect, of the field at large) is representing XML with respect to storage size and efficient evaluation of XPath paths, with a strong eye on axes.
One thing worth noticing: even though static typing of sequence cardinalities is ardently advocated and readily implemented in XQuery, I have yet to find literature that makes use of it. Well, perhaps that will arrive later on.
Although it may sound obvious, “storing XML” is not a homogeneous term, which surely must be taken into account when one starts to pick and choose between the different proposals. There is a clear dominance of persistent, relational storage (not that this surprises anyone), which steers the discussions — table normalization, what to index, database optimizations to exploit, etc. — in a direction that typically isn’t that useful for the more old-fashioned in-memory approaches.
However, one does find significant amounts of work in other directions. For example, Efficient Path Query Processing on Encoded XML discusses how to query streamed XML on resource-constrained devices, and Efficient XPath Axis Evaluation for DOM Data Structures discusses pre/post numbering close to DOM.
A common pattern throughout is diversity, which perhaps can be derived from the genericity of XML and its organic structure. People use XQuery and XML in such different ways that ideas travel poorly between different scenarios.
Sometimes I think this goes so far as to lose touch with reality. For example, one paper, competing to implement the XPath axes in the best way, does so at the cost of as many as four full-document indexes and the storage ratio that follows. My impression, judging from vendor forums for example, is that document size is one of the most common problems people have. Even if a certain proposal for axes is very fast, who can in practice take the storage hit it requires?
The techniques one decides on are therefore extremely dependent on what the scenarios are. If one is lucky enough to be able to constrain what kind of documents one reads — count Patternist out — the best one can do is perhaps to adapt at runtime.
One of the key elements that the “XML era” of data management brings is increased freedom and diversity, something as concrete as XML’s syntax compared to a tabular format: “While it’s generally very easy to map relational data into XML, trying to map XML into relational data can be incredibly difficult.”
However, one trend is not to support XML’s unpredictability but to go the opposite way: excluding node types, XPath axes, and features such as pessimistic typing of cardinalities. What makes me excited about XML is exactly this increased flexibility it gives users, at the expense of computing power. And that’s the right direction, although it does mean implementors have to make an effort.
For Patternist, the requirements are as follows:
- Implement the XPath Data Model. This means, for example, that for a given node it must be possible to determine the namespaces in scope, that processing instructions and comments are represented, and that text nodes are text nodes, not only the string value of an element. However, CDATA sections and regular text nodes aren’t distinguished (which is a good thing), and system entities and system entity references are not part of the data model.
- Be XML 1.0/Namespaces 1.0 based, as opposed to 1.1, for a plethora of reasons.
- Support serialization/round-tripping to the degree that XSLT 2.0 and XQuery 1.0 Serialization specifies.
- Memory consumption must be acceptable to users, which means it must be a radical improvement over QDom. However, it’s not the only requirement, and one hence can’t pursue it blindly at the cost of the others.
- The faster XQuery queries are, the better. For example, if the time for sorting two arbitrary nodes in document order is linear in the distance between them in document order (as is the case with QDom and most other DOM implementations), one can pretty much declare game over.
- It’s about representing and writing arbitrary documents, in arbitrary locations, in arbitrary amounts. For that reason, bulk loading into persistent storage is not an option.
- Updates must be possible to the degree that the XQuery Update Facility requires. Why not even more fine-grained write support? Because I believe XQuery Update does a good job of taking users’ needs into account while giving implementors good options for doing things efficiently.
Which tradeoffs to make on this game board is up for discussion, with one exception: conformance.
Node Numbering Schemes
“Node Numbering is one of the more interesting, and important design choices in an XML database,” says Anatomy of an XML Database: Oracle Berkeley DB XML (which gives a good overview of databases and XML), and I agree. It seems to be the key decision, and it largely affects how the rest is wired.
The reason stems from XPath’s requirement that the results of path expressions be free from duplicates and sorted in document order. In order to optimize those operations (de-duplication, sorting), implementors attempt to encode the structural relationship (e.g., whether a node is a following sibling of another, and so forth) in the node identifier, in order to avoid additional lookups. The indirection otherwise needed is relatively cheap for in-memory models (since it’s typically a pointer being dereferenced, or some kind of array lookup), while for implementations on top of relational databases it requires an additional join to perform a structural join.
One widely popular node numbering scheme is Dietz’s pre/post numbering, which is excellently explained on the MonetDB project’s site. Its major advantage is that node relationships can be determined in constant time, but it does so at the cost of requiring a renumbering of the whole tree when nodes are inserted.
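To make the constant-time claim concrete, here is a minimal sketch of a simplified form of Dietz’s scheme (illustrative only; the MonetDB material covers the real thing):

    // A simplified sketch of Dietz's pre/post numbering: each node carries
    // its rank in a preorder and a postorder traversal of the tree.
    struct PrePost
    {
        int pre;    // preorder rank
        int post;   // postorder rank
    };

    // x is an ancestor of y exactly when x comes before y in preorder
    // and after y in postorder -- a constant-time test.
    bool isAncestorOf(const PrePost &x, const PrePost &y)
    {
        return x.pre < y.pre && x.post > y.post;
    }

    // Document order is simply the preorder rank, so sorting and duplicate
    // elimination reduce to integer comparisons.
    bool precedesInDocumentOrder(const PrePost &x, const PrePost &y)
    {
        return x.pre < y.pre;
    }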
There are many variants of pre/post numbering, such as replacing the post number with an approximated child count to reduce the number of renumberings on updates. The problem with the different variants is that they all have some kind of flaw, still leaving one with a bad taste.
However, the Dynamic Level Numbering (DLN) scheme, nicely explained on eXist’s site, which essentially extends Dewey’s Decimal Classification with support for incremental updates and encodings for efficient storage, seems to be free of the other schemes’ issues, at the cost of being of variable length. In practice this means each identifier needs a heap allocation, which is costly considering how node identifiers are used.
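As a hedged illustration of the Dewey part (ignoring DLN’s sublevel encoding for updates), an identifier is simply the path of child positions from the root, and the interesting operations become vector comparisons:

    #include <algorithm>
    #include <vector>

    // "1.3.2" becomes {1, 3, 2}: the path of child positions from the root.
    typedef std::vector<unsigned> DeweyID;

    // Document order is lexicographic comparison of the levels.
    bool precedes(const DeweyID &a, const DeweyID &b)
    {
        return std::lexicographical_compare(a.begin(), a.end(),
                                            b.begin(), b.end());
    }

    // Ancestorship is a prefix test: {1, 3} is an ancestor of {1, 3, 2}.
    bool isAncestorOf(const DeweyID &a, const DeweyID &b)
    {
        return a.size() < b.size()
               && std::equal(a.begin(), a.end(), b.begin());
    }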
However, in C++ I believe the allocation cost can be tackled quite well with a classic union trick: it is only at abnormal tree depths or very large node counts that identifiers grow to considerable sizes, meaning the identifier can usually be stored in 63 bits, with the last bit signaling whether the word is instead used as a pointer to a heap-allocated structure (otherwise the identifier is simply stored inline in the 63 bits).
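A minimal sketch of that trick, assuming a 64-bit word and at least 2-byte-aligned heap pointers (so the lowest bit is free to act as the tag); the names are invented:

    #include <cstdint>
    #include <vector>

    class NodeID
    {
    public:
        // The common case: the whole identifier fits in 63 bits.
        static NodeID fromInline(std::uint64_t id)
        {
            NodeID n;
            n.m_word = (id << 1) | 1;   // lowest bit set: inline storage
            return n;
        }

        // The rare case: deep trees or huge node counts overflow to the heap.
        static NodeID fromOverflow(std::vector<unsigned> *levels)
        {
            NodeID n;
            n.m_word = reinterpret_cast<std::uintptr_t>(levels);  // bit 0 stays 0
            return n;
        }

        bool isInline() const { return m_word & 1; }

        std::uint64_t inlineID() const { return m_word >> 1; }

        std::vector<unsigned> *overflow() const
        {
            return reinterpret_cast<std::vector<unsigned> *>(
                static_cast<std::uintptr_t>(m_word));
        }

    private:
        std::uint64_t m_word;
    };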
The XML Web: a First Study says that of the roughly 200 000 XML documents they scanned, “99% of the documents have fewer than 8 levels. The average depth is 4,” which sounds promising for the DLN scheme. The hit taken by large documents is more of a concern, but on the other hand it first occurs once the “local” bits have run out.
Compression & Text Nodes
KDE developer Ariya Hidayat reaches impressive size ratios in his KoXmlReader experiments with the use of heavy compression (among other things). The requirements for KoXmlReader and an XQuery implementation are vastly different, though. For functionality that is supposed to actually access (query) the nodes, one could quickly conclude that compression’s tradeoff (slow access) weighs heavily, but I don’t think that necessarily holds.
I believe it was Michael Kay who somewhere wrote that lazy parsing of source documents could be sensible for XSL-T transforms, because often not all nodes are needed (and hence don’t need to be represented). By the same reasoning, compressing text nodes could be sensible as well. After all, selecting a whole document is not a very useful query.
It’s also worth considering what is being compressed. If I’ve read the KoXmlReader experiments correctly, they compressed western codepoints in UTF-16 encoding, which I suspect is grateful input for many compression algorithms. Another possibility is to store UTF-8 instead of UTF-16, if that encoding as well as western content is common — a web study reporting on encoding distribution would be helpful.
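As a sketch of what storing UTF-8 could look like in a Qt setting (an assumed design, not Patternist’s actual code), the raw bytes are kept and a QString is produced only when the string value is actually requested:

    #include <QByteArray>
    #include <QString>

    class TextNodeData
    {
    public:
        explicit TextNodeData(const QByteArray &utf8) : m_utf8(utf8) {}

        // Decode lazily: western content stored as UTF-8 takes one byte per
        // character, against QString's two.
        QString stringValue() const
        {
            return QString::fromUtf8(m_utf8.constData(), m_utf8.size());
        }

        int storageSize() const { return m_utf8.size(); }  // bytes, not UTF-16 units

    private:
        QByteArray m_utf8;
    };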
Another possibility is to let the parser not decode at all (performing validity checks only) and store the text in the document’s encoding, which could possibly speed up serialization, a notably expensive operation. However, doing so requires rather heavy surgery, and it rules out features such as system entities (which on the other hand are absent from the XPath Data Model anyway), to mention one of several things to consider.
Another possibility is to compress text nodes consisting only of whitespace, and to exploit that knowledge when doing string matches as well. However, implementing this with Qt/C++ is tricky, considering QString’s non-polymorphic design. Unfortunately, The XML Web: a First Study doesn’t discuss the content of text nodes, which makes it difficult to tell how relevant this discussion is, although folklore and sporadic tests suggest whitespace accounts for a significant share.
One approach is to compensate for the difficulty of accessing the compressed nodes with a full-text index, which as an upside is designed for actually querying text nodes. Or rather, perhaps the compressed nodes are the price for a full-text feature. One could also attempt to base primitive searches such as fn:contains() on the full-text index. Not that I have started implementing Full-Text.
Name Pools
Name pools, typically implemented such that each name is stored as a string only once and subsequently referred to by an id, are popular both in relational databases and in in-memory representations. The reason seems to be simply that structural content dominates an XML document (as the Web Study discusses in section 3.2, Node Distribution), and that name comparisons become cheap id comparisons. Name pooling seems to be one of the most effective space optimizations one can apply.
Worth commenting on is how the name pool is populated. Saxon’s NamePool loops over the name index each time an id is requested, in order to find the index of a previously inserted identical string, if any. This is about as slow as an insert gets; others optimize it by keeping a hash with the string’s index as the value.
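A minimal sketch of the hash-assisted variant (an assumed design with invented names, not Saxon’s or Patternist’s actual code):

    #include <QHash>
    #include <QString>
    #include <QVector>

    class NamePool
    {
    public:
        // Returns the id for name, inserting it if it hasn't been seen.
        // The hash makes the lookup amortized constant time, instead of a
        // linear scan over all previously inserted names.
        int idFor(const QString &name)
        {
            const int existing = m_ids.value(name, -1);
            if (existing != -1)
                return existing;

            const int id = m_names.count();
            m_names.append(name);     // id -> name
            m_ids.insert(name, id);   // name -> id: the space cost discussed below
            return id;
        }

        QString name(int id) const { return m_names.at(id); }

    private:
        QVector<QString> m_names;
        QHash<QString, int> m_ids;
    };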
The hash costs space (although not much, since it’s linear in the number of distinct names, as opposed to the count of all names), which possibly could be skipped if one could approximate which names appear often. For example, the html element in an XHTML document appears only once, and the same applies more or less to the following 3-4 elements, yet they are inserted first in the name index and are therefore iterated over for every following name, for no useful reason. If one could approximate where the common names start — the Web Study finds many intriguing, strong patterns — one could offset the first names further up the index, or store them in two different indexes (“rare” and “common”).
Being able to approximate which names will appear often is also of interest for doing pre-allocations. For example, as can be seen in the xmlstat blog entry, there are about 13 names that occur a vast number of times. An index that has the name as key and node identifiers as values will likely use growth strategies that are too conservative for exactly those names.
File Size as Parameter
For a wide variety of choices, the XML representation consists of integers whose required range depends on the document size: the maximum number of nodes, the name distribution, or the tree depth, for example. Using 64 bits for document order numbers is excessive if the actual number of nodes easily fits in 16 bits.
One approach here could be to use the file size (which is relatively cheap to retrieve for a local file that is about to be parsed), if one is available, as a hint to the XML representation, which subsequently could use more appropriate storage types. More practically, this could for example be implemented in C++ with a template implementation that is instantiated for a set of common sizes: “small”, “medium”, and “large”.
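A sketch of how the dispatch could look (the thresholds and names are invented for illustration; deriving safe thresholds is exactly the analysis discussed below):

    #include <cstdint>
    #include <vector>

    // The integer type used for referring to nodes is a template parameter.
    template<typename NodeRef>
    struct TreeStorage
    {
        std::vector<NodeRef> parent;       // parent of each node
        std::vector<NodeRef> nextSibling;  // and so on for the other stores
    };

    template<typename NodeRef>
    void parseInto(TreeStorage<NodeRef> &storage)
    {
        // The parser would populate the stores here.
    }

    // Pick an instantiation from the file size.
    void loadDocument(std::uint64_t fileSize)
    {
        if (fileSize < 0xFFFFu) {
            TreeStorage<std::uint16_t> small;      // "small"
            parseInto(small);
        } else if (fileSize < 0xFFFFFFFFu) {
            TreeStorage<std::uint32_t> medium;     // "medium"
            parseInto(medium);
        } else {
            TreeStorage<std::uint64_t> large;      // "large"
            parseInto(large);
        }
    }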
This space optimization probably applies mostly to small documents, which can be significant on resource-constrained devices or when many small documents are open, such as a collection of small documents used as a database (a usage scenario that Apache’s XIndice seems to be optimized for).
The first step would be to determine which parameters in the implementation depend on the document size: for example, to be able to answer the question “What is the maximum file size that can be handled if I store the node count in 16 bits?” and other peculiar puzzles posed by the XML syntax.
Even if one decides to trust one’s calculations, there is the possibility that a file is encoded in a codec with a smaller per-character size than a byte (possibly as a malicious attack, by installing such a codec), so it is probably risky business to skip bounds checking for the fixed sizes.
File size can also be used as a hint for pre-allocations, especially if ratios for the different stores in the representation are known.
Name Indexes
Due to the requirements for Patternist’s backend, indexes aren’t that relevant, except in one particular case: a map from name to attribute and element nodes.
One reason is that it allows threaded execution of path expressions: the steps in c/d can be evaluated concurrently, to afterwards be joined. Many XPath operators can be threaded in typical implementations (not that I can name any that does) because the operands do not depend on each other, but without such an index that is not the case for path expressions.
Another reason is the descendant axis, commonly abused but also validly used, which should be significantly faster in many scenarios when a name index is available.
It is hard to tell whether the storage for such an index is worth it, for several reasons. For starters, it could be of significant size: for the document in the xmlstat blog entry it would probably be around 2 MB, assuming each node identifier consumes 64 bits (although it should be taken into account that the source document is relatively big: 8.5 MB).
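A sketch of such an index (an assumed layout): a map from the pooled name id to the identifiers of all nodes carrying that name, kept in document order so that structural joins can work on sorted input:

    #include <QHash>
    #include <QVector>

    typedef quint64 NodeIdentifier;   // e.g. a DLN identifier packed into 64 bits

    class NameIndex
    {
    public:
        // The parser appends nodes as it encounters them, so each list is
        // sorted in document order for free.
        void add(int nameID, NodeIdentifier node)
        {
            m_nodes[nameID].append(node);
        }

        QVector<NodeIdentifier> nodesFor(int nameID) const
        {
            return m_nodes.value(nameID);
        }

    private:
        QHash<int, QVector<NodeIdentifier> > m_nodes;
    };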
Query as Parameter
One way to improve both space and time consumption is to use query analysis to adapt the XML representation of the documents to be loaded. Examples follow.
Path expressions tell in a very clear way which nodes will be used. Hence one can selectively build representations for processing instructions, comments, attributes, and text nodes. For instance, the Web Study concludes that “for documents larger than 4096 bytes, there are proportionally more attribute nodes than element nodes (51.13% vs. 37.83%),” which suggests that significant space gains can be made.
Doing such a query analysis is straightforward, but needs consideration (a sketch of the analysis output follows the list). For instance:
- If positional predicates are present, skipped nodes must be compensated for by “artificial” placeholders in order to preserve the node numbering.
- Access to text nodes or the string value property is commonly implicit, through operators and functions, which is one reason why coding as much as possible of an XQuery implementation in XQuery itself is of interest: it simplifies this kind of analysis (as discussed in a previous blog entry).
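A sketch of what the analysis could produce (a hypothetical design, with invented names): a set of flags, accumulated over the expression tree, telling the loader which node kinds the query can possibly touch:

    #include <QVector>

    enum NodeKindFlag
    {
        BuildElements               = 1,
        BuildAttributes             = 2,
        BuildTextNodes              = 4,
        BuildComments               = 8,
        BuildProcessingInstructions = 16
    };

    // Hypothetical, minimal expression node for illustration.
    struct Expression
    {
        int nodeKindsTouched;              // what this construct touches directly
        QVector<Expression *> operands;
    };

    // Accumulate flags over the whole tree; node kinds whose flag stays
    // unset need no representation when the document is loaded.
    int requiredNodeKinds(const Expression *expr)
    {
        int flags = expr->nodeKindsTouched;
        for (int i = 0; i < expr->operands.count(); ++i)
            flags |= requiredNodeKinds(expr->operands.at(i));
        return flags;
    }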
Selectively building name indexes based on what’s needed can reduce the associated space costs. For example, if no name tests for attributes are present but the attribute axis is nevertheless used in some way, the name index doesn’t have to be populated for attributes.
If one assumes that the child axis is relatively fast, a node test such as name is relatively inexpensive; //*[@name], however, is of a different magnitude. One way to improve the decision for index building is to use weighting, as I call it, of constructs.
For example, considering the high representation of attribute nodes in documents, notably expensive path expressions involving attributes must be present before an index is populated with attribute nodes. This could make the cost of a name index bearable. Similarly, if a query contains a path expression with element tests that can be evaluated by a sequential walk (no complex axes, no sorting or de-duplication), it perhaps doesn’t qualify for an index build either.
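A sketch of what such weighting could look like; the constructs, weights, and threshold are all invented for illustration:

    enum Construct
    {
        ChildStep,           // cheap: a sequential walk suffices
        AttributeStep,
        DescendantStep,      // expensive without an index: full subtree scan
        PositionalPredicate
    };

    int weightOf(Construct c)
    {
        switch (c) {
        case ChildStep:           return 1;
        case AttributeStep:       return 2;
        case DescendantStep:      return 10;
        case PositionalPredicate: return 5;
        }
        return 0;
    }

    // Only build the index when the query's accumulated weight suggests the
    // evaluation cost outweighs the index's storage cost.
    bool shouldBuildNameIndex(int totalWeight)
    {
        return totalWeight > 20;   // invented threshold
    }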
Where to Go
Vaporware and speculation is what characterizes this text. Inventing approaches is not difficult; the hard, time-consuming, and not very interesting part is implementing and measuring, possibly with the mere result of realizing that the long path led to a dead end.
A yet wider issue is how users should access the result of a query (as verbosely discussed in XML APIs). A traditional approach, as seen in W3C’s XPath DOM Module or the XQuery API for Java, is the use of iterators. Another approach could be more in the direction of the XProc pipeline language, where for a query one would declare and name output pipes, which could be written to with a to-pipe("qualified-name-of-pipe", expression) function. I believe it could possibly:
- Allow more operations to be carried out within the query, instead of the user inspecting nodes from an iterator
- Give an improved understanding of how the results will be used, and hence a stronger foundation for further optimizations
- Allow the design of practical APIs for retrieving the result
However, I think one thing can be strongly concluded: declarative approaches are the way to go. One reason I am opposed to exposing functionality to users through traditional tree APIs, or low-level APIs in general, is that they kill every attempt at doing clever things, in addition to delegating the task of conforming to standards to the user.
One can’t apply a pipeline/early-exit strategy if the user can retrieve child nodes as a list; one can’t attempt efficient lookups if the user is doing them with a for-loop; threading becomes the user’s responsibility if a higher-level perspective is absent; and it is difficult to adapt document building if all the user has said is that a document needs to be loaded; and so on. That’s why I believe XQuery and pipeline languages like XProc are good ways to go.