xmlstat
January 9, 2007
I wrote a small tool for extracting statistics about XML documents. If I were less lazy it could be more useful, but I think it is of some use as it stands.
With XML documents, as with many other things, it is difficult to perceive actual circumstances. Hence we construct measuring devices. Here’s the output from invoking xmlstat on the catalog file that describes the tests in W3C’s XQuery Test Suite:
    Document: file:///home/fenglich/kde/src/playground/utils/xmlstat/XQTSCatalog.xml
    File size(bytes): 8904307
    -------------------------------------------
    Elements: 69062
    Attributes: 170741
    Non-Whitespace Text nodes: 32936
    Whitespace Text nodes: 80810
    Total text-node count: 113746
    Comments: 0
    Processing Instructions: 2
    Total: 239805
    -------------------------------------------
    XQueryXQueryOffsetPath 1
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}contextItem 2
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}scenario 4
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}input-document 5
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}role 6
    schema 8
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}citation-spec 9
    input-document 11
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}note 14
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}input-URI 23
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}context-property 30
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}implementation-defined-item 37
    namespace 43
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}input-query 44
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}source 49
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}module 58
    FileName 75
    ID 77
    featureOwner 94
    last-mod 138
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}test-group 358
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}title 359
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}expected-error 1141
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}output-file 9638
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}test-case 10519
    is-XPath2 10544
    date 10563
    Creator 10585
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}input-file 10636
    variable 10703
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}description 11380
    section-title 13823
    spec 13860
    role 20303
    name 21540
    Total name count: 71
    -------------------------------------------
    {http://www.w3.org/2005/02/query-test-XQTSCatalog}
    {http://www.w3.org/2001/XMLSchema-instance}xsi
    Total bindings count: 2
    -------------------------------------------
    Non-Whitespace chars: 1346407
    Whitespace chars: 1706529
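For readers who want to try this kind of counting themselves, here is a rough Python sketch that gathers similar numbers with the standard library's minidom. It is not how xmlstat works internally (that is C++ on QtCore and QtXml); the function name xml_stats and the counter keys are made up for illustration, and I assume that namespace declarations should not be counted as attributes.

```python
from collections import Counter
from xml.dom import minidom, Node

# Namespace that minidom assigns to xmlns / xmlns:prefix declarations.
XMLNS_NS = "http://www.w3.org/2000/xmlns/"

def clark(node):
    """Return {namespace}local for qualified names, plain local otherwise."""
    local = node.localName or node.nodeName
    return f"{{{node.namespaceURI}}}{local}" if node.namespaceURI else local

def xml_stats(xml_text):
    """Count node kinds and name occurrences in one XML document (hypothetical helper)."""
    doc = minidom.parseString(xml_text)
    kinds, names = Counter(), Counter()
    stack = [doc]
    while stack:  # iterative walk, so deep documents do not hit the recursion limit
        node = stack.pop()
        if node.nodeType == Node.ELEMENT_NODE:
            kinds["elements"] += 1
            names[clark(node)] += 1
            attrs = node.attributes
            for i in range(attrs.length):
                attr = attrs.item(i)
                if attr.namespaceURI == XMLNS_NS:
                    continue  # assumption: declarations are bindings, not attributes
                kinds["attributes"] += 1
                names[clark(attr)] += 1
        elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            key = "non_whitespace_text" if node.data.strip() else "whitespace_text"
            kinds[key] += 1
        elif node.nodeType == Node.COMMENT_NODE:
            kinds["comments"] += 1
        elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            kinds["processing_instructions"] += 1
        stack.extend(node.childNodes)
    return kinds, names
```

Calling xml_stats on a document's text returns two counters: one per node kind and one per name in {namespace}local form, which is the notation the listing above uses.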
The output is raw and the labeling is blunt and confusing, but there are some things one can see:
- The document has 71 distinct qualified names in total, but they occur almost a quarter of a million times: 239803 (the element count plus the attribute count). This is partly why name pools are popular among implementors of both relational storage models and trees: storing each name once and referring to it by a small identifier reduces memory usage considerably.
- The attribute count is more than twice the element count. That puts things in perspective, at least for me. The ratio is a bit off compared to the broader picture, but not by much, as we shall see.
- Text nodes containing only whitespace (basically always formatting) outnumber those with actual content, and there are more whitespace characters than non-whitespace ones (Non-Whitespace chars versus Whitespace chars).
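The name pool idea from the first point can be sketched in a few lines of Python. This is only my illustration of the general technique, not code from xmlstat or any particular XML library; the class NamePool and its methods are hypothetical.

```python
class NamePool:
    """Stores each distinct qualified name once and hands out small integer codes."""

    def __init__(self):
        self._codes = {}   # name -> code
        self._names = []   # code -> name

    def intern(self, name):
        """Return the code for name, allocating a new one on first sight."""
        code = self._codes.get(name)
        if code is None:
            code = len(self._names)
            self._codes[name] = code
            self._names.append(name)
        return code

    def name(self, code):
        """Look a code back up to the full name."""
        return self._names[code]

    def __len__(self):
        return len(self._names)


ns = "{http://www.w3.org/2005/02/query-test-XQTSCatalog}"
pool = NamePool()
# Each node keeps only an integer; the long strings live in the pool once.
node_names = [pool.intern(n) for n in
              [ns + "test-case", "name", ns + "test-case", "name", ns + "input-file"]]
```

With 71 names and 239803 occurrences, as in the catalog file, the pool would hold 71 strings while every element and attribute carries just a code.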
These “conclusions” aren’t that interesting, since they’re based on only one document, but I state them nevertheless because they roughly mimic what The XML Web: a First Study says.
That paper contains statistics on XML documents found on the web, but for these purposes it is relatively old (from 2003). For that reason I would not be surprised if the numbers have changed since then, considering the widespread use of feeds, for instance. Another point is that their statistics are in some cases quite heavily influenced by individual sites (such as http://rpmfind.net/ or http://w3.org/), which to me suggests that the data sample is too small.
Here’s a very concentrated list of their conclusions:
- “WAP and RDF make up 26% and 17% of all document on the XML Web, respectively.”
- “Our results show that XML documents are in fact relatively shallow: 99% of them have less than 8 levels of element nesting. Also, 15% of the documents we analyzed have recursive content, in which there is much regularity.”
- “Only 75 different DTDs are referenced in our sample” (which is about 200,000 documents). “92% of all DTD references are made to norms 1.1 or 1.2 of the WAP protocol.”
- “Only 0.09% of the documents use either the attribute label SchemaLocation or noNameSpaceSchemaLocation” (but as in the case with DTDs, not referencing a schema from the document is not equivalent to not using a schema for that document).
- “For documents up to 4096 bytes, the number of element nodes dominates the distributions”
- “For documents larger than 4096 bytes, there are proportionally more attribute nodes than element nodes (51.13% vs. 37.83%).”
- “These observations led us to conclude that the structural information found in XML documents is in fact dominant over the textual content.”
- “It turns out that 782,602 elements (5% in total) have mixed content. Surprisingly, these elements belong to 138,298 documents (72% of all documents).”
- “The prevailing assumption in this community [database community] is that attributes and mixed element content are not as important as element content.”
- “99% of the documents have fewer than 8 levels [level refers to tree depth]. The average depth is 4, and the deepest document has 135 levels.”
- “On average, the second level contains more attributes than any other level. In fact, 89% of all attributes are found in the first 3 levels of the documents.”
- “77% of all element nodes and 6% of all text are found in the first 3 levels of the documents.”
- “28,208 XML documents (14.81% of the total) contain recursive elements.”
- “The average document size is 4kb”
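Several of the measures in this list (depth, mixed content, recursive elements) are easy to compute for a single document. Below is a small Python sketch of one way to do it. The definitions are my own reading, not necessarily the paper's: the root counts as level 1, an element has mixed content if it has both child elements and non-whitespace text, and an element is recursive if one of its ancestors has the same name. The function name shape_stats is made up.

```python
import xml.etree.ElementTree as ET

def shape_stats(xml_text):
    """Return max depth, mixed-content count and recursive-element count (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    stats = {"max_depth": 0, "mixed_elements": 0, "recursive_elements": 0}

    def visit(elem, depth, ancestors):
        stats["max_depth"] = max(stats["max_depth"], depth)
        # Assumption: "recursive" means an ancestor carries the same name.
        if elem.tag in ancestors:
            stats["recursive_elements"] += 1
        children = list(elem)
        # ElementTree keeps text before the first child in .text and
        # text after each child in that child's .tail.
        has_text = bool((elem.text or "").strip()) or any(
            (child.tail or "").strip() for child in children
        )
        if children and has_text:
            stats["mixed_elements"] += 1
        for child in children:
            visit(child, depth + 1, ancestors | {elem.tag})

    # Plain recursion is enough for a sketch; the deepest document in the
    # study has 135 levels, well below Python's default recursion limit.
    visit(root, 1, frozenset())
    return stats
```

Running this over a large collection and aggregating the results would give per-document figures comparable in spirit to the ones quoted above, though not necessarily in definition.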
I find statistics like these very useful and I believe they can play an important role in discussing implementation approaches.
I would surely not mind a second such study. Perhaps it should use a larger document sample in order not to be thrown off by individual sites. It would also be nice to see the distribution of encodings used, and the relationship between whitespace-only and regular text nodes.
And of course, xmlstat could be a lot more useful. Ideally it would produce an XHTML page with bar charts describing name distributions, node type distributions, concentrations in relation to depth, and so on. That would help with making implementation decisions. It wouldn’t surprise me if it’s useful for things like XSL-T debugging as well.
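To give an idea of what such a page could contain, here is a rough Python sketch that turns a table of name counts into an XHTML table with proportional bars. Nothing like this exists in xmlstat today; the function name_bars, the pixel width and the colour are all invented for the example.

```python
from xml.sax.saxutils import escape

def name_bars(counts, max_width=400):
    """Return an XHTML table fragment with one bar per name (hypothetical helper)."""
    largest = max(counts.values(), default=1)
    rows = []
    # Most frequent names first, bar width scaled against the largest count.
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        width = round(max_width * count / largest)
        rows.append(
            f"<tr><td>{escape(name)}</td>"
            f'<td><div style="width:{width}px;background:#369">&#160;</div></td>'
            f"<td>{count}</td></tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"

# Three of the most frequent names from the catalog file's output.
fragment = name_bars({"spec": 13860, "name": 21540, "role": 20303})
```

The same approach would work for node type distributions or per-depth counts, since each is just another mapping from a label to a number.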
Either way, feel free to use or improve this simple and primitive tool. It’s in KDE’s SVN repository, in playground/utils/xmlstat, is licensed under the GNU GPL, and is based on QtCore, QtXml and qmake.
April 22, 2009 at 1:38 pm
hm. love it 🙂