xmlstat

January 9, 2007

I wrote a small tool for extracting statistics about XML documents. If I were less lazy, it could be more useful. Still, I think it is of some use.

With XML documents, as with many other things, it is difficult to perceive actual circumstances. Hence we construct measuring devices. Here’s the output from invoking xmlstat on the catalog file that describes the tests in W3C’s XQuery Test Suite:

Document:         file:///home/fenglich/kde/src/playground/utils/xmlstat/XQTSCatalog.xml
File size(bytes): 8904307
-------------------------------------------
Elements:                  69062
Attributes:                170741
Non-Whitespace Text nodes: 32936
Whitespace Text nodes:     80810
Total text-node count:     113746
Comments:                  0
Processing Instructions:   2
Total:                     239805
-------------------------------------------
XQueryXQueryOffsetPath                                                        1
{http://www.w3.org/2005/02/query-test-XQTSCatalog}contextItem                 2
{http://www.w3.org/2005/02/query-test-XQTSCatalog}scenario                    4
{http://www.w3.org/2005/02/query-test-XQTSCatalog}input-document              5
{http://www.w3.org/2005/02/query-test-XQTSCatalog}role                        6
schema                                                                        8
{http://www.w3.org/2005/02/query-test-XQTSCatalog}citation-spec               9
input-document                                                                11
{http://www.w3.org/2005/02/query-test-XQTSCatalog}note                        14
{http://www.w3.org/2005/02/query-test-XQTSCatalog}input-URI                   23
{http://www.w3.org/2005/02/query-test-XQTSCatalog}context-property            30
{http://www.w3.org/2005/02/query-test-XQTSCatalog}implementation-defined-item 37
namespace                                                                     43
{http://www.w3.org/2005/02/query-test-XQTSCatalog}input-query                 44
{http://www.w3.org/2005/02/query-test-XQTSCatalog}source                      49
{http://www.w3.org/2005/02/query-test-XQTSCatalog}module                      58
FileName                                                                      75
ID                                                                            77
featureOwner                                                                  94
last-mod                                                                      138
{http://www.w3.org/2005/02/query-test-XQTSCatalog}test-group                  358
{http://www.w3.org/2005/02/query-test-XQTSCatalog}title                       359
{http://www.w3.org/2005/02/query-test-XQTSCatalog}expected-error              1141
{http://www.w3.org/2005/02/query-test-XQTSCatalog}output-file                 9638
{http://www.w3.org/2005/02/query-test-XQTSCatalog}test-case                   10519
is-XPath2                                                                     10544
date                                                                          10563
Creator                                                                       10585
{http://www.w3.org/2005/02/query-test-XQTSCatalog}input-file                  10636
variable                                                                      10703
{http://www.w3.org/2005/02/query-test-XQTSCatalog}description                 11380
section-title                                                                 13823
spec                                                                          13860
role                                                                          20303
name                                                                          21540
Total name count:                                                             71
-------------------------------------------
{http://www.w3.org/2005/02/query-test-XQTSCatalog}
{http://www.w3.org/2001/XMLSchema-instance}xsi
Total bindings count: 2
-------------------------------------------
Non-Whitespace chars: 1346407
Whitespace chars:     1706529

The output is raw and its labeling blunt and confusing, but there are some things one can see:

  • The document has 71 distinct qualified names in total, but they occur 239,803 times (element count plus attribute count), almost a quarter of a million. This is partly why name pools are popular both in relational storage models and among tree implementors: storing each name once and referring to it by a small identifier reduces memory usage considerably.
  • The attribute count is more than twice the element count (170,741 versus 69,062). That puts things in perspective, at least for me. The ratio is a bit off compared with the broader picture, but not by far, as we shall see.
  • Whitespace, which is basically always formatting, outweighs actual content: there are more whitespace-only text nodes than text nodes with content (80,810 versus 32,936), and more whitespace characters than non-whitespace ones (Whitespace chars versus Non-Whitespace chars).
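
To make the name pool point concrete, here is a minimal sketch in Python. It is not how xmlstat does it (xmlstat is built on QtCore and QtXml); the NamePool class and the count_names function are hypothetical names invented for illustration. Each distinct qualified name is stored once and mapped to a small integer, and occurrences are counted per identifier. Note that in this sketch an element and an attribute with the same qualified name share one identifier, which may differ from how xmlstat counts.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


class NamePool:
    """Maps each distinct qualified name to a small integer id (hypothetical helper)."""

    def __init__(self):
        self._ids = {}
        self._names = []

    def intern(self, name):
        # Return the existing id, or allocate the next one for a new name.
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name(self, name_id):
        return self._names[name_id]

    def __len__(self):
        return len(self._names)


def count_names(source):
    """Return (pool, counts) for all element and attribute names in source."""
    pool = NamePool()
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("start",)):
        counts[pool.intern(elem.tag)] += 1
        # Namespace declarations (xmlns=...) are not reported as attributes.
        for attr_name in elem.attrib:
            counts[pool.intern(attr_name)] += 1
    return pool, counts


sample = b'<r xmlns="urn:x"><a id="1"/><a id="2"/><b id="3">t</b></r>'
pool, counts = count_names(io.BytesIO(sample))
total = sum(counts.values())
print(len(pool), "distinct names,", total, "occurrences")
```

A tree that stores the integer instead of the full name string pays for each name only once, which is where the savings come from when 71 names account for 239,803 nodes.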

These “conclusions” aren’t that interesting since they’re based on only one document, but I nevertheless state them because they roughly mimic what The XML Web: a First Study says.

That paper contains statistics on XML documents found on the web, but for these purposes it is relatively old (from 2003). For that reason I would not be surprised if the numbers have changed since then, considering the widespread use of feeds, for instance. Another point is that their statistics are in some cases quite heavily influenced by individual sites (such as http://rpmfind.net/ or http://w3.org/), which to me suggests that the data sample was too small.

Here’s a very concentrated list of their conclusions:

  • “WAP and RDF make up 26% and 17% of all document on the XML Web, respectively.”
  • “Our results show that XML documents are in fact relatively shallow: 99% of them have less than 8 levels of element nesting. Also, 15% of the documents we analyzed have recursive content, in which there is much regularity.”
  • “Only 75 different DTDs are referenced in our sample”(which is about 200,000 documents). “92% of all DTD references are made to norms 1.1 or 1.2 of the WAP protocol.”
  • “Only 0.09% of the documents use either the attribute label SchemaLocation or noNameSpaceSchemaLocation” (but as in the case with DTDs, not referencing a schema from the document is not equivalent to not using a schema for that document).
  • “For documents up to 4096 bytes, the number of element nodes dominates the distributions”
  • “For documents larger than 4096 bytes, there are proportionally more attribute nodes than element nodes (51.13% vs. 37.83%).”
  • “These observations led us to conclude that the structural information found in XML documents is in fact dominant over the textual content.”
  • “It turns out that 782,602 elements(5% in total) have mixed content. Surprisingly, these elements belong to 138,298 documents(72% of all documents).”
  • “The prevailing assumption in this community[database community] is that attributes and mixed element content are not as important as element content.”
  • “99% of the documents have fewer than 8 levels[level refers to tree depth]. The average depth is 4, and the deepest document has 135 levels.”
  • “On average, the second level contains more attributes than any other level. In fact, 89% of all attribute are found in the first 3 levels of the documents.”
  • “77% of all element nodes and 6% of all text are found in the first 3 levels of the documents.”
  • “28,208 XML document(14.81% of the total) contain recursive elements.”
  • “The average document size is 4kb”

I find statistics like these very useful and I believe they can play an important role in discussing implementation approaches.

I would surely not mind a second such study. Perhaps it should use a larger document sample so that it is not thrown off by individual sites. It would also be nice to see the distribution of encodings used, and the relationship between whitespace-only and regular text nodes.
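
For the whitespace question, a per-document measurement could look roughly like the following Python sketch. It is an approximation under stated assumptions, not xmlstat's code: text_stats is a hypothetical function, whitespace means the four XML whitespace characters, and ElementTree's .text and .tail are each treated as one text node (ElementTree drops comments by default, so text on either side of a comment is merged).

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"


def text_stats(xml_text):
    """Count whitespace-only and other text nodes, plus character totals."""
    stats = {"ws_nodes": 0, "content_nodes": 0, "ws_chars": 0, "content_chars": 0}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # .text is the text before the first child, .tail the text that
        # follows the element's end tag; each is treated as one text node.
        for text in (elem.text, elem.tail):
            if not text:
                continue
            ws = sum(1 for ch in text if ch in XML_WHITESPACE)
            if ws == len(text):
                stats["ws_nodes"] += 1
            else:
                stats["content_nodes"] += 1
            stats["ws_chars"] += ws
            stats["content_chars"] += len(text) - ws
    return stats


doc = "<a>\n  <b>hello world</b>\n  <c/>\n</a>"
stats = text_stats(doc)
print(stats)
```

Even in this tiny indented document, three of the four text nodes are pure formatting, which is the same pattern the XQTS catalog shows on a larger scale.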

And of course, xmlstat could be a lot more useful. Ideally it would produce an XHTML page with bar charts describing name distributions, node type distributions, concentrations in relation to depth, and so on. That would help with making implementation decisions. It wouldn’t surprise me if it were useful for things like XSLT debugging as well.
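
As a rough idea of what a depth report could look like, here is a small Python sketch that counts elements per nesting level and renders a plain-text bar chart as a stand-in for the XHTML output. None of this exists in xmlstat; elements_per_depth and bar_chart are hypothetical names, and the root element is assumed to be at depth 1.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def elements_per_depth(source):
    """Return a Counter mapping depth (root = 1) to element count."""
    per_depth = Counter()
    depth = 0
    for event, _elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            per_depth[depth] += 1
        else:
            depth -= 1
    return per_depth


def bar_chart(per_depth, width=40):
    """Render one line of '#' per depth, scaled to the largest count."""
    largest = max(per_depth.values())
    lines = []
    for depth in sorted(per_depth):
        # Every non-empty level gets at least one '#'.
        bar = "#" * max(1, per_depth[depth] * width // largest)
        lines.append(f"depth {depth}: {bar} {per_depth[depth]}")
    return "\n".join(lines)


sample = b"<a><b><c/><c/></b><b><c/></b></a>"
histogram = elements_per_depth(io.BytesIO(sample))
print(bar_chart(histogram, width=6))
```

The same loop could count attributes or text per level instead of elements, which is the kind of figure the study reports for the first three levels.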

In any case, feel free to use or improve this simple and primitive tool. It’s in KDE’s SVN repository under playground/utils/xmlstat, is licensed under the GNU GPL, and is based on QtCore, QtXml and qmake.

One Response to “xmlstat”

  1. Incelitte Says:

    hm. love it 🙂

