XProc is Interesting

October 18, 2006

If you’ve built solutions with XML technologies, you know the situation. You have a nice set of DocBook sources that XInclude pulls together, a set of schemata that then validates the result, and an XSL-T transform that writes out PDF and XHTML. All of this is done in a platform- and implementation-independent, safe way, except for the hacky script that glues the steps together.
That’s the problem. Content-oriented XML processing tends to turn into a sequence of steps. While each step is carried out elegantly, you still need to glue them together with some old-fashioned approach. I often write articles in DocBook, which Makefiles sprinkled over my private repository attempt to process. Others use bash scripts and other horribly non-portable “solutions.”

As one might guess, it is now appropriate to cue the cavalry music.

With the publication of the first draft of XProc: An XML Pipeline Language, there’s now a solution emerging for describing workflows. Norman Walsh has written a nice summary. Here is his rather self-explanatory example, reposted for convenience:

<p:pipeline xmlns:p="http://www.w3.org/2006/09/xproc"
            name="pipeline">
  <p:declare-input port="document"/>
  <p:declare-input port="schema"/>
  <p:declare-input port="stylesheet"/>
  <p:declare-output port="result" step="transform" source="result"/>

  <p:step name="xinclude" type="xinclude">
    <p:input port="document" step="pipeline" source="document"/>
  </p:step>

  <p:step name="validate" type="validate">
    <p:input port="document" step="xinclude" source="result"/>
    <p:input port="schema" step="pipeline" source="schema"/>
  </p:step>

  <p:step name="transform" type="xslt">
    <p:input port="document" step="validate" source="result"/>
    <p:input port="stylesheet" step="pipeline" source="stylesheet"/>
  </p:step>
</p:pipeline>

By invoking an implementation on a pipeline, the steps are carried out without fuss. Pipelines can be fairly advanced too, with conditionals and error handling. XProc should probably also bring nice performance improvements: since an implementation can see the whole process flow, it can reuse in-memory representations and skip costly serialization and tree-building steps.

I’ll probably have a look at implementing XProc once I’ve finished(tm) XSL-T and XQuery. It doesn’t seem that hard to write a conformant (but not necessarily fast and fancy) implementation. Hooking into KDE’s KIO, so that a strong I/O abstraction is achieved, would probably make XProc pipelines even more interesting.

I’ve just finished reviewing the spec, and if it weren’t for the trouble of subscribing to the public-xml-processing-model-comments@w3.org mailing list, I would send off the issues I’ve scribbled down.

Considering the age of the XProc draft, it has grown quickly in terms of size and maturity. Nevertheless, I see three large things that could use some thought.

Support OASIS XML Catalogs

An OASIS Catalog is essentially an XML file that specifies how to rewrite URIs to other URIs; it’s a redirect/abstraction layer. Catalogs allow documents (such as a DocBook source referencing a DTD, or an XSL-T stylesheet importing another stylesheet) to reference other documents at standardized locations, or through abstract URIs, and have those references rewritten into local URIs. For instance, the canonical URL of the DocBook DTD can be mapped to a copy on the local disk, so that no network access is needed.
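As a rough illustration of such a mapping, here is a minimal catalog sketch. The element names and namespace are those of OASIS XML Catalogs; the local file paths are assumptions and will differ between systems.

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Resolve system identifiers of the DocBook 4.4 DTD to a local copy.
       The file:/// prefix is an assumed install location. -->
  <rewriteSystem
      systemIdStartString="http://www.oasis-open.org/docbook/xml/4.4/"
      rewritePrefix="file:///usr/share/xml/docbook/schema/dtd/4.4/"/>

  <!-- Resolve URI references (for example xsl:import) to the DocBook
       stylesheets. Again, the local path is an assumption. -->
  <rewriteURI
      uriStartString="http://docbook.sourceforge.net/release/xsl/current/"
      rewritePrefix="file:///usr/share/xml/docbook/stylesheet/"/>
</catalog>
```

With a catalog like this in place, a document can keep referring to the standardized network locations while a catalog-aware processor reads the local copies.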

Catalogs can therefore speed things up, and they make documents more universal, since hard-coded file names break when processed on other setups. Norman has written a nice user-oriented piece on catalogs.

I think an absence of catalog support would quickly be noticed and would invite non-interoperable approaches. I’m not worried, though; I think that whatever the outcome, it will probably be a wise one. OASIS Catalogs aren’t alien to the working group (who has edited both XProc and OASIS Catalogs?).

I think catalogs could lead to interesting solutions, where they would act as a mechanism similar to C++’s templates. One could write a generic pipeline that takes only one argument: a catalog that turns the pipeline into a concrete implementation.
It wouldn’t surprise me if this turned out to be tricky to specify. Perhaps it would be a construct (like a group) whose rewrite rules apply to its contained components. Questions remain about what an OASIS Catalog actually affects: import and include statements in stylesheets? Calls to fn:document(), fn:collection() and similar functions? Should it be configurable? Having standardized what a catalog applies to would indeed be practical.
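To make the template analogy a little more concrete, here is a sketch of a catalog that could play the role of the single "argument". The uri element is standard OASIS Catalog syntax; the urn:x-example names and the target files are made up for illustration, and how a pipeline would be told to use the catalog is left open, since the draft does not define it.

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- The generic pipeline would refer only to the abstract names on the
       left; this catalog binds them to concrete resources.
       All names and paths here are hypothetical. -->
  <uri name="urn:x-example:pipeline:schema"
       uri="file:///home/me/schemas/article.rng"/>
  <uri name="urn:x-example:pipeline:stylesheet"
       uri="file:///home/me/xsl/article-to-xhtml.xsl"/>
</catalog>
```

Swapping in a different catalog would then give a different concrete pipeline without touching the pipeline document itself.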

The Data Model

XSL-T 2.0 and XQuery take a big step forward and (finally) introduce typed languages to the XML world. XProc is at this point a bit vague, and uses a data model reminiscent of a subset of XPath 1.0’s. I find this a pity, because it forces queries and stylesheets to sink to the same level as XProc by dropping type annotations. It’s counterproductive in the big picture.

I wonder whether XProc can “get away” with this in the long run. Once XProc 1.0 is released, won’t a more rigorous type system be asked for after a short period of time?

One approach is to require the XPath 2.0 data model (and its type system), but leave the import of user-defined types for a later version (if interest arises at all, of course). This makes it possible to use the basic types, which probably goes a long way. Instead of using the sequence="yes|no" attribute, which to me feels a bit ad hoc and limited, one would have an attribute named type, containing a SequenceType.
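A sketch of the difference, shown as fragments with the p prefix bound as in the pipeline earlier. The type attribute is the proposal made here and is not part of the draft; placing sequence on p:declare-input is also an assumption made for illustration.

```xml
<!-- Draft style (attribute placement assumed): the port either accepts
     a sequence of documents or it does not. -->
<p:declare-input port="document" sequence="yes"/>

<!-- Proposed style (hypothetical): an XPath 2.0 SequenceType states
     both the item type and the cardinality. -->
<p:declare-input port="document" type="document-node()*"/>
<p:declare-input port="schema" type="document-node()"/>
```

The occurrence indicators of SequenceType (?, * and +) would cover what sequence="yes|no" expresses, and leave room for more precise declarations.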

If support for the XPath 2.0 Data Model were made optional (probably along with the XQuery and XSL-T 2.0 steps, in that case), the only requirement could be to understand the types document-node() and document-node()*.
On the other hand, no one wants levels of conformance, and for good reasons. The problems of switching between XPath 1.0 and 2.0, as experienced with XSL-T 2.0 and XQuery, surely should be avoided.

When is it OK to reference the second generation of XPath and XSL-T? Will it be less problematic if a second version of XProc introduces compatibility issues, as opposed to simply going for XPath 2.0 directly?

Re-use More

It feels like XProc in a number of ways is too elaborate. For example, specifying input like this:

<p:input port="document" href="http://example.org/input.html" select="//html:div"/>

could be simplified by dropping the href attribute and writing:

<p:input port="document" select="document('http://example.org/input.html')//html:div"/>

But I presume it is as it is because XPath 1.0 doesn’t allow arbitrary expressions in node steps.

If it weren’t for the fact that XQuery can’t be made a requirement, one could simplify things further by dropping the href and src attributes, and instead letting the element content be an arbitrary XQuery query.
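Purely as a sketch of that idea (this content model is hypothetical and not in the draft), the input from the earlier example might then be written as:

```xml
<!-- Hypothetical: the element content is evaluated as an XQuery query,
     so neither href nor select is needed. -->
<p:input port="document">
  declare namespace html = "http://www.w3.org/1999/xhtml";
  doc("http://example.org/input.html")//html:div
</p:input>
```

The query itself is ordinary XQuery 1.0: a namespace declaration in the prolog, followed by a path expression that starts from fn:doc().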

Another thing is serialization: the standard component store is essentially a duplication of XSL-T’s serialization functionality. Serialization is of high importance to users, who often want to customize the output (indentation levels and what not), which leads to vendors implementing extensions.

By duplicating this serialization functionality, vendors will have to provide such extensions once more, delaying adoption and interoperability. A more compact approach would be to simply let users use an XSL-T step for writing files. That way feature creep is avoided, the spec is simplified, and there’s a wealth of methods to choose from: XSL-T 1.0, EXSLT and XSL-T 2.0. The question is whether the simple, typical scenario becomes too messy to write.
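For a feel of how messy the typical scenario would be, here is a minimal sketch of a stylesheet that such an XSL-T step could run just to write its input to a file. It assumes an XSLT 2.0 processor; the output path is made up.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Copy the incoming document unchanged into a file, using the
       serialization parameters that XSLT 2.0 already standardizes. -->
  <xsl:template match="/">
    <xsl:result-document href="output/result.xhtml"
                         method="xhtml"
                         indent="yes"
                         encoding="UTF-8">
      <xsl:copy-of select="."/>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>
```

That is about ten lines of stylesheet for what a dedicated store component would express in one element, which is the trade-off in question.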

One Response to “XProc is Interesting”

  1. jssilver Says:

    EMC has just released two new products that will help to advance the XProc spec: the first commercial XProc processing engine and the first graphical design tool for XML pipelines. Both are available free of charge for development purposes. http://developer.emc.com/xmltech

