Generating Fast Validating XML Processors

(Extended Abstract)

Kristoffer Rose
LIP, ENS-Lyon
krisrose@debian.org

Abstract

We present FleXML, a program that generates fast validating XML processors from `self-contained' XML DTDs. It uses the flex (lexical analyser generator) program to translate the DTD into a finite automaton enriched with a stack with the `element context'. This means that the XML processor will act directly on each character received. The program is freely redistributable and modifyable (under GNU `copyleft').

Keywords

Validating XML, DTD, lexical analysis, finite automata.

Overview

The `X' of XML stands for Extensible [XML]. This signifies that each and every XML document specifies in its header the details of the format that it will use and may change its format a bit relative to the used base format.

This has influenced the tools available to write validating XML processors: they use a call-back model where the XML processor passes strings with the tags and attributes names and values to the application. These call-backs must be generic because one cannot know whether a document will start by extending its own notation with more tags and attributes. For well-formed but non-validated XML documents this makes a lot of sense, of course, but we would in general like to exploit the information in the DTD for optimizations.

In particular, for many applications a fixed format suffices in the sense that a single DTD is used without individual extensions for a large number of documents. In that case we can do much better because the possible tags and attributes are static.

We have implemented an XML processor generator using the Flex scanner generator that produces deterministic finite automata [ASU]. Which means that there is almost no overhead for XML processing: the generated XML processors read the XML document character by character and can immediately dispatch the actions associated with each element (or reject the document as invalid).

Furthermore we have devised a simple extension of the C programming language that facilitates the writing of `element actions' in C, making it easy to write really fast XML applications. In particular we represent XML attribute values efficiently in C when this is possible, thus avoiding the otherwise ubiquitous conversions between strings and data values.

FleXML is available for free (from SourceForge). In this paper we present FleXML through an elaborated example and discuss some of the technical issues.

What can it do?

Assume that we have an XML document my-joke.xml containing the classical joke

<!DOCTYPE joke SYSTEM "my.dtd">
<joke><line>My appartment is so small</line> <suspense/>
<line type='punch-line'>the mice are round-shouldered</line></joke>

(and many others like it, of course). We wish to create an XML processor to validate it with respect to the DTD in the file my.dtd containing

<!-- my.dtd: Small DTD for jokes (just for fun). -->
<!ELEMENT joke (line,(line|suspense)*)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line type (normal|question|punch-line) 'normal'>
<!ELEMENT suspense EMPTY>

and, furthermore, we wish to write an XML application for displaying such messages in an amusing way.

With FleXML this can be done by creating an `actions file' my-show.act which implements the desired actions for each element. The remainder of this section explains the contents of such an actions file.

An actions file is itself an XML document which must begin with

<!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>

(the flexml-act.dtd DTD is part of the FleXML system and is reproduced in the manual page.

We decide that our application should react to a line element by printing the text inside it, and that it should differentiate between the three possible `type' attribute values by printing corresponding trailing punctuation.

This introduces a slight complication, because the attribute values are available when parsing the start tag whereas the element text is not available until we parse the end tag (where it has been read).

This means that we must declare a top-level variable.

<top><![CDATA[
char* terminator = ".";
]]></top>

Notice how we use CDATA sections to make sure that all characters (including white-space) are passed unmodified to the C compiler.

With this we can write the action to set it when reading the line start tag as

<start tag='line'><![CDATA[
  switch ( {type} ) {
    case {!type}: terminator = "..."; break;
    case {type=normal}: terminator = "."; break;
    case {type=question}: terminator = "??"; break;
    case {type=punch-line}: terminator = "!!"; break;
  }
]]></start>

The idea is that the enumeration attribute type is implemented in C as if it had been declared by

enum { {!type}, {type=normal}, {type=question}, {type=punch-line} } {type};

(understanding the {...} units as C identifiers), hence the possibility of using the fast C switch statement to pick the right choice directly. Note that the first choice, {!type}, corresponds to not setting the attribute; in this example the attribute has a default value so this can never happen, however, we include the choice anyway to prevent the C compiler from issuing warnings about missing choices in switch statements.

With this in place we can write the action for </line>. Since it prints something, however, we first need to add the appropriate header inclusion.

<top><![CDATA[
#include <stdio.h>
]]></top>

<end tag='line'><![CDATA[
  printf("%s%s\n", pcdata, terminator);
]]></end>

Finally, we will make the application amusing by `displaying' the <suspense/> empty tag as a short delay; this also involves a header inclusion.

<top><![CDATA[
#include <unistd.h>
]]></top>

<start tag='suspense'><![CDATA[
  sleep(2);
]]></start>

That is all; the only remaining thing is to terminate the action file properly.

</actions>

We can now run FleXML with the DTD and the actions file as input and will get an XML application as output that, when run (after processing by flex and a C compiler), will indeed print

My appartment is so small.
the mice are round-shouldered!!

as expected, pausing duly for two seconds between the lines. On the authors system the above output was achieved with the command sequence

flexml -A -a my-show.act my.dtd 
flex -omy-show.c my-show.l
cc -omy-show my-show.c
./my-show <./my-joke.xml

(see the manual page for the exact meaning of the FleXML options).

An important aspect of the design of FleXML is that the only thing that should matter to the programmer should be the complexity of the application, not of the used DTD. As an example the following action file prints the href attribute of all hyperlinks in an XHTML document:

<!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>

<top><![CDATA[
#include <stdio.h>
]]></top>

<start tag='a'><![CDATA[
if ({href}) printf("%s\n", {href});
]]></start>

</actions>

which was compiled into a running application on the author's system with the commands

flexml $(FLEXDEBUG) -rhtml -p'-//IETF//DTD XHTML 1.0 Transitional//EN' \
  'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
gcc -Wall -ansi -pedantic -O2 -g -c xhtml-href.c -o xhtml-href.o
flex -Bsv -Ca -oxhtml1-transitional.c xhtml1-transitional.l
gcc -Wall -ansi -pedantic -O2 -g   -c xhtml1-transitional.c -o xhtml1-transitional.o
gcc -Wall -ansi -pedantic   xhtml-href.o xhtml1-transitional.o   -o xhtml-href

generating the XML processor directly from the official DTD on the web (which in fact required a patch to flex to enlarge the possible table size).

How does it work?

FleXML is a perl script [Perl] which reads and interprets a DTD and subsequently produces an XML processor source file for the flex scanner and optionally an XML application with the C source of the element actions from an actions file. The DTD is used to construct the rules used by flex to match the individual XML components in such a way that only valid documents match.

For example, the flex code for scanning an attribute declaration of the line tag is the following:

<AL_line>{
 "type"{Eq}"'normal'"  |
 "type"{Eq}"\"normal\""  A_line_type = A_line_type_normal;
 "type"{Eq}"'question'"  |
 "type"{Eq}"\"question\""  A_line_type = A_line_type_question;
 "type"{Eq}"'punch-line'"  |
 "type"{Eq}"\"punch-line\""  A_line_type = A_line_type_punch_d_line;

 ">" {
  LEAVE; STag_line(); pcdata = BUFFERSET(pcdata); ENTER(IN_line);
 }
 "/>" {
  LEAVE; STag_line(); pcdata = ""; ETag_line();
  switch (YY_START) {
   case ROOT_line: SET(EPILOG); break;
   case S_joke: SET(S_joke_1); break;
   case S_joke_1: case S_joke_2: case S_joke_3: SET(S_joke_3); break;
}}}

(with {Eq} an alias for the regular expression matching an equal sign (corresponding to production `[25] Eq' of the XML specification).

This reads as follows: when the XML processor is reading the attribute list of the line tag, i.e., when it is in the <AL_line> state, a `t' will enter an internal state where a `y' proceeds to another internal state but other characters makes the document invalid (because no rule permits it). Once the equal sign has been scanned, the next characters determine the attribute value, and at the end one of the flex actions is performed, setting the attribute value (A_line_type is the real C for what we wrote as {type}, etc.). The important thing is that one can ensure by careful tuning of the flex rules that a valid document will proceed only by looking each character up in a table and determining the subsequent `state' and `action'. One must avoid pairs of rules such as

"-->"		LEAVE(COMMENT);
.		SKIP;

(a single `.' matches any character) because they mean that the scanner will not be sure after having read a `-' character whether it is part of a comment terminator or `just' a dash. In such cases an extra rule must be inserted because for the set

"-->"		LEAVE(COMMENT);
"--"            |
.		SKIP;

the problem goes away.

After the actual attribute rules, two rules handle termination of the attribute list. There are two cases corresponding to whether we just read a start tag or an empty element. In case it was a start tag then we must enter the `inner' mode of the element called IN_line for the line element. The switch handles the state changes needed for the line element resulting from the fact that the element can appear in different contexts. This is always possible to construct because of the requirement that an XML DTD must be deterministic: we just need an element content stack (this is what the LEAVE and ENTER macros are for).

Why is it useful?

In comparison with the forthcoming XML Style-sheet Language [XSL] our approach is much more primitive for better and worse: only a limited range of applications can be produced with FleXML but those will be very fast.

This is useful for XML applications that are meant to process a large number of documents in a fixed format. One such application is the NTSys u-payment transaction server which is implemented as an Apache module where performance is of premium importance. Using FleXML permits a highly modular development of modules for the various transaction types based on a common DTD and a collection of applications that are generated separately and all linked together with the common processor.

FleXML is still under development: please try it out (either from SourceForge or from the Debian GNU/Linux distribution where FleXML is include from release 2.2. The author would welcome comments as to how the system can best evolve. Two things that are definitely on the agenda is a limited `context condition' language for expressing constraints on the position of an element in a document (corresponding to the Xpath subset of XSL), and an efficient way to combine several DTDs into one large t facilitate general XML servers (that can even dispatch to a generic XML interface in cases where the FleXML restrictions are not respected by a document).

Acknowledgements

I am grateful to NTSys for supporting the development of FleXML. Finally extend my sincere thanks to Jef Poskanzer, Vern Paxson, and the rest of the flex maintainers for a great tool.

References

ASU: Alfred Aho, Ravi Sethi and Jeffrey Ullman: Compilers: Principles, Techniques and Tools, Addison-Wesley (1986).
Flex: Jef Poskanzer, Vern Paxson, et. al.: Flex - fast lexical analyzer generator.
Perl: Larry Wall, Perl - Practical Extraction and Report Language.
XML: Extensible Markup Language (XML) 1.0 (W3C Recommendation REC-xml-1998-0210).
XSL: Extensible Stylesheet Language (XSL) (W3C Working Draft).