We present FleXML, a program that generates fast validating XML processors from `self-contained' XML DTDs. It uses the flex (lexical analyser generator) program to translate the DTD into a finite automaton enriched with a stack with the `element context'. This means that the XML processor will act directly on each character received. The program is freely redistributable and modifyable (under GNU `copyleft').
Validating XML, DTD, lexical analysis, finite automata.
The `X' of XML stands for Extensible [XML]. This signifies that each and every XML document specifies in its header the details of the format that it will use and may change its format a bit relative to the used base format.
This has influenced the tools available to write validating XML processors: they use a call-back model where the XML processor passes strings with the tags and attributes names and values to the application. These call-backs must be generic because one cannot know whether a document will start by extending its own notation with more tags and attributes. For well-formed but non-validated XML documents this makes a lot of sense, of course, but we would in general like to exploit the information in the DTD for optimizations.
In particular, for many applications a fixed format suffices in the sense that a single DTD is used without individual extensions for a large number of documents. In that case we can do much better because the possible tags and attributes are static.
We have implemented an XML processor generator using the Flex scanner generator that produces deterministic finite automata [ASU]. Which means that there is almost no overhead for XML processing: the generated XML processors read the XML document character by character and can immediately dispatch the actions associated with each element (or reject the document as invalid).
Furthermore we have devised a simple extension of the C programming language that facilitates the writing of `element actions' in C, making it easy to write really fast XML applications. In particular we represent XML attribute values efficiently in C when this is possible, thus avoiding the otherwise ubiquitous conversions between strings and data values.
FleXML is available for free (from SourceForge). In this paper we present FleXML through an elaborated example and discuss some of the technical issues.
Assume that we have an XML document my-joke.xml
containing the classical joke
<!DOCTYPE joke SYSTEM "my.dtd">
<joke><line>My appartment is so small</line> <suspense/>
<line type='punch-line'>the mice are round-shouldered</line></joke>
(and many others like it, of course). We wish to create an XML
processor to validate it with respect to the DTD in the file
my.dtd
containing
<!-- my.dtd: Small DTD for jokes (just for fun). -->
<!ELEMENT joke (line,(line|suspense)*)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line type (normal|question|punch-line) 'normal'>
<!ELEMENT suspense EMPTY>
and, furthermore, we wish to write an XML application for
displaying such messages in an amusing way.
With FleXML this can be done by creating an `actions file'
my-show.act
which implements the desired actions
for each element. The remainder of this section explains the
contents of such an actions file.
An actions file is itself an XML document which must begin with
<!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>
(the flexml-act.dtd
DTD is part of the FleXML
system and is reproduced in the manual page.
We decide that our application should react to a
line
element by printing the text inside it, and
that it should differentiate between the three possible `type'
attribute values by printing corresponding trailing punctuation.
This introduces a slight complication, because the attribute values are available when parsing the start tag whereas the element text is not available until we parse the end tag (where it has been read).
This means that we must declare a top-level variable.
<top><![CDATA[
char* terminator = ".";
]]></top>
Notice how we use CDATA
sections to make sure that
all characters (including white-space) are passed unmodified to
the C compiler.
With this we can write the action to set it when reading the
line
start tag as
<start tag='line'><![CDATA[
switch ( {type} ) {
case {!type}: terminator = "..."; break;
case {type=normal}: terminator = "."; break;
case {type=question}: terminator = "??"; break;
case {type=punch-line}: terminator = "!!"; break;
}
]]></start>
The idea is that the enumeration attribute type
is
implemented in C as if it had been declared by
enum { {!type}, {type=normal}, {type=question}, {type=punch-line} } {type};
(understanding the {...}
units as C identifiers),
hence the possibility of using the fast C switch
statement to pick the right choice directly. Note that the
first choice, {!type}
, corresponds to not
setting the attribute; in this example the attribute has a
default value so this can never happen, however, we include the
choice anyway to prevent the C compiler from issuing warnings
about missing choices in switch
statements.
With this in place we can write the action for
</line>
. Since it prints something, however,
we first need to add the appropriate header inclusion.
<top><![CDATA[
#include <stdio.h>
]]></top>
<end tag='line'><![CDATA[
printf("%s%s\n", pcdata, terminator);
]]></end>
Finally, we will make the application amusing by `displaying'
the <suspense/>
empty tag as a short delay;
this also involves a header inclusion.
<top><![CDATA[
#include <unistd.h>
]]></top>
<start tag='suspense'><![CDATA[
sleep(2);
]]></start>
That is all; the only remaining thing is to terminate the action file properly.
</actions>
We can now run FleXML with the DTD and the actions file as input and will get an XML application as output that, when run (after processing by flex and a C compiler), will indeed print
My appartment is so small.
the mice are round-shouldered!!
as expected, pausing duly for two seconds between the lines. On
the authors system the above output was achieved with the
command sequence
flexml -A -a my-show.act my.dtd
flex -omy-show.c my-show.l
cc -omy-show my-show.c
./my-show <./my-joke.xml
(see the manual page for the exact meaning of the FleXML options).
An important aspect of the design of FleXML is that the only
thing that should matter to the programmer should be the
complexity of the application, not of the used DTD. As
an example the following action file prints the
href
attribute of all hyperlinks in an XHTML
document:
<!DOCTYPE actions SYSTEM "flexml-act.dtd">
<actions>
<top><![CDATA[
#include <stdio.h>
]]></top>
<start tag='a'><![CDATA[
if ({href}) printf("%s\n", {href});
]]></start>
</actions>
which was compiled into a running application on the author's
system with the commands
flexml $(FLEXDEBUG) -rhtml -p'-//IETF//DTD XHTML 1.0 Transitional//EN' \
'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
gcc -Wall -ansi -pedantic -O2 -g -c xhtml-href.c -o xhtml-href.o
flex -Bsv -Ca -oxhtml1-transitional.c xhtml1-transitional.l
gcc -Wall -ansi -pedantic -O2 -g -c xhtml1-transitional.c -o xhtml1-transitional.o
gcc -Wall -ansi -pedantic xhtml-href.o xhtml1-transitional.o -o xhtml-href
generating the XML processor directly from the official DTD on
the web (which in fact required a patch to flex to enlarge the
possible table size).
FleXML is a perl script [Perl] which reads and interprets a DTD and subsequently produces an XML processor source file for the flex scanner and optionally an XML application with the C source of the element actions from an actions file. The DTD is used to construct the rules used by flex to match the individual XML components in such a way that only valid documents match.
For example, the flex code for scanning an attribute declaration
of the line
tag is the following:
<AL_line>{
"type"{Eq}"'normal'" |
"type"{Eq}"\"normal\"" A_line_type = A_line_type_normal;
"type"{Eq}"'question'" |
"type"{Eq}"\"question\"" A_line_type = A_line_type_question;
"type"{Eq}"'punch-line'" |
"type"{Eq}"\"punch-line\"" A_line_type = A_line_type_punch_d_line;
">" {
LEAVE; STag_line(); pcdata = BUFFERSET(pcdata); ENTER(IN_line);
}
"/>" {
LEAVE; STag_line(); pcdata = ""; ETag_line();
switch (YY_START) {
case ROOT_line: SET(EPILOG); break;
case S_joke: SET(S_joke_1); break;
case S_joke_1: case S_joke_2: case S_joke_3: SET(S_joke_3); break;
}}}
(with {Eq}
an alias for the regular expression
matching an equal sign (corresponding to production `[25] Eq' of
the XML specification).
This reads as follows: when the XML processor is reading the
attribute list of the line
tag, i.e., when it is in
the <AL_line>
state, a `t
' will
enter an internal state where a `y
' proceeds to
another internal state but other characters makes the document
invalid (because no rule permits it). Once the equal sign has
been scanned, the next characters determine the attribute value,
and at the end one of the flex actions is performed, setting the
attribute value (A_line_type
is the real C for what
we wrote as {type}
, etc.). The important thing is
that one can ensure by careful tuning of the flex rules that a
valid document will proceed only by looking each character up in
a table and determining the subsequent `state' and `action'.
One must avoid pairs of rules such as
"-->" LEAVE(COMMENT);
. SKIP;
(a single `.
' matches any character) because they
mean that the scanner will not be sure after having read a
`-
' character whether it is part of a comment
terminator or `just' a dash. In such cases an extra rule must be
inserted because for the set
"-->" LEAVE(COMMENT);
"--" |
. SKIP;
the problem goes away.
After the actual attribute rules, two rules handle termination
of the attribute list. There are two cases corresponding to
whether we just read a start tag or an empty element. In case
it was a start tag then we must enter the `inner' mode of the
element called IN_line
for the line
element. The switch
handles the state changes
needed for the line element resulting from the fact that the
element can appear in different contexts. This is always
possible to construct because of the requirement that an XML DTD
must be deterministic: we just need an element content
stack (this is what the LEAVE
and
ENTER
macros are for).
In comparison with the forthcoming XML Style-sheet Language [XSL] our approach is much more primitive for better and worse: only a limited range of applications can be produced with FleXML but those will be very fast.
This is useful for XML applications that are meant to process a large number of documents in a fixed format. One such application is the NTSys u-payment transaction server which is implemented as an Apache module where performance is of premium importance. Using FleXML permits a highly modular development of modules for the various transaction types based on a common DTD and a collection of applications that are generated separately and all linked together with the common processor.
FleXML is still under development: please try it out (either from SourceForge or from the Debian GNU/Linux distribution where FleXML is include from release 2.2. The author would welcome comments as to how the system can best evolve. Two things that are definitely on the agenda is a limited `context condition' language for expressing constraints on the position of an element in a document (corresponding to the Xpath subset of XSL), and an efficient way to combine several DTDs into one large t facilitate general XML servers (that can even dispatch to a generic XML interface in cases where the FleXML restrictions are not respected by a document).
I am grateful to NTSys for supporting the development of FleXML. Finally extend my sincere thanks to Jef Poskanzer, Vern Paxson, and the rest of the flex maintainers for a great tool.