This section explains how to generate and use a GLR parser to produce structural results. Please check the examples for further information. Discussion of semantic issues comes later; see Section 3.3, “Including semantic results”.
The process of generating a GLR parser is broadly the same as for standard Happy. You write a grammar specification, run Happy on this to generate some Haskell code, then compile and link this into your program.
An alternative to using Happy directly is the BNF Converter tool by Markus Forsberg, Peter Gammie, Michael Pellauer and Aarne Ranta. This tool creates an abstract syntax, grammar, pretty-printer and other useful items from a single grammar formalism; it thus saves a lot of work and improves maintainability. The current output of BNFC can be used with GLR mode with just a few small changes, but from January 2005 we expect to have a fully-compatible version of BNFC.
Most of the features of Happy still work, but note the important points below.
The GLR parser is generated in TWO files, one for data and one for the driver. This is because the driver code needs to be optimized, but for large parsers with lots of data, optimizing the data tables too causes compilation to be too slow.
Given a file Foo.y, the file FooData.hs is generated with basic type information, the parser tables, and the header and tail code that was included in the parser specification. Note that Happy generates the module declaration line, so you should NOT give it in the grammar file. The driver is placed in the file Foo.hs, and does not contain any user-supplied text.
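As a rough sketch of the build flow, assuming GLR mode is selected with the --glr flag and GHC is the compiler:

    $ happy --glr Foo.y           # writes Foo.hs (driver) and FooData.hs (data)
    $ ghc --make Main.hs -o myprog   # Main imports the generated Foo module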
You can declare a lexer (and error token) with the %lexer directive as normal, but the generated parser does NOT call this lexer automatically. The action of the directive is only to export the lexer function to the top level. This is because some applications need finer control of the lexing process.
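For example, a declaration like the following (the function and token names are hypothetical) only re-exports myLexer from the generated module; your code must still run it itself to build the token lists that doParse expects:

    %lexer { myLexer } { EOFTok }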
Precedence and associativity declarations still work, but note how they take effect. The declarations are used in Happy's LR table creation to resolve certain conflicts: the actions implied by the declarations are retained, and the ones which clash with them are removed. The GLR back-end then produces code from these filtered tables, hence the rejected actions are never considered by the GLR parser. Declaring precedence and associativity is therefore still a good thing, since it avoids a certain amount of ambiguity that the user knows how to remove.
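For instance, standard declarations like these (for a hypothetical expression grammar) prune the conflicting actions from the tables before the GLR code is generated, so the parser never explores the rejected alternatives:

    %left '+' '-'
    %left '*' '/'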
There is some support for monadic parsers. The "tree decoding" mode (see Section 3.3.2, “Tree decoding”) can use the information given in the %monad declaration to monadify the decoding process. This is explained in more detail in Section 3.3.4, “Monadic tree decoding”. Note: the generated parsers don't include Ashley Yakeley's monad context information yet; it is currently just ignored. If this is a problem, email us and we'll make the changes required.
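As a reminder of the directive's form (the monad type here is illustrative; the bind and return combinators default to >>= and return when omitted):

    %monad { IO }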
The parser name directive (%name) has no effect at present. It will probably remain this way: if you want to control names, you could use a qualified import.
The generation of semantic code relies on type information given in the grammar specification. If you don't give an explicit signature, the type () is assumed. If you get type clashes mentioning (), you may need to add type annotations. Similarly, if you don't supply code for the semantic rule portion, then the value () is used.
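For example, a nonterminal can be given an explicit signature as below (the grammar fragment is hypothetical); without the :: { Int } line, its type would default to ():

    Exp :: { Int }
    Exp : Exp '+' Exp    { $1 + $3 }
        | int            { $1 }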
There is no attempt to implement the error symbol in grammars, or error recovery, yet. Any use of error in grammars is thus ignored, and parse errors will eventually cause the parse to fail.
The type used for tokens must be in the Ord type class (and hence in Eq), and it is recommended that it be in the Show class too. The ordering is required for the implementation of ambiguity packing. It may be possible to relax this requirement, but it is probably simpler just to require instances of the type classes. Please tell us if this is a problem.
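A minimal sketch of a conforming token type (the constructors are hypothetical):

    data Token = TNum Int | TIdent String | TPlus | TTimes
      deriving (Eq, Ord, Show)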
The driver file exports a function doParse :: [[UserDefTok]] -> GLRResult. If you are using several parsers, use qualified naming to distinguish them. UserDefTok is a synonym for the type declared with the %tokentype directive.

The input to doParse is a list of lists of token values. The outer level represents the sequence of input symbols, and the inner list represents ambiguity in the tokenisation of each input symbol. For example, the word "run" can be at least a noun or a verb, hence the inner list will contain at least two values. If your tokens are not ambiguous, you will need to convert each token to a singleton list before parsing.
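A minimal sketch of driving the parser on an unambiguous token stream; the module name Foo, the lexer myLexer and the token type Token are assumptions standing in for your own code:

    import qualified Foo                     -- the generated driver module

    -- Each token becomes a singleton inner list, since the
    -- tokenisation here is assumed to be unambiguous.
    parseTokens :: [Token] -> Foo.GLRResult
    parseTokens toks = Foo.doParse (map (:[]) toks)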
The parse result is expressed with the following types. A successful parse yields a forest (explained below) and a single root node for the forest. A parse may fail for one of two reasons: running out of input or a (global) parse error. A global parse error means that it was not possible to continue parsing any of the live alternatives; this is different from a local error, which simply means that the current alternative dies and we try some other alternative. In both error cases, the forest at failure point is returned, since it may contain useful information. Unconsumed tokens are returned when there is a global parse error.
    type ForestId = (Int,Int,GSymbol)
    data GSymbol  = <... automatically generated ...>
    type Forest   = FiniteMap ForestId [Branch]
    type RootNode = ForestId
    type Tokens   = [[(Int, GSymbol)]]
    data Branch   = Branch {b_sem :: GSem, b_nodes :: [ForestId]}
    data GSem     = <... automatically generated ...>
    data GLRResult
      = ParseOK    RootNode Forest    -- forest with root
      | ParseError Tokens   Forest    -- partial forest with bad input
      | ParseEOF            Forest    -- partial forest (missing input)
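A sketch of dispatching on the result, assuming Show instances are available for the generated types:

    report :: GLRResult -> String
    report r = case r of
      ParseOK root _forest -> "success, root node = " ++ show root
      ParseError toks _    -> "global parse error; remaining input: " ++ show toks
      ParseEOF _           -> "unexpected end of input"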
Conceptually, the parse forest is a directed, acyclic and-or graph. It is represented by a mapping of ForestIds to lists of possible analyses. The FiniteMap type is used to provide efficient and convenient access.

The ForestId type identifies nodes in the graph, named by the range of input they span and the category of analysis they license. GSymbol is generated automatically as a union of the names of grammar rules (prefixed by G_ to avoid name clashes) and of tokens and an EOF symbol. Tokens are wrapped in the constructor HappyTok :: UserDefTok -> GSymbol.

The Branch type represents a match for some right-hand side of a production, containing semantic information (see below) and a list of sub-analyses. Each of these is a node in the graph. Note that tokens are represented as childless nodes that span one input position. Empty productions will appear as childless nodes that start and end at the same position.
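A sketch of inspecting the forest: look up the alternatives recorded at a node and recurse into their children. The helper name is hypothetical; lookupFM is from the FiniteMap library used by the generated code.

    import Data.FiniteMap (lookupFM)
    import Data.Maybe (fromMaybe)

    -- All alternative analyses (or-branches) recorded at a node;
    -- each Branch lists its children (and-branches) in b_nodes.
    analysesAt :: Forest -> ForestId -> [Branch]
    analysesAt forest node = fromMaybe [] (lookupFM forest node)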
Happy will generate two files, and these should be compiled as normal Haskell files. If speed is an issue, then you should use the -O flags etc. with the driver code, and if feasible, with the parser tables too.

You can also use the --ghc flag to trigger certain GHC-specific optimizations. At present, this just causes the use of unboxed types in the tables and in some key code. Using this flag causes relevant GHC option pragmas to be inserted into the generated code, so you shouldn't have to use any strange flags (unless you want to...).
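For instance (a sketch; adjust to your own build setup):

    $ happy --glr --ghc Foo.y    # generate with GHC-specific optimisations
    $ ghc -O -c Foo.hs           # optimise the driver code
    $ ghc -c FooData.hs          # tables: add -O only if compile time permits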