This section explains how to generate and use a GLR parser to produce structural results. Please check the examples for further information. Discussion of semantic issues comes later; see Section 3.3, “Including semantic results”.
The process of generating a GLR parser is broadly the same as for standard Happy. You write a grammar specification, run Happy on this to generate some Haskell code, then compile and link this into your program.
An alternative to using Happy directly is the BNF Converter tool by Markus Forsberg, Peter Gammie, Michael Pellauer and Aarne Ranta. This tool creates an abstract syntax, grammar, pretty-printer and other useful items from a single grammar formalism; it thus saves a lot of work and improves maintainability. The current output of BNFC can be used with GLR mode with just a few small changes, but from January 2005 we expect to have a fully-compatible version of BNFC.
Most of the features of Happy still work, but note the important points below.
The GLR parser is generated in TWO files, one for data and one for the driver. This is because the driver code needs to be optimized, but for large parsers with lots of data, optimizing the data tables too causes compilation to be too slow.
Given a file Foo.y, the file FooData.hs is generated with basic type information, the parser tables, and the header and tail code that was included in the parser specification. Note that Happy generates the module declaration line, so you should NOT give it in the grammar file. The driver is placed in the file Foo.hs, and does not contain any user-supplied text.
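As a rough sketch of the build flow, assuming GLR mode is selected with the --glr flag and GHC is the compiler:

    $ happy --glr Foo.y           # writes Foo.hs (driver) and FooData.hs (data)
    $ ghc --make Main.hs -o myprog   # Main imports the generated Foo module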
You can declare a lexer (and error token) with the %lexer directive as normal, but the generated parser does NOT call this lexer automatically. The action of the directive is only to export the lexer function to the top level. This is because some applications need finer control of the lexing process.
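For example, a declaration like the following (the function and token names are hypothetical) only re-exports myLexer from the generated module; your code must still run it itself to build the token lists that doParse expects:

    %lexer { myLexer } { EOFTok }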
Precedence and associativity declarations still work, but note how they take effect. The declarations are used in Happy's LR table creation to resolve certain conflicts: the actions implied by the declarations are retained, and the ones which clash with them are removed. The GLR back-end then produces code from these filtered tables, hence the rejected actions are never considered by the GLR parser. Declaring precedence and associativity is therefore still a good thing, since it avoids a certain amount of ambiguity that the user knows how to remove.
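For instance, standard declarations like these (for a hypothetical expression grammar) prune the conflicting actions from the tables before the GLR code is generated, so the parser never explores the rejected alternatives:

    %left '+' '-'
    %left '*' '/'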
There is some support for monadic parsers. The "tree decoding" mode (see Section 3.3.2, “Tree decoding”) can use the information given in the %monad declaration to monadify the decoding process. This is explained in more detail in Section 3.3.4, “Monadic tree decoding”. Note: the generated parsers don't include Ashley Yakeley's monad context information yet; it is currently just ignored. If this is a problem, email us and we'll make the changes required.
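As a reminder of the directive's form (the monad type here is illustrative; the bind and return combinators default to >>= and return when omitted):

    %monad { IO }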
The parser name directive (%name) has no effect at present. It will probably remain this way: if you want to control names, you could use a qualified import.
The generation of semantic code relies on type information given in the grammar specification. If you don't give an explicit signature, the type () is assumed. If you get type clashes mentioning (), you may need to add type annotations. Similarly, if you don't supply code for the semantic rule portion, then the value () is used.
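For example, a nonterminal can be given an explicit signature as below (the grammar fragment is hypothetical); without the :: { Int } line, its type would default to ():

    Exp :: { Int }
    Exp : Exp '+' Exp    { $1 + $3 }
        | int            { $1 }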
There is no attempt to implement the error symbol in grammars, or error recovery, yet. Any use of error in grammars is thus ignored, and parse errors will eventually cause the parse to fail.
The type used for tokens must be in the Ord type class (and hence in Eq), and it is recommended that it be in the Show class too. The ordering is required for the implementation of ambiguity packing. It may be possible to relax this requirement, but it is probably simpler just to require instances of the type classes. Please tell us if this is a problem.
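A minimal sketch of a conforming token type (the constructors are hypothetical):

    data Token = TNum Int | TIdent String | TPlus | TTimes
      deriving (Eq, Ord, Show)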
The driver file exports a function doParse :: [[UserDefTok]] -> GLRResult. If you are using several parsers, use qualified naming to distinguish them. UserDefTok is a synonym for the type declared with the %tokentype directive.

The input to doParse is a list of lists of token values. The outer level represents the sequence of input symbols, and the inner list represents ambiguity in the tokenisation of each input symbol. For example, the word "run" can be at least a noun or a verb, hence the inner list will contain at least two values. If your tokens are not ambiguous, you will need to convert each token to a singleton list before parsing.
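A minimal sketch of driving the parser on an unambiguous token stream; the module name Foo, the lexer myLexer and the token type Token are assumptions standing in for your own code:

    import qualified Foo                     -- the generated driver module

    -- Each token becomes a singleton inner list, since the
    -- tokenisation here is assumed to be unambiguous.
    parseTokens :: [Token] -> Foo.GLRResult
    parseTokens toks = Foo.doParse (map (:[]) toks)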
The parse result is expressed with the following types. A successful parse yields a forest (explained below) and a single root node for the forest. A parse may fail for one of two reasons: running out of input or a (global) parse error. A global parse error means that it was not possible to continue parsing any of the live alternatives; this is different from a local error, which simply means that the current alternative dies and we try some other alternative. In both error cases, the forest at failure point is returned, since it may contain useful information. Unconsumed tokens are returned when there is a global parse error.
    type ForestId = (Int,Int,GSymbol)
    data GSymbol  = <... automatically generated ...>
    type Forest   = FiniteMap ForestId [Branch]
    type RootNode = ForestId
    type Tokens   = [[(Int, GSymbol)]]
    data Branch   = Branch {b_sem :: GSem, b_nodes :: [ForestId]}
    data GSem     = <... automatically generated ...>
    data GLRResult
      = ParseOK    RootNode Forest    -- forest with root
      | ParseError Tokens   Forest    -- partial forest with bad input
      | ParseEOF            Forest    -- partial forest (missing input)
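A sketch of dispatching on the result, assuming Show instances are available for the generated types:

    report :: GLRResult -> String
    report r = case r of
      ParseOK root _forest -> "success, root node = " ++ show root
      ParseError toks _    -> "global parse error; remaining input: " ++ show toks
      ParseEOF _           -> "unexpected end of input"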
Conceptually, the parse forest is a directed, acyclic and-or graph. It is represented by a mapping of ForestIds to lists of possible analyses. The FiniteMap type is used to provide efficient and convenient access.

The ForestId type identifies nodes in the graph, named by the range of input they span and the category of analysis they license. GSymbol is generated automatically as a union of the names of grammar rules (prefixed by G_ to avoid name clashes) and of tokens and an EOF symbol. Tokens are wrapped in the constructor HappyTok :: UserDefTok -> GSymbol.

The Branch type represents a match for some right-hand side of a production, containing semantic information (see below) and a list of sub-analyses. Each of these is a node in the graph. Note that tokens are represented as childless nodes that span one input position. Empty productions will appear as childless nodes that start and end at the same position.
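A sketch of inspecting the forest: look up the alternatives recorded at a node and recurse into their children. The helper name is hypothetical; lookupFM is from the FiniteMap library used by the generated code.

    import Data.FiniteMap (lookupFM)
    import Data.Maybe (fromMaybe)

    -- All alternative analyses (or-branches) recorded at a node;
    -- each Branch lists its children (and-branches) in b_nodes.
    analysesAt :: Forest -> ForestId -> [Branch]
    analysesAt forest node = fromMaybe [] (lookupFM forest node)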
Happy will generate two files, and these should be compiled as normal Haskell files. If speed is an issue, then you should use the -O flags etc. with the driver code, and if feasible, with the parser tables too.

You can also use the --ghc flag to trigger certain GHC-specific optimizations. At present, this just causes the use of unboxed types in the tables and in some key code. Using this flag causes relevant GHC option pragmas to be inserted into the generated code, so you shouldn't have to use any strange flags (unless you want to...).
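For instance (a sketch; adjust to your own build setup):

    $ happy --glr --ghc Foo.y    # generate with GHC-specific optimisations
    $ ghc -O -c Foo.hs           # optimise the driver code
    $ ghc -c FooData.hs          # tables: add -O only if compile time permits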