Chapter 3. Generalized LR Parsing

Table of Contents

3.1. Introduction
3.2. Basic use of a Happy-generated GLR parser
3.2.1. Overview
3.2.2. The main function
3.2.3. The input
3.2.4. The Parse Result
3.2.5. Compiling the parser
3.3. Including semantic results
3.3.1. Forms of semantics
3.3.2. Tree decoding
3.3.3. Label decoding
3.3.4. Monadic tree decoding
3.4. Further information
3.4.1. The GLR examples
3.4.2. Viewing forests as graphs
3.4.3. Some Applications of GLR parsing
3.4.4. Technical details
3.4.5. The --filter option
3.4.6. Limitations and future work
3.4.7. Thanks and acknowledgements

This chapter explains how to use the GLR parsing extension, which allows Happy to parse ambiguous grammars and produce useful results. This extension is triggered with the --glr flag, which causes Happy to use a different driver for the LALR(1) parsing tables. The result of parsing is a structure which encodes compactly all of the possible parses. There are two options for how semantic information is combined with the structural information.

This extension was developed by Paul Callaghan and Ben Medlock (University of Durham). It is based on the structural parser implemented in Medlock's undergraduate project, but significantly extended and improved by Callaghan. Bug reports, comments, questions etc should be sent to . Further information can be found on Callaghan's GLR parser page.

3.1. Introduction

Here's an ambiguous grammar. It has no information about the associativity of +, so for example, 1+2+3 can be parsed as (1+(2+3)) or ((1+2)+3). In conventional mode, Happy, would complain about a shift/reduce conflict, although it would generate a parser which always shifts in such a conflict, and hence would produce only the first alternative above.

      E -> E + E
      E -> i       -- any integer
      

GLR parsing will accept this grammar without complaint, and produce a result which encodes both alternatives simultaneously. Now consider the more interesting example of 1+2+3+4, which has five distinct parses -- try to list them! You will see that some of the subtrees are identical. A further property of the GLR output is that such sub-results are shared, hence efficiently represented: there is no combinatorial explosion. Below is the simplified output of the GLR parser for this example.

	Root (0,7,G_E)
	(0,1,G_E)     => [[(0,1,Tok '1'))]]
	(0,3,G_E)     => [[(0,1,G_E),(1,2,Tok '+'),(2,3,G_E)]]
	(0,5,G_E)     => [[(0,1,G_E),(1,2,Tok '+'),(2,5,G_E)]
	                 ,[(0,3,G_E),(3,4,Tok '+'),(4,5,G_E)]]
	(0,7,G_E)     => [[(0,3,G_E),(3,4,Tok '+'),(4,7,G_E)]
	                 ,[(0,1,G_E),(1,2,Tok '+'),(2,7,G_E)]
	                 ,[(0,5,G_E),(5,6,Tok '+'),(6,7,G_E)]}]
	(2,3,G_E)     => [[(2,3,Tok '2'))]}]
	(2,5,G_E)     => [[(2,3,G_E),(3,4,Tok '+'),(4,5,G_E)]}]
	(2,7,G_E)     => [[(2,3,G_E),(3,4,Tok '+'),(4,7,G_E)]}
	                 ,[(2,5,G_E),(5,6,Tok '+'),(6,7,G_E)]}]
	(4,5,G_E)     => [[(4,5,Tok '3'))]}]
	(4,7,G_E)     => [[(4,5,G_E),(5,6,Tok '+'),(6,7,G_E)]}]
	(6,7,G_E)     => [[(6,7,Tok '4'))]}]
      

This is a directed, acyclic and-or graph. The node "names" are of form (a,b,c) where a and b are the start and end points (as positions in the input string) and c is a category (or name of grammar rule). For example (2,7,G_E) spans positions 2 to 7 and contains analyses which match the E grammar rule. Such analyses are given as a list of alternatives (disjunctions), each corresponding to some use of a production of that category, which in turn are a conjunction of sub-analyses, each represented as a node in the graph or an instance of a token.

Hence (2,7,G_E) contains two alternatives, one which has (2,3,G_E) as its first child and the other with (2,5,G_E) as its first child, respectively corresponding to sub-analyses (2+(3+4)) and ((2+3)+4). Both alternatives have the token + as their second child, but note that they are difference occurrences of + in the input! We strongly recommend looking at such results in graphical form to understand these points. If you build the expr-eval example in the directory examples/glr (NB you need to use GHC for this, unless you know how to use the -F flag for Hugs), running the example will produce a file which can be viewed with the daVinci graph visualization tool. (See http://www.informatik.uni-bremen.de/~davinci/ for more information. Educational use licenses are currently available without charge.)

The GLR extension also allows semantic information to be attached to productions, as in conventional Happy, although there are further issues to consider. Two modes are provided, one for simple applications and one for more complex use. See Section 3.3, “Including semantic results”. The extension is also integrated with Happy's token handling, e.g. extraction of information from tokens.

One key feature of this implementation in Haskell is that its main result is a graph. Other implementations effectively produce a list of trees, but this limits practical use to small examples. For large and interesting applications, some of which are discussed in Section 3.4.3, “Some Applications of GLR parsing”, a graph is essential due to the large number of possibilities and the need to analyse the structure of the ambiguity. Converting the graph to trees could produce huge numbers of results and will lose information about sharing etc.

One final comment. You may have learnt through using yacc-style tools that ambiguous grammars are to be avoided, and that ambiguity is something that appears only in Natural Language processing. This is definitely not true. Many interesting grammars are ambiguous, and with GLR tools they can be used effectively. We hope you enjoy exploring this fascinating area!