Chapter 2. Using Happy

Table of Contents

2.1. Returning other datatypes
2.2. Parsing sequences
2.2.1. Sequences with separators
2.3. Using Precedences
2.3.1. How precedence works
2.3.2. Context-dependent Precedence
2.4. Type Signatures
2.5. Monadic Parsers
2.5.1. Handling Parse Errors
2.5.2. Threaded Lexers
2.5.2.1. Monadic productions with %lexer
2.5.3. Line Numbers
2.5.4. Summary
2.6. The Error Token
2.7. Generating Multiple Parsers From a Single Grammar

Users of Yacc will find Happy quite familiar. The basic idea is as follows:

Let's run through an example. We'll implement a parser for a simple expression syntax, consisting of integers, variables, the operators +, -, *, /, and the form let var = exp in exp. The grammar file starts off like this:

{
module Main where
}

At the top of the file is an optional module header, which is just a Haskell module header enclosed in braces. This code is emitted verbatim into the generated module, so you can put any Haskell code here at all. In a grammar file, Haskell code is always contained between curly braces to distinguish it from the grammar.

In this case, the parser will be a standalone program so we'll call the module Main.

Next comes a couple of declarations:

%name calc
%tokentype { Token }
%error { parseError }

The first line declares the name of the parsing function that Happy will generate, in this case calc. In many cases, this is the only symbol you need to export from the module.

The second line declares the type of tokens that the parser will accept. The parser (i.e. the function calc) will be of type [Token] -> T, where T is the return type of the parser, determined by the production rules below.

The %error directive tells Happy the name of a function it should call in the event of a parse error. More about this later.

Now we declare all the possible tokens:

%token
      let             { TokenLet }
      in              { TokenIn }
      int             { TokenInt $$ }
      var             { TokenVar $$ }
      '='             { TokenEq }
      '+'             { TokenPlus }
      '-'             { TokenMinus }
      '*'             { TokenTimes }
      '/'             { TokenDiv }
      '('             { TokenOB }
      ')'             { TokenCB }

The symbols on the left are the tokens as they will be referred to in the rest of the grammar, and to the right of each token enclosed in braces is a Haskell pattern that matches the token. The parser will expect to receive a stream of tokens, each of which will match one of the given patterns (the definition of the Token datatype is given later).

The $$ symbol is a placeholder that represents the value of this token. Normally the value of a token is the token itself, but by using the $$ symbol you can specify some component of the token object to be the value.

Like yacc, we include %% here, for no real reason.

%%

Now we have the production rules for the grammar.

Exp   : let var '=' Exp in Exp  { Let $2 $4 $6 }
      | Exp1                    { Exp1 $1 }

Exp1  : Exp1 '+' Term           { Plus $1 $3 }
      | Exp1 '-' Term           { Minus $1 $3 }
      | Term                    { Term $1 }

Term  : Term '*' Factor         { Times $1 $3 }
      | Term '/' Factor         { Div $1 $3 }
      | Factor                  { Factor $1 }

Factor
      : int                     { Int $1 }
      | var                     { Var $1 }
      | '(' Exp ')'             { Brack $2 }

Each production consists of a non-terminal symbol on the left, followed by a colon, followed by one or more expansions on the right, separated by |. Each expansion has some Haskell code associated with it, enclosed in braces as usual.

The way to think about a parser is with each symbol having a `value': we defined the values of the tokens above, and the grammar defines the values of non-terminal symbols in terms of sequences of other symbols (either tokens or non-terminals). In a production like this:

n   : t_1 ... t_n   { E }

whenever the parser finds the symbols t_1..t_n in the token stream, it constructs the symbol n and gives it the value E, which may refer to the values of t_1...t_n using the symbols $1...$n.

The parser reduces the input using the rules in the grammar until just one symbol remains: the first symbol defined in the grammar (namely Exp in our example). The value of this symbol is the return value from the parser.

To complete the program, we need some extra code. The grammar file may optionally contain a final code section, enclosed in curly braces.

{

All parsers must include a function to be called in the event of a parse error. In the %error directive earlier, we specified that the function to be called on a parse error is parseError:

parseError :: [Token] -> a
parseError _ = error "Parse error"

Note that parseError must be polymorphic in its return type a, which usually means it must be a call to error. We'll see in Section 2.5, “Monadic Parsers” how to wrap the parser in a monad so that we can do something more sensible with errors. It's also possible to keep track of line numbers in the parser for use in error messages, this is described in Section 2.5.3, “Line Numbers”.

Next we can declare the data type that represents the parsed expression:

data Exp
      = Let String Exp Exp
      | Exp1 Exp1
      deriving Show

data Exp1
      = Plus Exp1 Term
      | Minus Exp1 Term
      | Term Term
      deriving Show

data Term
      = Times Term Factor
      | Div Term Factor
      | Factor Factor
      deriving Show

data Factor
      = Int Int
      | Var String
      | Brack Exp
      deriving Show

And the data structure for the tokens...

data Token
      = TokenLet
      | TokenIn
      | TokenInt Int
      | TokenVar String
      | TokenEq
      | TokenPlus
      | TokenMinus
      | TokenTimes
      | TokenDiv
      | TokenOB
      | TokenCB
 deriving Show

... and a simple lexer that returns this data structure.

lexer :: String -> [Token]
lexer [] = []
lexer (c:cs)
      | isSpace c = lexer cs
      | isAlpha c = lexVar (c:cs)
      | isDigit c = lexNum (c:cs)
lexer ('=':cs) = TokenEq : lexer cs
lexer ('+':cs) = TokenPlus : lexer cs
lexer ('-':cs) = TokenMinus : lexer cs
lexer ('*':cs) = TokenTimes : lexer cs
lexer ('/':cs) = TokenDiv : lexer cs
lexer ('(':cs) = TokenOB : lexer cs
lexer (')':cs) = TokenCB : lexer cs

lexNum cs = TokenInt (read num) : lexer rest
      where (num,rest) = span isDigit cs

lexVar cs =
   case span isAlpha cs of
      ("let",rest) -> TokenLet : lexer rest
      ("in",rest)  -> TokenIn : lexer rest
      (var,rest)   -> TokenVar var : lexer rest

And finally a top-level function to take some input, parse it, and print out the result.

main = getContents >>= print . calc . lexer
}

And that's it! A whole lexer, parser and grammar in a few dozen lines. Another good example is Happy's own parser. Several features in Happy were developed using this as an example.

To generate the Haskell module for this parser, type the command happy example.y (where example.y is the name of the grammar file). The Haskell module will be placed in a file named example.hs. Additionally, invoking the command happy example.y -i will produce the file example.info which contains detailed information about the parser, including states and reduction rules (see Chapter 7, Info Files). This can be invaluable for debugging parsers, but requires some knowledge of the operation of a shift-reduce parser.

2.1. Returning other datatypes

In the above example, we used a data type to represent the syntax being parsed. However, there's no reason why it has to be this way: you could calculate the value of the expression on the fly, using productions like this:

Term  : Term '*' Factor         { $1 * $3 }
      | Term '/' Factor         { $1 / $3 }
      | Factor                  { $1 }

The value of a Term would be the value of the expression itself, and the parser could return an integer.

This works for simple expression types, but our grammar includes variables and the let syntax. How do we know the value of a variable while we're parsing it? We don't, but since the Haskell code for a production can be anything at all, we could make it a function that takes an environment of variable values, and returns the computed value of the expression:

Exp   : let var '=' Exp in Exp  { \p -> $6 (($2,$4 p):p) }
      | Exp1                    { $1 }

Exp1  : Exp1 '+' Term           { \p -> $1 p + $3 p }
      | Exp1 '-' Term           { \p -> $1 p - $3 p }
      | Term                    { $1 }

Term  : Term '*' Factor         { \p -> $1 p * $3 p }
      | Term '/' Factor         { \p -> $1 p `div` $3 p }
      | Factor                  { $1 }

Factor
      : int                     { \p -> $1 }
      | var                     { \p -> case lookup $1 p of
	                                    Nothing -> error "no var"
					    Just i  -> i }
      | '(' Exp ')'             { $2 }

The value of each production is a function from an environment p to a value. When parsing a let construct, we extend the environment with the new binding to find the value of the body, and the rule for var looks up its value in the environment. There's something you can't do in yacc :-)