Table of Contents
Users of Yacc will find Happy quite familiar. The basic idea is as follows:
Define the grammar you want to parse in a Happy grammar file.
Run the grammar through Happy, to generate a compilable Haskell module.
Use this module as part of your Haskell program, usually in conjunction with a lexical analyser (a function that splits the input into ``tokens'', the basic unit of parsing).
Let's run through an example. We'll implement a parser for a
simple expression syntax, consisting of integers, variables, the
operators +
, -
, *
,
/
, and the form let var = exp in exp
.
The grammar file starts off like this:
{ module Main where }
At the top of the file is an optional module header, which is just a Haskell module header enclosed in braces. This code is emitted verbatim into the generated module, so you can put any Haskell code here at all. In a grammar file, Haskell code is always contained between curly braces to distinguish it from the grammar.
In this case, the parser will be a standalone program so
we'll call the module Main
.
Next comes a couple of declarations:
%name calc %tokentype { Token } %error { parseError }
The first line declares the name of the parsing function
that Happy will generate, in this case
calc
. In many cases, this is the only symbol you need
to export from the module.
The second line declares the type of tokens that the parser
will accept. The parser (i.e. the function
calc
) will be of type [Token] ->
T
, where T
is the return type of the
parser, determined by the production rules below.
The %error
directive tells Happy the name
of a function it should call in the event of a parse error. More
about this later.
Now we declare all the possible tokens:
%token let { TokenLet } in { TokenIn } int { TokenInt $$ } var { TokenVar $$ } '=' { TokenEq } '+' { TokenPlus } '-' { TokenMinus } '*' { TokenTimes } '/' { TokenDiv } '(' { TokenOB } ')' { TokenCB }
The symbols on the left are the tokens as they will be
referred to in the rest of the grammar, and to the right of each
token enclosed in braces is a Haskell pattern that matches the
token. The parser will expect to receive a stream of tokens, each
of which will match one of the given patterns (the definition of
the Token
datatype is given later).
The $$
symbol is a placeholder that
represents the value of this token. Normally the value
of a token is the token itself, but by using the
$$
symbol you can specify some component
of the token object to be the value.
Like yacc, we include %%
here, for no real
reason.
%%
Now we have the production rules for the grammar.
Exp : let var '=' Exp in Exp { Let $2 $4 $6 } | Exp1 { Exp1 $1 } Exp1 : Exp1 '+' Term { Plus $1 $3 } | Exp1 '-' Term { Minus $1 $3 } | Term { Term $1 } Term : Term '*' Factor { Times $1 $3 } | Term '/' Factor { Div $1 $3 } | Factor { Factor $1 } Factor : int { Int $1 } | var { Var $1 } | '(' Exp ')' { Brack $2 }
Each production consists of a non-terminal
symbol on the left, followed by a colon, followed by one or more
expansions on the right, separated by |
. Each expansion
has some Haskell code associated with it, enclosed in braces as
usual.
The way to think about a parser is with each symbol having a `value': we defined the values of the tokens above, and the grammar defines the values of non-terminal symbols in terms of sequences of other symbols (either tokens or non-terminals). In a production like this:
n : t_1 ... t_n { E }
whenever the parser finds the symbols t_1..t_n
in
the token stream, it constructs the symbol n
and gives
it the value E
, which may refer to the values of
t_1...t_n
using the symbols
$1...$n
.
The parser reduces the input using the rules in the grammar
until just one symbol remains: the first symbol defined in the
grammar (namely Exp
in our example). The value of this
symbol is the return value from the parser.
To complete the program, we need some extra code. The grammar file may optionally contain a final code section, enclosed in curly braces.
{
All parsers must include a function to be called in the
event of a parse error. In the %error
directive earlier, we specified that the function to be called on
a parse error is parseError
:
parseError :: [Token] -> a parseError _ = error "Parse error"
Note that parseError
must be polymorphic
in its return type a
, which usually means it
must be a call to error
. We'll see in Section 2.5, “Monadic Parsers” how to wrap the parser in a monad so that we
can do something more sensible with errors. It's also possible to
keep track of line numbers in the parser for use in error
messages, this is described in Section 2.5.3, “Line Numbers”.
Next we can declare the data type that represents the parsed expression:
data Exp = Let String Exp Exp | Exp1 Exp1 deriving Show data Exp1 = Plus Exp1 Term | Minus Exp1 Term | Term Term deriving Show data Term = Times Term Factor | Div Term Factor | Factor Factor deriving Show data Factor = Int Int | Var String | Brack Exp deriving Show
And the data structure for the tokens...
data Token = TokenLet | TokenIn | TokenInt Int | TokenVar String | TokenEq | TokenPlus | TokenMinus | TokenTimes | TokenDiv | TokenOB | TokenCB deriving Show
... and a simple lexer that returns this data structure.
lexer :: String -> [Token] lexer [] = [] lexer (c:cs) | isSpace c = lexer cs | isAlpha c = lexVar (c:cs) | isDigit c = lexNum (c:cs) lexer ('=':cs) = TokenEq : lexer cs lexer ('+':cs) = TokenPlus : lexer cs lexer ('-':cs) = TokenMinus : lexer cs lexer ('*':cs) = TokenTimes : lexer cs lexer ('/':cs) = TokenDiv : lexer cs lexer ('(':cs) = TokenOB : lexer cs lexer (')':cs) = TokenCB : lexer cs lexNum cs = TokenInt (read num) : lexer rest where (num,rest) = span isDigit cs lexVar cs = case span isAlpha cs of ("let",rest) -> TokenLet : lexer rest ("in",rest) -> TokenIn : lexer rest (var,rest) -> TokenVar var : lexer rest
And finally a top-level function to take some input, parse it, and print out the result.
main = getContents >>= print . calc . lexer }
And that's it! A whole lexer, parser and grammar in a few dozen lines. Another good example is Happy's own parser. Several features in Happy were developed using this as an example.
To generate the Haskell module for this parser, type the
command happy example.y (where
example.y
is the name of the grammar file).
The Haskell module will be placed in a file named
example.hs
. Additionally, invoking the
command happy example.y -i will produce the
file example.info
which contains detailed information
about the parser, including states and reduction rules (see Chapter 7, Info Files). This can be invaluable for debugging
parsers, but requires some knowledge of the operation of a
shift-reduce parser.
In the above example, we used a data type to represent the syntax being parsed. However, there's no reason why it has to be this way: you could calculate the value of the expression on the fly, using productions like this:
Term : Term '*' Factor { $1 * $3 } | Term '/' Factor { $1 / $3 } | Factor { $1 }
The value of a Term
would be the value of the
expression itself, and the parser could return an integer.
This works for simple expression types, but our grammar
includes variables and the let
syntax. How do we know
the value of a variable while we're parsing it? We don't, but
since the Haskell code for a production can be anything at all,
we could make it a function that takes an environment of
variable values, and returns the computed value of the
expression:
Exp : let var '=' Exp in Exp { \p -> $6 (($2,$4 p):p) } | Exp1 { $1 } Exp1 : Exp1 '+' Term { \p -> $1 p + $3 p } | Exp1 '-' Term { \p -> $1 p - $3 p } | Term { $1 } Term : Term '*' Factor { \p -> $1 p * $3 p } | Term '/' Factor { \p -> $1 p `div` $3 p } | Factor { $1 } Factor : int { \p -> $1 } | var { \p -> case lookup $1 p of Nothing -> error "no var" Just i -> i } | '(' Exp ')' { $2 }
The value of each production is a function from an
environment p to a value. When parsing a
let
construct, we extend the environment with the new
binding to find the value of the body, and the rule for
var
looks up its value in the environment. There's
something you can't do in yacc
:-)