Happy has support for threading a monad through the generated parser. This might be useful for several reasons:
Handling parse errors by using an exception monad (see Section 2.5.1, “Handling Parse Errors”).
Keeping track of line numbers in the input file, for example for use in error messages (see Section 2.5.3, “Line Numbers”).
Performing IO operations during parsing.
Parsing languages with context-dependencies (such as C) require some state in the parser.
Adding monadic support to your parser couldn't be simpler. Just add the following directive to the declaration section of the grammar file:
%monad { <type> } [ { <then> } { <return> } ]
where <type>
is the type constructor for
the monad, <then>
is the bind operation of the
monad, and <return>
is the return operation. If
you leave out the names for the bind and return operations,
Happy assumes that <type>
is an
instance of the standard Haskell type class Monad
and
uses the overloaded names for the bind and return
operations.
When this declaration is included in the grammar,
Happy makes a couple of changes to the generated
parser: the types of the main parser function and
parseError
(the function named in
%error
) become [Token] -> P a
where
P
is the monad type constructor, and the function must
be polymorphic in a
. In other words,
Happy adds an application of the
<return>
operation defined in the declaration
above, around the result of the parser (parseError
is
affected because it must have the same return type as the
parser). And that's all it does.
This still isn't very useful: all you can do is return
something of monadic type from parseError
. How do you
specify that the productions can also have type P a
?
Most of the time, you don't want a production to have this type:
you'd have to write explicit returnP
s everywhere.
However, there may be a few rules in a grammar that need to get
at the monad, so Happy has a special syntax for
monadic actions:
n : t_1 ... t_n {% <expr> }
The %
in the action indicates that this is a
monadic action, with type P a
, where a
is
the real return type of the production. When
Happy reduces one of these rules, it evaluates the
expression
<expr> `then` \result -> <continue parsing>
Happy uses result
as the real
semantic value of the production. During parsing, several
monadic actions might be reduced, resulting in a sequence
like
<expr1> `then` \r1 -> <expr2> `then` \r2 -> ... return <expr3>
The monadic actions are performed in the order that they are reduced. If we consider the parse as a tree, then reductions happen in a depth-first left-to-right manner. The great thing about adding a monad to your parser is that it doesn't impose any performance overhead for normal reductions - only the monadic ones are translated like this.
Take a look at the Haskell parser for a good illustration of how to use a monad in your parser: it contains examples of all the principles discussed in this section, namely parse errors, a threaded lexer, line/column numbers, and state communication between the parser and lexer.
The following sections consider a couple of uses for monadic parsers, and describe how to also thread the monad through the lexical analyser.
It's not very convenient to just call error
when
a parse error is detected: in a robust setting, you'd like the
program to recover gracefully and report a useful error message
to the user. Exceptions (of which errors are a special case)
are normally implemented in Haskell by using an exception monad,
something like:
data E a = Ok a | Failed String thenE :: E a -> (a -> E b) -> E b m `thenE` k = case m of Ok a -> k a Failed e -> Failed e returnE :: a -> E a returnE a = Ok a failE :: String -> E a failE err = Failed err catchE :: E a -> (String -> E a) -> E a catchE m k = case m of Ok a -> OK a Failed e -> k e
This monad just uses a string as the error type. The
functions thenE
and returnE
are the usual
bind and return operations of the monad, failE
raises an error, and catchE
is a combinator for
handling exceptions.
We can add this monad to the parser with the declaration
%monad { E } { thenE } { returnE }
Now, without changing the grammar, we can change the
definition of parseError
and have something sensible
happen for a parse error:
parseError tokens = failE "Parse error"
The parser now raises an exception in the monad instead of bombing out on a parse error.
We can also generate errors during parsing. There are times when it is more convenient to parse a more general language than that which is actually intended, and check it later. An example comes from Haskell, where the precedence values in infix declarations must be between 0 and 9:
prec :: { Int } : int {% if $1 < 0 || $1 > 9 then failE "Precedence out of range" else returnE $1 }
The monadic action allows the check to be placed in the parser itself, where it belongs.
Happy allows the monad concept to be extended to the lexical analyser, too. This has several useful consequences:
Lexical errors can be treated in the same way as parse errors, using an exception monad.
Information such as the current file and line number can be communicated between the lexer and parser.
General state communication between the parser and lexer - for example, implementation of the Haskell layout rule requires this kind of interaction.
IO operations can be performed in the lexer - this could be useful for following import/include declarations for instance.
A monadic lexer is requested by adding the following declaration to the grammar file:
%lexer { <lexer> } { <eof> }
where <lexer>
is the name of the lexical
analyser function, and <eof>
is a token that
is to be treated as the end of file.
When using a monadic lexer, the parser no longer reads a list of tokens. Instead, it calls the lexical analysis function for each new token to be read. This has the side effect of eliminating the intermediate list of tokens, which is a slight performance win.
The type of the main parser function is now just
P a
- the input is being handled completely
within the monad.
The type of parseError
becomes
Token -> P a
; that is it takes Happy's
current lookahead token as input. This can be useful, because
the error function probably wants to report the token at which
the parse error occurred, and otherwise the lexer would have
to store this token in the monad.
The lexical analysis function must have the following type:
lexer :: (Token -> P a) -> P a
where P
is the monad type constructor declared
with %monad
, and a
can be replaced by the
parser return type if desired.
You can see from this type that the lexer takes a continuation as an argument. The lexer is to find the next token, and pass it to this continuation to carry on with the parse. Obviously, we need to keep track of the input in the monad somehow, so that the lexer can do something different each time it's called!
Let's take the exception monad above, and extend it to add the input string so that we can use it with a threaded lexer.
data ParseResult a = Ok a | Failed String type P a = String -> ParseResult a thenP :: P a -> (a -> P b) -> P b m `thenP` k = \s -> case m s of Ok a -> k a s Failed e -> Failed e returnP :: a -> P a returnP a = \s -> Ok a failP :: String -> P a failP err = \s -> Failed err catchP :: P a -> (String -> P a) -> P a catchP m k = \s -> case m s of Ok a -> OK a Failed e -> k e s
Notice that this isn't a real state monad - the input string just gets passed around, not returned. Our lexer will now look something like this:
lexer :: (Token -> P a) -> P a lexer cont s = ... lexical analysis code ... cont token s'
the lexer grabs the continuation and the input string,
finds the next token token
, and passes it together
with the remaining input string s'
to the
continuation.
We can now indicate lexical errors by ignoring the
continuation and calling failP "error message" s
within the lexer (don't forget to pass the input string to
make the types work out).
This may all seem a bit weird. Why, you ask, doesn't
the lexer just have type P Token
? It was
done this way for performance reasons - this formulation
sometimes means that you can use a reader monad instead of a
state monad for P
, and the reader monad
might be faster. It's not at all clear that this reasoning
still holds (or indeed ever held), and it's entirely possible
that the use of a continuation here is just a
misfeature.
If you want a lexer of type P Token
,
then just define a wrapper to deal with the
continuation:
lexwrap :: (Token -> P a) -> P a lexwrap cont = real_lexer `thenP` \token -> cont token
The {% ... }
actions work fine with
%lexer
, but additionally there are two more
forms which are useful in certain cases. Firstly:
n : t_1 ... t_n {%^ <expr> }
In this case, <expr>
has type
Token -> P a
. That is, Happy passes the
current lookahead token to the monadic action
<expr>
. This is a useful way to get
hold of Happy's current lookahead token without having to
store it in the monad.
n : t_1 ... t_n {%% <expr> }
This is a slight variant on the previous form. The type
of <expr>
is the same, but in this
case the lookahead token is actually discarded and a new token
is read from the input. This can be useful when you want to
change the next token and continue parsing.
Previous versions of Happy had a
%newline
directive that enabled simple line numbers
to be counted by the parser and referenced in the actions. We
warned you that this facility may go away and be replaced by
something more general, well guess what? :-)
Line numbers can now be dealt with quite straightforwardly using a monadic parser/lexer combination. Ok, we have to extend the monad a bit more:
type LineNumber = Int type P a = String -> LineNumber -> ParseResult a getLineNo :: P LineNumber getLineNo = \s l -> Ok l
(the rest of the functions in the monad follow by just adding the extra line number argument in the same way as the input string). Again, the line number is just passed down, not returned: this is OK because of the continuation-based lexer that can change the line number and pass the new one to the continuation.
The lexer can now update the line number as follows:
lexer cont s = case s of '\n':s -> \line -> lexer cont s (line + 1) ... rest of lexical analysis ...
It's as simple as that. Take a look at Happy's own parser if you have the sources lying around, it uses a monad just like the one above.
Reporting the line number of a parse error is achieved
by changing parseError
to look something like
this:
parseError :: Token -> P a parseError = getLineNo `thenP` \line -> failP (show line ++ ": parse error")
We can also get hold of the line number during parsing, to put it in the parsed data structure for future reference. A good way to do this is to have a production in the grammar that returns the current line number:
lineno :: { LineNumber } : {- empty -} {% getLineNo }
The semantic value of lineno
is the line
number of the last token read - this will always be the token
directly following the lineno
symbol in the grammar,
since Happy always keeps one lookahead token in
reserve.
The types of various functions related to the parser are
dependent on what combination of %monad
and
%lexer
directives are present in the grammar. For
reference, we list those types here. In the following types,
t is the return type of the
parser. A type containing a type variable indicates that the
specified function must be polymorphic.
No %monad
or
%lexer
.
parse :: [Token] -> t
parseError :: [Token] -> a
with %monad
.
parse :: [Token] -> P t
parseError :: [Token] -> P a
with %lexer
.
parse :: T t
parseError :: Token -> T a
lexer :: (Token -> T a) -> T a
where the type constructor T
is whatever you want (usually T
a = String -> a
. I'm not sure if this is useful, or even if it works
properly.
with %monad
and %lexer
.
parse :: P t
parseError :: Token -> P a
lexer :: (Token -> P a) -> P a