2.2. Parsing sequences

2.2. Parsing sequences
Prev	Chapter 2. Using Happy	Next

A common feature in grammars is a sequence of a particular syntactic element. In EBNF, we'd write something like n+ to represent a sequence of one or more ns, and n* for zero or more. Happy doesn't support this syntax explicitly, but you can define the equivalent sequences using simple productions.

For example, the grammar for Happy itself contains a rule like this:

prods : prod                   { [$1] }
      | prods prod             { $2 : $1 }

In other words, a sequence of productions is either a single production, or a sequence of productions followed by a single production. This recursive rule defines a sequence of one or more productions.

One thing to note about this rule is that we used left recursion to define it - we could have written it like this:

prods : prod                  { [$1] }
      | prod prods            { $1 : $2 }

The only reason we used left recursion is that Happy is more efficient at parsing left-recursive rules; they result in a constant stack-space parser, whereas right-recursive rules require stack space proportional to the length of the list being parsed. This can be extremely important where long sequences are involved, for instance in automatically generated output. For example, the parser in GHC used to use right-recursion to parse lists, and as a result it failed to parse some Happy-generated modules due to running out of stack space!

One implication of using left recursion is that the resulting list comes out reversed, and you have to reverse it again to get it in the original order. Take a look at the Happy grammar for Haskell for many examples of this.

Parsing sequences of zero or more elements requires a trivial change to the above pattern:

prods : {- empty -}           { [] }
      | prods prod            { $2 : $1 }

Yes - empty productions are allowed. The normal convention is to include the comment {- empty -} to make it more obvious to a reader of the code what's going on.

2.2.1. Sequences with separators

A common type of sequence is one with a separator: for instance function bodies in C consist of statements separated by semicolons. To parse this kind of sequence we use a production like this:

stmts : stmt                   { [$1] }
      | stmts ';' stmt         { $3 : $1 }

If the ; is to be a terminator rather than a separator (i.e. there should be one following each statement), we can remove the semicolon from the above rule and redefine stmt as

stmt : stmt1 ';'              { $1 }

where stmt1 is the real definition of statements.

We might like to allow extra semicolons between statements, to be a bit more liberal in what we allow as legal syntax. We probably just want the parser to ignore these extra semicolons, and not generate a ``null statement'' value or something. The following rule parses a sequence or zero or more statements separated by semicolons, in which the statements may be empty:

stmts : stmts ';' stmt          { $3 : $1 }
      | stmts ';'               { $1 }
      | stmt			{ [$1] }
      | {- empty -}		{ [] }

Parsing sequences of one or more possibly null statements is left as an exercise for the reader...