Chapter 5. The Interface to an Alex-generated lexer

Table of Contents

5.1. Unicode and UTF-8
5.2. Basic interface
5.3. Wrappers
5.3.1. The "basic" wrapper
5.3.2. The "posn" wrapper
5.3.3. The "monad" wrapper
5.3.4. The "monadUserState" wrapper
5.3.5. The "gscan" wrapper
5.3.6. The bytestring wrappers
5.3.6.1. The "basic-bytestring" wrapper
5.3.6.2. The "posn-bytestring" wrapper
5.3.6.3. The "monad-bytestring" wrapper
5.3.6.4. The "monadUserState-bytestring" wrapper

This chapter answers the question: "How do I include an Alex lexer in my program?"

Alex gives you a great deal of flexibility in how the lexer is exposed to the rest of the program. For instance, there's no need to parse a String directly if you have some special character-buffer operations that avoid the overheads of ordinary Haskell Strings. You might want Alex to keep track of the line and column number in the input text, or you might wish to do it yourself (perhaps you use a different tab width from the standard 8 columns, for example).

The general story is this: Alex provides a basic interface to the generated lexer (described in Section 5.2), which you can use to parse tokens given an abstract input type with operations over it. You also have the option of including a wrapper, which provides a higher-level abstraction over the basic interface; Alex comes with several wrappers (Section 5.3).

5.1. Unicode and UTF-8

Lexer specifications are written in terms of Unicode characters, but Alex works internally on a UTF-8 encoded byte sequence.

Whether Alex's internal use of UTF-8 affects you depends on how you use it. If you use one of the wrappers (below) that takes input from a Haskell String, then the UTF-8 encoding is handled automatically. However, if you take input from a ByteString, then it is your responsibility to ensure that the input is properly UTF-8 encoded.
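
To make the encoding step concrete, here is a minimal sketch of what "UTF-8 encoded" means for a Haskell String. The names encodeCharUtf8 and encodeStringUtf8 are hypothetical helpers written for this illustration; they are not part of Alex's interface, and the String-based wrappers do their own encoding. The sketch uses only the base library and does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one Unicode character as its UTF-8 byte sequence.
-- Hypothetical helper for illustration only; not an Alex API.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c = map fromIntegral (go (ord c))
  where
    go :: Int -> [Int]
    go n
      -- 1 byte: U+0000 .. U+007F (plain ASCII)
      | n <= 0x7f   = [n]
      -- 2 bytes: U+0080 .. U+07FF
      | n <= 0x7ff  = [ 0xc0 + (n `shiftR` 6)
                      , 0x80 + (n .&. 0x3f) ]
      -- 3 bytes: U+0800 .. U+FFFF
      | n <= 0xffff = [ 0xe0 + (n `shiftR` 12)
                      , 0x80 + ((n `shiftR` 6) .&. 0x3f)
                      , 0x80 + (n .&. 0x3f) ]
      -- 4 bytes: U+10000 .. U+10FFFF
      | otherwise   = [ 0xf0 + (n `shiftR` 18)
                      , 0x80 + ((n `shiftR` 12) .&. 0x3f)
                      , 0x80 + ((n `shiftR` 6) .&. 0x3f)
                      , 0x80 + (n .&. 0x3f) ]

-- | Encode a whole String; this is the byte sequence a lexer that
-- works on UTF-8 input would see for that text.
encodeStringUtf8 :: String -> [Word8]
encodeStringUtf8 = concatMap encodeCharUtf8
```

An ASCII character stays a single byte, while a character such as 'é' (U+00E9) becomes two bytes. This is why a ByteString that was produced with a different encoding will not match the Unicode characters written in the lexer specification.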

None of this applies if you pass the --latin1 option to Alex. In that case, the input is just a sequence of 8-bit bytes, interpreted as characters in the Latin-1 character set.
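
For comparison, the Latin-1 interpretation can be sketched as follows. The helpers latin1Decode and latin1Encode are hypothetical names used only for this illustration and are not provided by Alex. The point is that every byte corresponds to exactly one character with the same code (0 to 255), so no multi-byte sequences arise.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Interpret each byte as the Latin-1 character with the same code.
-- Hypothetical helper for illustration only; not an Alex API.
latin1Decode :: [Word8] -> String
latin1Decode = map (chr . fromIntegral)

-- | Convert a String to Latin-1 bytes. Returns Nothing if any
-- character lies outside the Latin-1 range (code above 0xFF).
latin1Encode :: String -> Maybe [Word8]
latin1Encode = traverse enc
  where
    enc :: Char -> Maybe Word8
    enc c
      | ord c <= 0xff = Just (fromIntegral (ord c))
      | otherwise     = Nothing
```

Text containing characters beyond U+00FF cannot be represented in this scheme, which is the trade-off for the simpler one-byte-per-character input.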