5.3. Wrappers

5.3. Wrappers
Prev	Chapter 5. The Interface to an Alex-generated lexer	Next

To use one of the provided wrappers, include the following declaration in your file:

%wrapper "name"

where name is the name of the wrapper, eg. basic. The following sections describe each of the wrappers that come with Alex.

5.3.1. The "basic" wrapper

The basic wrapper is a good way to obtain a function of type String -> [token] from a lexer specification, with little fuss.

It provides definitions for AlexInput, alexGetByte and alexInputPrevChar that are suitable for lexing a String input. It also provides a function alexScanTokens which takes a String input and returns a list of the tokens it contains.

The basic wrapper provides no support for using startcodes; the initial startcode is always set to zero.

Here is the actual code included in the lexer when the basic wrapper is selected:

type AlexInput = (Char,      -- previous char
                  [Byte],    -- rest of the bytes for the current char
                  String)    -- rest of the input string

alexGetByte :: AlexInput -> Maybe (Byte,AlexInput)
alexGetByte (c,(b:bs),s) = Just (b,(c,bs,s))
alexGetByte (c,[],[])    = Nothing
alexGetByte (_,[],(c:s)) = case utf8Encode c of
                             (b:bs) -> Just (b, (c, bs, s))

alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar (c,_) = c

-- alexScanTokens :: String -> [token]
alexScanTokens str = go ('\n',[],str)
  where go inp@(_,_bs,str) =
          case alexScan inp 0 of
                AlexEOF -> []
                AlexError _ -> error "lexical error"
                AlexSkip  inp' len     -> go inp'
                AlexToken inp' len act -> act (take len str) : go inp'

The type signature for alexScanTokens is commented out, because the token type is unknown. All of the actions in your lexical specification should have type:

{ ... } :: String -> token

for some type token.

For an example of the use of the basic wrapper, see the file examples/Tokens.x in the Alex distribution.

5.3.2. The "posn" wrapper

The posn wrapper provides slightly more functionality than the basic wrapper: it keeps track of line and column numbers of tokens in the input text.

The posn wrapper provides the following, in addition to the straightforward definitions of alexGetByte and alexInputPrevChar:

data AlexPosn = AlexPn !Int  -- absolute character offset
                       !Int  -- line number
                       !Int  -- column number

type AlexInput = (AlexPosn,     -- current position,
                  Char,         -- previous char
                  [Byte],       -- rest of the bytes for the current char
                  String)       -- current input string

--alexScanTokens :: String -> [token]
alexScanTokens str = go (alexStartPos,'\n',[],str)
  where go inp@(pos,_,_,str) =
          case alexScan inp 0 of
                AlexEOF -> []
                AlexError ((AlexPn _ line column),_,_,_) -> error $ "lexical error at " ++ (show line) ++ " line, " ++ (show column) ++ " column"
                AlexSkip  inp' len     -> go inp'
                AlexToken inp' len act -> act pos (take len str) : go inp'

The types of the token actions should be:

{ ... } :: AlexPosn -> String -> token

For an example using the posn wrapper, see the file examples/Tokens_posn.x in the Alex distribution.

5.3.3. The "monad" wrapper

The monad wrapper is the most flexible of the wrappers provided with Alex. It includes a state monad which keeps track of the current input and text position, and the startcode. It is intended to be a template for building your own monads - feel free to copy the code and modify it to build a monad with the facilities you need.

data AlexState = AlexState {
        alex_pos :: !AlexPosn,  -- position at current input location
        alex_inp :: String,     -- the current input
        alex_chr :: !Char,      -- the character before the input
        alex_bytes :: [Byte],   -- rest of the bytes for the current char
        alex_scd :: !Int        -- the current startcode
    }

newtype Alex a = Alex { unAlex :: AlexState
                               -> Either String (AlexState, a) }

runAlex          :: String -> Alex a -> Either String a

alexGetInput     :: Alex AlexInput
alexSetInput     :: AlexInput -> Alex ()

alexError        :: String -> Alex a

alexGetStartCode :: Alex Int
alexSetStartCode :: Int -> Alex ()

To invoke a scanner under the monad wrapper, use alexMonadScan:

alexMonadScan :: Alex result

The token actions should have the following type:

type AlexAction result = AlexInput -> Int -> Alex result
{ ... }  :: AlexAction result

The Alex file must also define a function alexEOF, which will be executed on when the end-of-file is scanned:

alexEOF :: Alex result

The monad wrapper also provides some useful combinators for constructing token actions:

-- skip :: AlexAction result
skip input len = alexMonadScan

-- andBegin :: AlexAction result -> Int -> AlexAction result
(act `andBegin` code) input len = do alexSetStartCode code; act input len

-- begin :: Int -> AlexAction result
begin code = skip `andBegin` code

-- token :: (String -> Int -> token) -> AlexAction token
token t input len = return (t input len)

5.3.4. The "monadUserState" wrapper

The monadUserState wrapper is built upon the monad wrapper. It includes a reference to a type which must be defined in the user's program, AlexUserState, and a call to an initialization function which must also be defined in the user's program, alexInitUserState. It gives great flexibility because it is now possible to add any needed information and carry it during the whole lexing phase.

The generated code is the same as in the monad wrapper, except in 3 places:

1) The definition of the general state, which now refers to a type (AlexUserState) that must be defined in the Alex file.

data AlexState = AlexState {
        alex_pos :: !AlexPosn,  -- position at current input location
        alex_inp :: String,     -- the current input
        alex_chr :: !Char,      -- the character before the input
        alex_bytes :: [Byte],   -- rest of the bytes for the current char
        alex_scd :: !Int,       -- the current startcode
        alex_ust :: AlexUserState -- AlexUserState will be defined in the user program
    }

2) The initialization code, where a user-specified routine (alexInitUserState) will be called.

runAlex :: String -> Alex a -> Either String a
runAlex input (Alex f) 
   = case f (AlexState {alex_pos = alexStartPos,
                        alex_inp = input,       
                        alex_chr = '\n',
                        alex_bytes = [],
                        alex_ust = alexInitUserState,
                        alex_scd = 0}) of Left msg -> Left msg
                                          Right ( _, a ) -> Right a

3) Two helper functions (alexGetUserState and alexSetUserState) are defined.

alexGetUserState :: Alex AlexUserState
alexSetUserState :: AlexUserState -> Alex ()

Here is an example of code in the user's Alex file defining the type and function:

data AlexUserState = AlexUserState
                   {
                       lexerCommentDepth  :: Int
                     , lexerStringValue   :: String
                   }

alexInitUserState :: AlexUserState
alexInitUserState = AlexUserState
                   {
                       lexerCommentDepth  = 0
                     , lexerStringValue   = ""
                   }

getLexerCommentDepth :: Alex Int
getLexerCommentDepth = do ust <- alexGetUserState; return (lexerCommentDepth ust)

setLexerCommentDepth :: Int -> Alex ()
setLexerCommentDepth ss = do ust <- alexGetUserState; alexSetUserState ust{lexerCommentDepth=ss}

getLexerStringValue :: Alex String
getLexerStringValue = do ust <- alexGetUserState; return (lexerStringValue ust)

setLexerStringValue :: String -> Alex ()
setLexerStringValue ss = do ust <- alexGetUserState; alexSetUserState ust{lexerStringValue=ss}

addCharToLexerStringValue :: Char -> Alex ()
addCharToLexerStringValue c = do ust <- alexGetUserState; alexSetUserState ust{lexerStringValue=c:(lexerStringValue ust)}

5.3.5. The "gscan" wrapper

The gscan wrapper is provided mainly for historical reasons: it exposes an interface which is very similar to that provided by Alex version 1.x. The interface is intended to be very general, allowing actions to modify the startcode, and pass around an arbitrary state value.

alexGScan :: StopAction state result -> state -> String -> result

type StopAction state result 
         = AlexPosn -> Char -> String -> (Int,state) -> result

The token actions should all have this type:

{ ... }      :: AlexPosn                -- token position
             -> Char                    -- previous character
             -> String                  -- input string at token
             -> Int                     -- length of token
             -> ((Int,state) -> result) -- continuation
             -> (Int,state)             -- current (startcode,state)
             -> result

5.3.6. The bytestring wrappers

The basic-bytestring, posn-bytestring and monad-bytestring wrappers are variations on the basic, posn and monad wrappers that use lazy ByteStrings as the input and token types instead of an ordinary String.

The point of using these wrappers is that ByteStrings provide a more memory efficient representation of an input stream. They can also be somewhat faster to process. Note that using these wrappers adds a dependency on the ByteString modules, which live in the bytestring package (or in the base package in ghc-6.6)

As mentioned earlier (Section 5.1, “Unicode and UTF-8”), Alex lexers internally process a UTF-8 encoded string of bytes. This means that the ByteString supplied as input when using one of the ByteString wrappers should be UTF-8 encoded (or use the --latin1 option).

Do note that token provides a lazy ByteString which is not the most compact representation for short strings. You may want to convert to a strict ByteString or perhaps something more compact still. Note also that by default tokens share space with the input ByteString which has the advantage that it does not need to make a copy but it also prevents the input from being garbage collected. It may make sense in some applications to use ByteString's copy function to unshare tokens that will be kept for a long time, to allow the original input to be collected.

5.3.6.1. The "basic-bytestring" wrapper

The basic-bytestring wrapper is the same as the basic wrapper but with lazy ByteString instead of String:

import qualified Data.ByteString.Lazy as ByteString

type AlexInput = (Char,       -- previous char
                  ByteString.ByteString) -- current input string

alexGetByte :: AlexInput -> Maybe (Char,AlexInput)

alexInputPrevChar :: AlexInput -> Char

-- alexScanTokens :: String -> [token]

All of the actions in your lexical specification should have type:

{ ... } :: ByteString.ByteString -> token

for some type token.

5.3.6.2. The "posn-bytestring" wrapper

The posn-bytestring wrapper is the same as the posn wrapper but with lazy ByteString instead of String:

import qualified Data.ByteString.Lazy as ByteString

type AlexInput = (AlexPosn,   -- current position,
                  Char,       -- previous char
                  ByteString.ByteString) -- current input string

-- alexScanTokens :: ByteString.ByteString -> [token]

All of the actions in your lexical specification should have type:

{ ... } :: AlexPosn -> ByteString.ByteString -> token

for some type token.

5.3.6.3. The "monad-bytestring" wrapper

The monad-bytestring wrapper is the same as the monad wrapper but with lazy ByteString instead of String:

import qualified Data.ByteString.Lazy as ByteString

data AlexState = AlexState {
        alex_pos :: !AlexPosn,  -- position at current input location
        alex_inp :: ByteString.ByteString, -- the current input
        alex_chr :: !Char,      -- the character before the input
        alex_scd :: !Int        -- the current startcode
    }

newtype Alex a = Alex { unAlex :: AlexState
                               -> Either String (AlexState, a) }

runAlex          :: ByteString.ByteString -> Alex a -> Either String a

-- token :: (ByteString.ByteString -> Int -> token) -> AlexAction token

All of the actions in your lexical specification have the same type as in the monad wrapper. It is only the types of the function to run the monad and the type of the token function that change.

5.3.6.4. The "monadUserState-bytestring" wrapper

The monadUserState-bytestring wrapper is the same as the monadUserState wrapper but with lazy ByteString instead of String:

import qualified Data.ByteString.Lazy as ByteString

ata AlexState = AlexState {
        alex_pos :: !AlexPosn,  -- position at current input location
        alex_inp :: ByteString.ByteString, -- the current input
        alex_chr :: !Char,      -- the character before the input
        alex_scd :: !Int        -- the current startcode
      , alex_ust :: AlexUserState -- AlexUserState will be defined in the user program
    }

newtype Alex a = Alex { unAlex :: AlexState
                               -> Either String (AlexState, a) }

runAlex          :: ByteString.ByteString -> Alex a -> Either String a

-- token :: (ByteString.ByteString -> Int -> token) -> AlexAction token

All of the actions in your lexical specification have the same type as in the monadUserState wrapper. It is only the types of the function to run the monad and the type of the token function that change.

Prev	Up	Next
5.2. Basic interface	Home	Chapter 6. Invoking Alex