To use one of the provided wrappers, include the following declaration in your file:
%wrapper "name
"
where name
is the name of the
wrapper, eg. basic
. The following sections
describe each of the wrappers that come with Alex.
The basic wrapper is a good way to obtain a function of
type String -> [token]
from a lexer
specification, with little fuss.
It provides definitions for
AlexInput
, alexGetByte
and alexInputPrevChar
that are suitable for
lexing a String
input. It also provides a
function alexScanTokens
which takes a
String
input and returns a list of the
tokens it contains.
The basic
wrapper provides no support
for using startcodes; the initial startcode is always set to
zero.
Here is the actual code included in the lexer when the basic wrapper is selected:
type AlexInput = (Char, -- previous char [Byte], -- rest of the bytes for the current char String) -- rest of the input string alexGetByte :: AlexInput -> Maybe (Byte,AlexInput) alexGetByte (c,(b:bs),s) = Just (b,(c,bs,s)) alexGetByte (c,[],[]) = Nothing alexGetByte (_,[],(c:s)) = case utf8Encode c of (b:bs) -> Just (b, (c, bs, s)) alexInputPrevChar :: AlexInput -> Char alexInputPrevChar (c,_) = c -- alexScanTokens :: String -> [token] alexScanTokens str = go ('\n',[],str) where go inp@(_,_bs,str) = case alexScan inp 0 of AlexEOF -> [] AlexError _ -> error "lexical error" AlexSkip inp' len -> go inp' AlexToken inp' len act -> act (take len str) : go inp'
The type signature for alexScanTokens
is commented out, because the token
type is
unknown. All of the actions in your lexical specification
should have type:
{ ... } :: String -> token
for some type token
.
For an example of the use of the basic wrapper, see the
file examples/Tokens.x
in the Alex
distribution.
The posn wrapper provides slightly more functionality than the basic wrapper: it keeps track of line and column numbers of tokens in the input text.
The posn wrapper provides the following, in addition to
the straightforward definitions of
alexGetByte
and
alexInputPrevChar
:
data AlexPosn = AlexPn !Int -- absolute character offset !Int -- line number !Int -- column number type AlexInput = (AlexPosn, -- current position, Char, -- previous char [Byte], -- rest of the bytes for the current char String) -- current input string --alexScanTokens :: String -> [token] alexScanTokens str = go (alexStartPos,'\n',[],str) where go inp@(pos,_,_,str) = case alexScan inp 0 of AlexEOF -> [] AlexError ((AlexPn _ line column),_,_,_) -> error $ "lexical error at " ++ (show line) ++ " line, " ++ (show column) ++ " column" AlexSkip inp' len -> go inp' AlexToken inp' len act -> act pos (take len str) : go inp'
The types of the token actions should be:
{ ... } :: AlexPosn -> String -> token
For an example using the posn
wrapper, see the file
examples/Tokens_posn.x
in the Alex
distribution.
The monad
wrapper is the most
flexible of the wrappers provided with Alex. It includes a
state monad which keeps track of the current input and text
position, and the startcode. It is intended to be a template
for building your own monads - feel free to copy the code and
modify it to build a monad with the facilities you
need.
data AlexState = AlexState { alex_pos :: !AlexPosn, -- position at current input location alex_inp :: String, -- the current input alex_chr :: !Char, -- the character before the input alex_bytes :: [Byte], -- rest of the bytes for the current char alex_scd :: !Int -- the current startcode } newtype Alex a = Alex { unAlex :: AlexState -> Either String (AlexState, a) } runAlex :: String -> Alex a -> Either String a alexGetInput :: Alex AlexInput alexSetInput :: AlexInput -> Alex () alexError :: String -> Alex a alexGetStartCode :: Alex Int alexSetStartCode :: Int -> Alex ()
To invoke a scanner under the monad
wrapper, use alexMonadScan
:
alexMonadScan :: Alex result
The token actions should have the following type:
type AlexAction result = AlexInput -> Int -> Alex result { ... } :: AlexAction result
The Alex file must also define a function
alexEOF
, which will be executed on when the
end-of-file is scanned:
alexEOF :: Alex result
The monad
wrapper also provides some
useful combinators for constructing token actions:
-- skip :: AlexAction result skip input len = alexMonadScan -- andBegin :: AlexAction result -> Int -> AlexAction result (act `andBegin` code) input len = do alexSetStartCode code; act input len -- begin :: Int -> AlexAction result begin code = skip `andBegin` code -- token :: (String -> Int -> token) -> AlexAction token token t input len = return (t input len)
The monadUserState
wrapper is built
upon the monad
wrapper. It includes a reference
to a type which must be defined in the user's program,
AlexUserState
, and a call to an initialization
function which must also be defined in the user's program,
alexInitUserState
. It gives great flexibility
because it is now possible to add any needed information and carry
it during the whole lexing phase.
The generated code is the same as in the monad
wrapper, except in 3 places:
1) The definition of the general state, which now refers to a
type (AlexUserState
) that must be defined in the Alex file.
data AlexState = AlexState { alex_pos :: !AlexPosn, -- position at current input location alex_inp :: String, -- the current input alex_chr :: !Char, -- the character before the input alex_bytes :: [Byte], -- rest of the bytes for the current char alex_scd :: !Int, -- the current startcode alex_ust :: AlexUserState -- AlexUserState will be defined in the user program }
2) The initialization code, where a user-specified routine (alexInitUserState
) will be
called.
runAlex :: String -> Alex a -> Either String a runAlex input (Alex f) = case f (AlexState {alex_pos = alexStartPos, alex_inp = input, alex_chr = '\n', alex_bytes = [], alex_ust = alexInitUserState, alex_scd = 0}) of Left msg -> Left msg Right ( _, a ) -> Right a
3) Two helper functions (alexGetUserState
and alexSetUserState
) are defined.
alexGetUserState :: Alex AlexUserState alexSetUserState :: AlexUserState -> Alex ()
Here is an example of code in the user's Alex file defining the type and function:
data AlexUserState = AlexUserState { lexerCommentDepth :: Int , lexerStringValue :: String } alexInitUserState :: AlexUserState alexInitUserState = AlexUserState { lexerCommentDepth = 0 , lexerStringValue = "" } getLexerCommentDepth :: Alex Int getLexerCommentDepth = do ust <- alexGetUserState; return (lexerCommentDepth ust) setLexerCommentDepth :: Int -> Alex () setLexerCommentDepth ss = do ust <- alexGetUserState; alexSetUserState ust{lexerCommentDepth=ss} getLexerStringValue :: Alex String getLexerStringValue = do ust <- alexGetUserState; return (lexerStringValue ust) setLexerStringValue :: String -> Alex () setLexerStringValue ss = do ust <- alexGetUserState; alexSetUserState ust{lexerStringValue=ss} addCharToLexerStringValue :: Char -> Alex () addCharToLexerStringValue c = do ust <- alexGetUserState; alexSetUserState ust{lexerStringValue=c:(lexerStringValue ust)}
The gscan
wrapper is provided mainly
for historical reasons: it exposes an interface which is very
similar to that provided by Alex version 1.x. The interface
is intended to be very general, allowing actions to modify the
startcode, and pass around an arbitrary state value.
alexGScan :: StopAction state result -> state -> String -> result type StopAction state result = AlexPosn -> Char -> String -> (Int,state) -> result
The token actions should all have this type:
{ ... } :: AlexPosn -- token position -> Char -- previous character -> String -- input string at token -> Int -- length of token -> ((Int,state) -> result) -- continuation -> (Int,state) -- current (startcode,state) -> result
The basic-bytestring
,
posn-bytestring
and
monad-bytestring
wrappers are variations on the
basic
, posn
and
monad
wrappers that use lazy
ByteString
s as the input and token types instead of
an ordinary String
.
The point of using these wrappers is that
ByteString
s provide a more memory efficient
representation of an input stream. They can also be somewhat faster to
process. Note that using these wrappers adds a dependency
on the ByteString
modules, which live in the
bytestring
package (or in the
base
package in ghc-6.6
)
As mentioned earlier (Section 5.1, “Unicode and UTF-8”), Alex
lexers internally process a UTF-8 encoded string of bytes.
This means that the ByteString
supplied
as input when using one of the ByteString wrappers should be
UTF-8 encoded (or use the --latin1
option).
Do note that token
provides a
lazy ByteString
which is not
the most compact representation for short strings. You may want to
convert to a strict ByteString
or perhaps something
more compact still. Note also that by default tokens share space with
the input ByteString
which has the advantage that it
does not need to make a copy but it also prevents the input from being
garbage collected. It may make sense in some applications to use
ByteString
's copy
function to
unshare tokens that will be kept for a long time, to allow the original
input to be collected.
The basic-bytestring
wrapper is the same as
the basic
wrapper but with lazy
ByteString
instead of String
:
import qualified Data.ByteString.Lazy as ByteString type AlexInput = (Char, -- previous char ByteString.ByteString) -- current input string alexGetByte :: AlexInput -> Maybe (Char,AlexInput) alexInputPrevChar :: AlexInput -> Char -- alexScanTokens :: String -> [token]
All of the actions in your lexical specification should have type:
{ ... } :: ByteString.ByteString -> token
for some type token
.
The posn-bytestring
wrapper is the same as
the posn
wrapper but with lazy
ByteString
instead of String
:
import qualified Data.ByteString.Lazy as ByteString type AlexInput = (AlexPosn, -- current position, Char, -- previous char ByteString.ByteString) -- current input string -- alexScanTokens :: ByteString.ByteString -> [token]
All of the actions in your lexical specification should have type:
{ ... } :: AlexPosn -> ByteString.ByteString -> token
for some type token
.
The monad-bytestring
wrapper is the same as
the monad
wrapper but with lazy
ByteString
instead of String
:
import qualified Data.ByteString.Lazy as ByteString data AlexState = AlexState { alex_pos :: !AlexPosn, -- position at current input location alex_inp :: ByteString.ByteString, -- the current input alex_chr :: !Char, -- the character before the input alex_scd :: !Int -- the current startcode } newtype Alex a = Alex { unAlex :: AlexState -> Either String (AlexState, a) } runAlex :: ByteString.ByteString -> Alex a -> Either String a -- token :: (ByteString.ByteString -> Int -> token) -> AlexAction token
All of the actions in your lexical specification
have the same type as in the monad
wrapper. It is
only the types of the function to run the monad and the type of the
token
function that change.
The monadUserState-bytestring
wrapper is the same as
the monadUserState
wrapper but with lazy
ByteString
instead of String
:
import qualified Data.ByteString.Lazy as ByteString ata AlexState = AlexState { alex_pos :: !AlexPosn, -- position at current input location alex_inp :: ByteString.ByteString, -- the current input alex_chr :: !Char, -- the character before the input alex_scd :: !Int -- the current startcode , alex_ust :: AlexUserState -- AlexUserState will be defined in the user program } newtype Alex a = Alex { unAlex :: AlexState -> Either String (AlexState, a) } runAlex :: ByteString.ByteString -> Alex a -> Either String a -- token :: (ByteString.ByteString -> Int -> token) -> AlexAction token
All of the actions in your lexical specification
have the same type as in the monadUserState
wrapper. It is
only the types of the function to run the monad and the type of the
token
function that change.