TcLex documentation - Using tcLex

Using tcLex

Loading the tcLex package

On most systems, tcLex takes the form of a loadable library that contains a Tcl package. To load tcLex, just include this piece of code in your Tcl scripts:

package require tcLex

Tcl's package managing facility also provides many useful features such as versionning. The current tcLex version is 1.1. The above line will load the tcLex package with the highest version number. If you want to ensure that the correct version is loaded (to avoid incompatibilities between versions), append the desired version number to the package command:

package require tcLex 1.1

will load any package whose version number is equal to or greater than 1.1. Adding the -exact switch tells that you want the package with the same version as specified:

package require -exact tcLex 1.1

This option should only be used in specific cases. Packages are supposed to be compatible with other packages with the same major version number but with lower minor.

Precise version control avoids loading a package that is incompatible with the desired one, and also check for the presence of this package. A good package command gives precious information to the user of a software about the packages it depends on.

If loading is successful, a new Tcl command becomes available in the interpreter: the lexer command. The reason why tcLex doesn't define a new namespace is that it creates only one new command whose name is unlikely to conflict with any existing one.

The lexer command

Synopsis

lexer option ?args ... args?

Description

The lexer command creates lexical analyzers (lexers) commands. It creates a new command named name, which can contain namespaces delimiters (::). The newly created command replaces any existing one.

Lexer's syntax is a mixture of the Tcl commands proc and switch. Like proc, it creates a new Tcl command. Like switch, such defined commands are built around a specific control structure similar to switch blocks.

Subcommands & Switches

Lexer takes a subcommand name as its first argument. This can be any of the following options:

create
```
lexer ?create? name ?option ...? rules
lexer ?create? name ?option ...? rule ?rule ...?
```
The create keyword is optional for compatibility with tcLex 1.0, but its use is strongly encouraged as it may become mandatory in future major versions and it also avoids any name conflict with lexers name.
The create subcommand, as its name says, is used to create new lexers. It creates a new command named name, which can contain namespaces delimiters (::). The newly created command replaces any existing one. The return value is empty. On these points, lexer create resembles proc.
Lexer takes a variable number of options, each starting by '-'. Some options take an extra argument as value. Any of these options can be abbreviated if the abbreviation is non ambiguous. The following options are currently supported:

-inclusiveconditions conditionsList
-ic conditionsList
This specifies a list of inclusive conditions that can be used to restrict the rules' field of application.

-exclusiveconditions conditionsList
-ec conditionsList
This specifies a list of exclusive conditions that can be used to restrict the rules' field of application.

-prescript script
-midscript script
-postscript script
These specifies Tcl scripts that will be evaluated at the beginning of the processing (for initialization), in the middle of incremental processing (for returning an intermediate result), and at the end of the processing (for returning a final result).

-resultvariable variableName
This specifies the name of a Tcl variable living in the lexer's contex, holding the result of the processing if no value is returned before its end.

-args args
This specifies the list of extra arguments that can be passed to lexers. It uses the same syntax as with proc.

-lines
This specifies that rule matching must be line-sensitive. This has an impact on the regexp syntax. In particular, the meaning of '^', '$' and '.' is changed. By default the matching is line-insensitive.

-nocase
This specifies that rule matching must be case-insensitive. By default the matching is case-sensitive.

-longest
This specifies that rule matching should prefer longest matching rules rather than first matching rule (the default).

--
Marks the end of options. The arguments following this one will be treated as rules.

Following the options list are the rules. They can be specified using two syntaxes (like switch). The first uses a separate argument for each of the rules specifiers; this form is convenient if substitutions are desired on some of the patterns or commands. The second form places all of the rules specifiers into a single argument; the argument must have proper list structure, with the elements of the list being the rules specifiers. The second form makes it easy to construct multiple rules, since the braces around the whole list make it unnecessary to include a backslash at the end of each line. Since the rules specifiers are in braces in the second form, no command or variable substitutions are performed on them; this makes the behavior of the second form different than the first form in some cases.
current
```
lexer current
```
This returns the name of the currently active lexer. This can be used as a convenience from within lexers script, and is most useful when lexers are renamed or aliased.

Rules specifiers

Lexers are made up of rules which define the behavior of the lexer depending on its input. Rules are specified using the following syntax (rules specifiers):

conditionlist regexp matchvars action

where:

conditionlist is a Tcl list of conditions. It can be equal to "-", in which case it specifies that we use the same list as the previous rule (if any, else it generates an error),
regexp is a regular expression,
matchvars is a list of variable names which will be filled by the matched substrings,
action is a Tcl script that is evaluated when the rule matches the input string. It can be set to "-", in which case it specifies that we use the same action as the next rule (the same way switch does).

Conditionlist is a list of conditions during which the rule is active. Every condition must have been specified by either -inclusiveconditions or -exclusiveconditions switches. If not, it raises an error.

Regexp is a regular expression that uses the same syntax as Tcl's regexp command. It can contain parenthesized subexpressions that correspond to different parts of the matched string. If the rule matches (ie if the rexexp matches) the input string, the variables specified by matchvars are filled with corresponding substrings (the same way regexp does, see its man page).

Action is a Tcl script that is evaluated when the rule matches the input string. This script is evaluated in the lexer's context, the same used by -(pre|mid|post)script and where the matchvars live.

Created lexers commands

Synopsis

lexerName option ?arg arg ...?

Description

The lexer creates new lexer commands that are used like regular Tcl commands. These command takes in turn subcommands that perform several operations.

Subcommands & Switches

Option and the args determine the exact behavior of the command. The following commands are possible for lexers:

Evaluation subcommands
These commands can be invoked from any script. They start a new processing sequence.

lexerName eval string ?arg ... arg?
Processes the specified string and return the lexing result. If an arguments list has been specified for the lexer (using -args), they are passed after string.

lexerName start tokenName string ?arg ... arg?
Processes the maximum number of characters from string, returns the partial lexing result and sets the variable tokenName to the state value used for further processing with continue or finish.

lexerName continue token string
Continues processing started with start, token being the value returned in the tokenName variable by a previous call to start. String is supposed to be larger than for the previous calls to start and continue.

lexerName finish token string
Ends processing started with start, token being the value returned in the tokenName variable by a previous call to start. Like eval, finish processes string to its end.
Internal subcommands
These commands can only be used during processing (when lexerName is active), else they raise an error.

lexerName begin condition
Enters the condition condition, which must be a valid condition (specified by -(inclusive|exclusive)conditions).

lexerName end
Leaves the current condition and returns to the previous.

lexerName conditions ?-current?
Returns the current conditions stack in the form of a Tcl list. The rightmost is the current, which can also be returned if the -current switch is used.

lexerName index
Returns the index of the currently lexed substring, ie the first character of the matched substring within the input string.

lexerName reject
Makes this rule behave as it didn't match at all. The action is evaluated to its end, and after that the lexer searches for another matching rule.

lexerName input ?nbChars?
Consumes nbChars (default to 1) characters from the input string and returns them. If there are less characters remaining in the string, returns all of them (returns an empty string if none remains). These characters won't be matched by future rules unless they are unput.

lexerName unput ?nbChars?
Puts nbChars (default to 1) characters back into the input string. You cannot unput characters past the beginning of the matched string. These characters will be made available for matching by future rules.

How lexers work

Basic Concepts

Lexers' main goal is to read and consume an input string, and to perform a number of tasks according to this input. To do so, lexers are built around a loop that globally performs the following jobs:

read the input string
try to match the input against a number of active rules, determined by the lexer's state
once a rule matches, evaluate its action which can modify the lexer's state
consume the matched chars and continue

Designing a lexer involves writing rules. Rules are made of several parts:

a list of conditions during which the rule is active
a regular expression that is matched against the input
an action script that performs actions whenever the rule matches

Lexers regular expressions are like standard regular expressions, and are covered by the Tcl regexp man page.

Actions are standard Tcl scripts that can do anything, including calling the lexer's subcommands to query or modify its state.

Conditions are one of the key features of lexers, because they tell when a rule is active and can be tried. Basically, conditions are just states that lexers enter and leave. There can be only one active condition for a given active lexer. Thus, to be active, a rule must have the current condition in its active conditions list.

Conditions

In tcLex, the list of valid conditions must be given at lexer creation time. The options -inclusiveconditions (or -ic) and -exclusiveconditions (or -ec) are used to specify the list of inclusive or exclusive conditions that can be used by the lexer. In turn, every rule must specify a list of conditions during which they are active. Conditions are just arbitrary strings.

During processing, the current condition can change. The lexer subcommand begin tells the lexer to enter the specified condition, whereas end tells it to leave the current condition and return to the previous. Conditions are managed in a stack structure, which can be queried using the conditions subcommand.

As written above, there are two kinds of conditions: inclusive and exclusive. At the first sight, the difference between inclusive and exclusive conditions isn't very obvious, and especially how and why they are used. Let's see the following example:

lexer lexclusive -exclusiveconditions {foo} {
  foo {b} {char} {puts "$char matched within foo"}
  {}  a   {char} {puts "$char matched, entering foo"; [lexer current] begin foo}
  {}  .   {char} {puts "$char matched"}
}

lexer linclusive -inclusiveconditions {foo} {
  foo {b} {char} {puts "$char matched within foo"}
  {}  a   {char} {puts "$char matched, entering foo"; [lexer current] begin foo}
  {}  .   {char} {puts "$char matched"}
}

We see that the only difference between both lexers is the condition "foo" being inclusive or exclusive. This slight difference in the definition can make a huge difference in the behavior. We can observe that if we eval both lexers with the same simple string:

% lexclusive eval abcd
a matched, entering foo
b matched within foo
%
% linclusive eval abcd
a matched, entering foo
b matched within foo
c matched
d matched

Here we see that the last 2 characters of the string, 'c' and 'd', get matched by linclusive but not by lexclusive. The reason is that after character 'b' has been matched, both lexer enter the condition "foo", which is exclusive in lexclusive and inclusive in linclusive. When both lexers try to match the next characters 'c' and 'd', lexclusive can't find an appliable rule so evals the default rule (which does nothing, see below Default rule handling), whereas linclusive can apply the rules with empty conditions list even within condition "foo". So the main difference between inclusive and exclusive conditions is that inclusive conditions activates rule with empty conditions list.

Since inclusive and exclusive conditions yelds slightly different results, they are often used in specific, distinct situations. Exclusive conditions are used to write "sub-lexers" to apply a different set of rules to some part of the string. Inclusive conditions are used to override some rules in some situations while maintaining the overall behavior. A good example is the Tcl language itself. Tcl parsing rules are slightly different whether the strings are enclosed between brackets, braces or quotes, but many rules still apply (backslash substitution for example). These rules could be expressed as rules with empty conditions list that are still active within inclusive conditions.

Processing

As stated before, lexers are built around a main processing loop, which goal is to consume the whole input string. To do so, it tries to find a matching rule, and if it succeeds, it evals its action and consume the matched chars.

Since actions can contain arbitrary command, they can modify the lexer's state. Thus, this state may differ before and after the action has been evaluate. The lexer's state can be modified on some points:

changing the active condition (begin and end),
rejecting the rule (reject),
acting on the input (input and unput).

Changing the active is the most simple case. Rejection and input modification are more complex.

Any matching rule can be rejected from within its action. Rejection means that the lexer must behave as if the rule didn't match at all. So rejecting a rule tells the lexer to go on looking for a matching rule. The complexity comes from the fact that the rejected rule can modify the lexer's state before.

Rejection is very useful in a number of cases where normal rule handling would overcomplicate the lexer. Let's consider the following example:

lexer count \
-pre {
  set nbLines 1
  set nbChars 0
} \
-post {
  puts "number of lines: $nbLines"
  puts "number of chars: $nbChars"
} {
  {} \n {} {incr nbLines; [lexer current] reject}
  {} .  {} {incr nbChars}
}

This simple lexer is a character and line counter for strings. At the end of the processing, it prints the total number of lines and characters in the input string. The first line is used to match and count the newline characters. The second counts characters. But since a newline is also a character, it must be counted as such. A first possibility is to add the character counting code to the line counting rule. But although it seems useless in this simple example, it can be very tedious and error-prone in more complex lexers (for instance if the character-counting rule is rewritten to add new features), as duplicating code should be avoided. The second possibility is using rejection. In this example we tell the lexer to reject the line-counting rule, so that it proceeds to the next matching rule which is the character-counting rule. That way, newline chars can be used for line counting while still being seen as regular characters. The real power with reject is that it can be conditionally triggered and cascaded. This provides some sort of inheritance between rules.

The last way to act on the lexer state is acting on the input. The input string can't be modified (unlike with flex), but its consumption can be modified. The lexer subcommands input and unput are respectively used to get extra characters from, and to put back consumed characters to the input. This can allow for dynamic rule writing where characters are not consumed by rule matching but programmatically. A practical use would be to write a rule that only matches the beginning of a sequence, and then manually inputs extra characters in order to perform conditional action. This can help a lot grouping many similar rules into one generic rule. Since such consumed characters can be put back by and rules can be rejected, this provide a very powerful way to write dynamic rules.

Default rule handling

Flex defines a default rule that is evaluated when none matched. Its default behavior is to consume the first unmatched character and copy it to the output. TcLex lexers' default behavior is slightly different: it consumes one char and does nothing else. Literally, the default rule is:

* . {} {}

(or "* .|\n {} {}" when line-sensitive matching is used)

This is because lexers have no idea what the "output" can be. The result can take any form and its handling is not tcLex's responsibility but the lexer writer's. You can override the default behavior by defining your own default rule.

Incremental processing

One of the most powerful features of tcLex is incremental processing. In short, it allows a lexer to be interrupted and resumed depending on the availability of input data. Given the non-threaded, event-based nature of the Tcl language, it thus provides a very powerful data-driven processing scheme. Applications range from interactive user input processing to socket-delivered document parsing (eg. HTML).

The main strength of tcLex's incremental processing facility is its transparency. Lexers need not be specially designed (except from some performance-related precautions and some specific limtations, see Bugs and Limitations below) to be used incrementally, processing scheme being chosen at run-time. Lexers keep state information between sessions, thus allowing concurrent use of several lexing processes. This state processing is completely transparent to users, and such lexers behave like interruptible procs that could keep their state alive -- local variables for example.

The key to incremental processing lies in the three dedicated lexers subcommands: start, continue and finish. A typical incremental processing session is initiated by start, then additional data is processed by further continue commands, finally processing terminates by finish when no more data is available. Each subcommand returns a result that can be partial (start, continue) or final (finish). TcLex's incremental processing is designed so that the final result of an incremental processing session (returned by finish) is the same as the result of a one-pass session (using eval command) on the same complete string.

As an example is worth a thousand words, here is a typical incremental processing session:

# create a lexer named lx
lexer create lx ...

# channel contains the name of an open Tcl channel (eg. file or socket)

# first initiate the process with an empty string. We could also give
# a non-empty string taken from the input channel.
set input ""
lx start token $input
# then get strings from the channel
while {![eof $channel]} {
  # update the input string
  append input "[gets $channel]\n" ;# don't forget the trailing newline
  lx continue $token $input
}
# all data has been read, finish
# result should be equal to [lx eval $input]
set result [lx finish $token $input]

Every start/continue command should return a valid partial result. This could be for example a partial HTML tree ready to be rendered in a web browser. These results are returned by Tcl scripts specified by -midscript, or through a -resultvariable-specified Tcl variable. These can be used to generate a well-formatted result from a partial internal result (eg. an uncomplete tree) so that it can be properly used by outer code.

Another example, this time chunk-oriented rather than line-oriented (using read rather than gets, ie reading a fixed number of chars rather than whole lines):

# create a lexer named lx
lexer create lx ...

# channel contains the name of an open Tcl channel (eg. file or socket)
set chunkSize 1024 ;# number of chars to read at a time

# first initiate the process with an empty string. We could also give
# a non-empty string taken from the input channel.
set input ""
lx start token $input
# then get strings from the channel
while {![eof $channel]} {
  # update the input string
  append input [read $channel $chunkSize]
  lx continue $token $input
}
# all data has been read, finish
# result should be equal to [lx eval $input]
set result [lx finish $token $input]

Bugs and Limitations

Using the input subcommand during incremental processing is potentially dangerous: input cannot return more characters than available at the time the command is issued. Thus, if input wants more characters than available, it will get less characters than during one-pass processing. This contradicts the principle that one-pass and incremental processing yields identical results. Correcting this problem would require a wholely redesigned incremental processing scheme (event-driven for instance, where lexers would get the input themselves rather than be manually called via start/continue/finish).

Send questions, comments, requests, etc. to the author: Frédéric BONNET <frederic.bonnet@ciril.fr>.