TcLex documentation - Complementary information

Complementary information

Using regexps under Tcl 8.1

Scriptics maintains a very good regexp-HOWTO with much useful information on how to use regexps under Tcl 8.1.

Flex vs tcLex: a side-by-side comparison

TcLex borrowed many of its concepts from flex and tried to adapt them to the Tcl philosophy. There are thus areas where tcLex and flex are really similar and interchangeable, while some features are different enough to prevent an easy move from one to the other.

Syntax

The most obvious difference is the syntax: flex uses its own whereas tcLex uses Tcl's. TcLex tried to borrow as many existing Tcl constructs as possible to make it easy to use for Tcl programmers while remaining easy to understand for flex programmers. For example, tcLex borrowed its syntax from the switch command and most of its procedural model from proc. Flex uses a specific syntax for rules specifiers and C syntax for actions (i.e. rules scripts). In some areas, tcLex has higher level features than flex.

TcLex uses list syntax for rules specifiers and thus requires that a fixed number of elements (four) be specified for each rule. Flex is more permissive because some specifiers can be omitted. For example, the conditions list defaults to empty in flex whereas it must be specified as an empty list in tcLex. Moreover, tcLex also needs a list of match variables used to report the matched parts of the input string, the same way regexp does. Here is a short example:

flex syntax:

<cond1>"\n" {
  /* matches newlines within condition cond1 */
}
"\n" {
  /* matches newlines with empty conditions specifier */
}

tcLex syntax:

{cond1} "\n" {} {
  # matches newlines within condition cond1
}
{} "\n" {} {
  # matches newlines with empty conditions specifier
}

Note the empty condition list in the second tcLex rule, as well as the (here empty) match variables list. Apart from that, flex and tcLex have very similar syntaxes.

The second visible difference between flex and tcLex is that the latter requires an extra list of match variables. This is a convenient way to get the matched parts of the input string, and is inspired by the regexp command syntax. Getting the matched string with flex requires using the yytext variable, and there is no subexpression reporting (see Feedback below).

The third syntactical difference is tcLex's lack of name definitions. These are macro definitions that allow for simpler definition of flex's regular expressions. There are no name definitions in tcLex because they can be handled by Tcl itself using variable substitutions. For example:

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+"."{DIGIT}*        {
	printf( "A float: %s (%g)\n", yytext,
			atof( yytext ) );
}

{DIGIT}+    {
	printf( "An integer: %s (%d)\n", yytext,
			atoi( yytext ) );
}

Note the two name definitions DIGIT and ID. They are subsequently used in the regular expressions to simplify their writing and also make them more readable. In tcLex variables must be used:

set DIGIT    {[0-9]}
set ID       {[a-z][a-z0-9]*}

lexer lx \
{} ${DIGIT}+.${DIGIT}* {text} {
	puts "A float: $text ([expr double($text)])"
} \
{} ${DIGIT}+ {text} {
	puts "An integer: $text ([expr int($text)])"
}

Note the use of variable substitutions in regexps. To make variable substitutions work, you also have to use the "rules-as-arguments" lexer style (see Lexing With Style): note the backslash at the end of every line. The other noticeable difference is the use of a variable named text to report the matched string; flex uses yytext for that purpose. Be careful to respect Tcl's syntax when defining variables: here the name definitions have to be enclosed in braces to avoid bracket substitution. The regexps are the same except for the added '$' for variable substitution and the removal of the inner double quotes. Both lexers should work the same on well-formed numbers. Note, however, that flex's quoted "." is a literal dot whereas an unquoted '.' in a Tcl regexp matches any character; write \\. (which reaches the regexp engine as \.) if only a literal dot must be accepted.
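
The substitution and quoting points above can be checked with the plain regexp command, without tcLex. This is only a small sketch: the variable names float and loose are made up for the illustration, and the patterns are anchored with '^' just to keep the test simple.

```tcl
# Braces keep Tcl from treating [0-9] as a command substitution.
set DIGIT {[0-9]}

# Inside double quotes, \\. becomes \. once Tcl has done its substitutions,
# so the regexp engine sees a literal dot.
set float "^${DIGIT}+\\.${DIGIT}*"

# Same pattern with a bare '.', which matches any character.
set loose "^${DIGIT}+.${DIGIT}*"

puts [regexp -- $float 1.5]   ;# 1
puts [regexp -- $float 1x5]   ;# 0
puts [regexp -- $loose 1x5]   ;# 1
```

The same ${DIGIT} substitutions work unchanged inside rules written in the rules-as-arguments style.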

Finally, there are subtle differences in regular expression syntax (tcLex uses Tcl's syntax). First is the absence of trailing contexts in tcLex (syntax: r/s, where r is the regular expression and s is the trailing context). This shouldn't be a big problem, except when converting from flex to tcLex. Second is line-sensitivity. Flex uses line-sensitive regexps whereas Tcl uses line-insensitive ones by default. Since there is no support for line-sensitivity in Tcl 8.0 and Tcl 8.1 requires a specific syntax, tcLex provides its own portable line-sensitivity through the -lines switch. Line-sensitivity changes the meaning of the special characters '^', '$' and '.' within regexps. With line-insensitive regexps, '^' and '$' respectively match the beginning and end of the string, and '.' matches any character. With line-sensitive regexps, '^' and '$' respectively match the beginning and end of lines, and '.' matches any character but the newline.
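
To see what line-sensitivity changes in practice, here is a small sketch using the plain regexp command. It assumes Tcl 8.1 or later and uses regexp's own -line switch, not tcLex's -lines switch, so it only illustrates the change in meaning of '^', '$' and '.'.

```tcl
set input "first line\nsecond line"

# Line-insensitive (Tcl's default): '.' also matches the newline, and
# '^' and '$' only anchor at the two ends of the whole string.
regexp {^.*$} $input whole

# Line-sensitive: '.' stops at the newline, '^' and '$' anchor at line ends.
regexp -line {^.*$} $input firstLine

puts [string equal $whole $input]      ;# 1
puts $firstLine                        ;# first line
puts [regexp {^second} $input]         ;# 0
puts [regexp -line {^second} $input]   ;# 1
```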

Feedback

To report the matched string, flex uses the special string variable yytext. TcLex uses a quite different method that is closer to the Tcl philosophy (i.e. the regexp command) and also more powerful in terms of feedback. With flex, parentheses are only used to override precedence in regexps. However, Tcl regexps also use parentheses for reporting subexpressions. To keep this feature, tcLex needs variables to report matched substrings, the same way regexp does. Thus there is no predefined yytext variable but per-rule user-defined variables. This can allow for elegant constructs when several rules share the same action. Let's consider the following example, where several syntactically different but conceptually similar constructs have to be matched:

lexer lexCComments {
  {} "(/\\*)([^*]*)\\*/"        {text style comment} -
  -  "(//)(([^\\\\\n]|\\\\.)*)" {text style comment} {
    # C/C++ comments
    switch -- $style {
      /* {set lang C}
      // {set lang C++}
    }
    puts "$lang comment matched: $comment"
  }
}

The two rules are used to match C and C++ comments. C comments are any characters enclosed between "/*" and "*/". C++ comments are any characters after "//" until the end of a line not ending with '\' (note the quoting hell required for '\' to work). Here the same action is used for two different rules, but using the same variables for reporting. This greatly simplifies rules writing as it allows using the same action for several similar yet different rules. Note that the order of the variables may differ.
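
The reporting mechanism itself comes straight from the regexp command, so it can be tried without tcLex. The following is only a sketch: describeComment is a hypothetical helper name, and the C++ pattern is simplified (it ignores backslash-continued lines). Both patterns report into the same three variables, which is what lets a single action serve both cases.

```tcl
proc describeComment {src} {
    # Try the C pattern, then the C++ one; both fill text, style and comment.
    if {[regexp {(/\*)([^*]*)\*/} $src text style comment] ||
        [regexp {(//)([^\n]*)} $src text style comment]} {
        switch -- $style {
            /* {set lang C}
            // {set lang C++}
        }
        return "$lang comment matched: $comment"
    }
    return "no comment found"
}

puts [describeComment "/* hello */"]
puts [describeComment "// world"]
```

Writing the patterns between braces rather than double quotes avoids most of the backslash doubling seen in the lexer above.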

Behavior

There are some behavioral differences between flex and tcLex, but most can be overridden by using specific tcLex switches.

As previously said, flex is line-sensitive whereas tcLex is line-insensitive by default (like Tcl). To enable line-sensitivity, use the -lines flag.

Second, flex uses a longest-preferred match scheme whereas tcLex uses a first-match scheme by default. Longest-preferred match means that every rule is tried and the one giving the longest match is used. First match means that the first matching rule is used. This can lead to strange behavior when converting from flex to tcLex. For example, consider this simple lexer taken from the flex man page:

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+    {
	printf( "An integer: %s (%d)\n", yytext,
			atoi( yytext ) );
}

{DIGIT}+"."{DIGIT}*        {
	printf( "A float: %s (%g)\n", yytext,
			atof( yytext ) );
}

Converting this lexer straightforwardly to tcLex gives incorrect results:

set DIGIT    {[0-9]}
set ID       {[a-z][a-z0-9]*}

lexer lx \
{} ${DIGIT}+           {text} {
	puts "An integer: $text ([expr int($text)])"
} \
{} ${DIGIT}+.${DIGIT}* {text} {
	puts "A float: $text ([expr double($text)])"
}

Since tcLex uses first-match by default, the string "1.5", which would be seen as a float by flex (second rule), will be seen as an integer by tcLex because the first rule is matched before the second. To avoid this behavior, either use the -longest flag or change the order of the rules. Using longest-match can hurt performance when there is a large number of rules, but changing the order of the rules must also be done carefully. What to do depends on the situation: simple lexers (like the one above) are better served by reordering, whereas more complex ones are preferably converted with the -longest flag to avoid obscure bugs. Note that the first-match scheme can also be useful to simplify rule definitions, because one can take their precedence order into account, and can greatly speed up the processing. The first-match scheme was chosen for tcLex because it is the way switch works.
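
The difference between the two schemes can be sketched in plain Tcl. This is not how tcLex is implemented: firstMatch and longestMatch are hypothetical helpers that only emulate the rule selection on a list of name/pattern pairs, each pattern being anchored at the start of the input.

```tcl
set rules {
    integer {^[0-9]+}
    float   {^[0-9]+\.[0-9]*}
}

# First-match: return the name of the first rule whose pattern matches.
proc firstMatch {rules input} {
    foreach {name pattern} $rules {
        if {[regexp -- $pattern $input]} {return $name}
    }
    return ""
}

# Longest-match: try every rule and keep the one with the longest match
# (on a tie, the earlier rule wins).
proc longestMatch {rules input} {
    set best ""
    set bestLen -1
    foreach {name pattern} $rules {
        if {[regexp -- $pattern $input matched] &&
            [string length $matched] > $bestLen} {
            set best $name
            set bestLen [string length $matched]
        }
    }
    return $best
}

puts [firstMatch $rules 1.5]     ;# integer
puts [longestMatch $rules 1.5]   ;# float
```

Swapping the two entries of rules makes firstMatch return float as well, which is the reordering fix described above.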

The last behavioral difference lies in actions. With flex, actions often end with a return statement that returns a specific value to the calling context (typically a token returned to a yacc parser). TcLex lexers have a completely different behavior. Since lexers are seen as a mixture of Tcl switch and proc, a return statement has the same consequence as with these commands: stopping the processing. Thus it isn't possible yet to use tcLex lexers as tokenizers that return a value every time a rule succeeds. Future versions will certainly improve on this point and turn lexers into tokenizers. For now, the only way lexers can return a value during processing is through the incremental processing scheme (see Incremental processing). This choice was made to be consistent with proc and switch.

Other features

Flex provides many functions to access and/or modify the input buffer, such as the C functions yymore(), yyless(), input() and unput(). TcLex provides similar but deliberately limited features in the form of the input and unput lexer subcommands. The reason for these limitations is that the omitted features can be kludgy or can modify the input.

yymore() is used to append some characters to the next matched text. It can lead to very strange results in some cases and is very kludgy by nature. IMHO the fact that the flex man page uses the "mega-kludge" string in an example of yymore() use is not a coincidence :-)
input() is available unchanged in the form of the input subcommand.
yyless() is available in the form of the unput subcommand. unput() is also available through this subcommand, with the restriction that it cannot modify the input buffer, i.e. it cannot unput an arbitrary character. I consider this to be a kludge like yymore() that somewhat violates the Tcl philosophy as a high level language. If someone convinces me of the contrary, I'll consider adding such features in future versions ;-) Anyway I don't think they are fundamental but rather denote quick'n dirty design.

Features chart

	flex	tcLex
use	generates static C source files	dynamic creation of Tcl commands
syntax	specific	Tcl
name definitions	yes, specific syntax	via Tcl substitutions
match scheme	longest-preferred match only	both longest-preferred and first match
line-sensitivity	line-sensitive only	both line-sensitive and -insensitive
regular expressions	similar to Tcl 8.0, plus trailing contexts and character classes	depend on Tcl version
reentrancy	limited	yes
subexpressions reporting	no	yes
tokenization	yes	no
incremental processing	must be specially designed	transparent
multiple input buffers	yes	no

Converting flex to tcLex: Rules of Thumb

If you plan to use tcLex as a Tcl replacement for flex, it is highly probable that you will have to convert a flex lexer to tcLex one day. Here are several rules of thumb to save time and avoid strange errors:

name definitions in flex are automatically enclosed between parentheses before they are expanded, unless they begin with '^' or end with '$', to avoid precedence problems in regexps. You have to be careful whether you need to add parentheses or not when converting to tcLex.
parentheses are only used by flex for precedence in regexps. In tcLex they are also used for reporting. This causes no special problem when directly converting from flex to tcLex, however be careful about parentheses if you later add subexpressions reporting.
be careful about Tcl substitutions and escape rules when converting regexps, else the evil Tcl parser will bite'ya ;-) Take care about backslash characters and braces (see the FAQs below).
flex uses a longest-preferred-match scheme, tcLex uses a first-match scheme by default. When converting, either use the -longest flag (safest but slower), or change the order of the rules (only if you know what you're doing!)
flex uses line-sensitive regexps, tcLex is line-insensitive by default. Just use the -lines flag if you see that the lexer uses special line-sensitive constructs (i.e. the special characters ^$.)
do not use return statements in actions: they cause the processing to stop. If there are return statements in your flex lexer, it probably means that the lexer is a tokenizer (used in conjunction with yacc). TcLex currently doesn't support tokenizer-style processing, so this may need a bit of work to convert.
input() in flex consumes the next character. The input subcommand in tcLex returns one character by default but can take more if requested. A single tcLex input can then be used to replace several subsequent input() calls.
replace yyless(n) by unput $n; it rewinds the input by n chars. Note that, contrary to flex, you cannot rewind past the beginning of the matched string.
unput() puts back a single character in the input stream. If this character is the same as the one that was extracted (by input()), then use unput which by default rewinds the input by one char. Else, you're out of luck :-(
there is no equivalent to yymore(). You can somewhat emulate its behavior by storing the yymore'd string in a variable (empty by default) and prepending it to the matched string in every action. Do this only if there is no alternative.
by default, flex sends matched characters to the output. TcLex has no idea what the output can be (a string, a list, a tree...) so by default it does nothing. You can replace outputs (eg. ECHO statements) with puts calls, and add a default rule that outputs any unmatched character:
```
* .|\n {c} {puts -nonewline $c}
```
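
The first rule of thumb above (parentheses around name definitions) is easy to get wrong, so here is a small sketch of the precedence problem using the plain regexp command. WORD is a made-up definition for the illustration; flex would expand {WORD}+ as (foo|bar)+, whereas a bare Tcl substitution does not add the parentheses.

```tcl
set WORD {foo|bar}

# Without parentheses the pattern is ^foo|bar+ : '+' applies to the 'r' only.
regexp "^${WORD}+" foofoo bare

# With explicit parentheses the pattern is ^(foo|bar)+ as in flex.
regexp "^(${WORD})+" foofoo grouped

puts $bare      ;# foo
puts $grouped   ;# foofoo
```

Keep in mind that in tcLex such added parentheses also count as a reported subexpression when match variables are listed.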

http://www.multimania.com/fbonnet/Tcl/tcLex/index.en.htm (english)
http://www.multimania.com/fbonnet/Tcl/tcLex/index.htm (french)

Mailing list:

Home page: http://www.eGroups.com/list/tclex
To subscribe: tclex-subscribe@egroups.com

Send questions, comments, requests, etc. to the author: Frédéric BONNET <frederic.bonnet@ciril.fr>.
