Scriptics maintains a very good regexp-HOWTO with plenty of useful information on how to use regexps under Tcl 8.1.
TcLex borrows many of its concepts from flex and tries to adapt them to the Tcl philosophy. As a result, there are areas where tcLex and flex are very similar and nearly interchangeable, while other features differ enough to prevent an easy move from one to the other.
The most obvious difference is the syntax: flex uses its own, whereas tcLex uses Tcl's. TcLex borrows as many existing Tcl constructs as possible so that it is easy for Tcl programmers to use while remaining easy for flex programmers to understand. For example, tcLex borrows its syntax from the switch command and most of its procedural model from proc. Flex uses a specific syntax for rule specifiers and C syntax for actions (i.e. rule scripts). In some areas, tcLex has higher-level features than flex.
TcLex uses list syntax for rule specifiers and thus requires a fixed number of elements (four) per rule. Flex is more permissive because some specifiers can be omitted. For example, the condition list defaults to empty in flex, whereas it must be written as an empty list in tcLex. Moreover, tcLex also needs a list of match variables used to report the matched parts of the input string, the same way regexp does. Here is a short example, first in flex and then in tcLex:

    <cond1>"\n" { /* matches newlines within condition cond1 */ }
    "\n"        { /* matches newlines with empty conditions specifier */ }

    {cond1} "\n" {} { # matches newlines within condition cond1 }
    {}      "\n" {} { # matches newlines with empty conditions specifier }

Note the empty condition list in the second tcLex rule, as well as the (here empty) match variables list. Apart from that, flex and tcLex have very similar syntaxes.
The second visible difference between flex and tcLex is that the latter requires an extra list of match variables. This is a convenient way to get the matched parts of the input string, and is inspired by the syntax of the regexp command. Getting the matched string with flex requires the yytext variable, and there is no subexpression reporting (see Feedback below).
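For readers less familiar with the regexp command that inspired the match variables list, here is a minimal sketch using core Tcl only (no tcLex involved). The variable names whole, intPart and fracPart are chosen for illustration.

```tcl
# regexp stores the whole match in the first variable and each
# parenthesized subexpression in the following ones.
set input "3.14"
if {[regexp {([0-9]+)\.([0-9]*)} $input whole intPart fracPart]} {
    puts "whole=$whole int=$intPart frac=$fracPart"
}
```

A tcLex rule lists its variables in the same order: the first one receives the whole match, the others the subexpressions.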
The third syntactic difference is tcLex's lack of name definitions. In flex, these are macro definitions that allow regular expressions to be written more simply. TcLex has no name definitions because Tcl itself can handle them through variable substitution. For example:

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
    %%
    {DIGIT}+"."{DIGIT}*    {
                printf( "A float: %s (%g)\n", yytext, atof( yytext ) );
                }
    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) );
                }

Note the two name definitions DIGIT and ID. They are subsequently used in the regular expressions to simplify their writing and also make them more readable. In tcLex variables must be used:

    set DIGIT {[0-9]}
    set ID {[a-z][a-z0-9]*}
    lexer lx \
        {} ${DIGIT}+\\.${DIGIT}* {text} { puts "A float: $text ([expr double($text)])" } \
        {} ${DIGIT}+ {text} { puts "An integer: $text ([expr int($text)])" }

Note the use of variable substitutions in regexps. To make variable substitutions work, you also have to use the "rules-as-arguments" lexer style (see Lexing With Style): note the backslash at the end of every line. The other noticeable difference is the use of a variable named text to report the matched string, where flex uses yytext. Be careful to respect Tcl's syntax when defining variables: here the name definitions have to be enclosed in braces to avoid bracket substitution. The regexps are the same except for the added '$' for variable substitution and for the quoted "." of the flex version, which becomes an escaped dot (written \\. so that the regexp engine receives \.); an unescaped '.' would match any character. Both lexers should work the same.
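The quoting rules mentioned above can be checked with core Tcl alone. The following sketch (no tcLex involved; the variable name floatRe is made up for illustration) builds a float pattern from a braced DIGIT definition and prints what the regexp engine actually receives.

```tcl
# Braces keep [0-9] literal; without them Tcl would try to run a
# command named 0-9 (bracket substitution).
set DIGIT {[0-9]}

# Inside double quotes, ${DIGIT} is substituted, \\. becomes \.
# (a literal dot for the regexp engine) and \$ becomes a plain $.
set floatRe "^${DIGIT}+\\.${DIGIT}*\$"

puts $floatRe                  ;# prints ^[0-9]+\.[0-9]*$
puts [regexp $floatRe "1.5"]   ;# prints 1
puts [regexp $floatRe "15"]    ;# prints 0 (no literal dot)
```

Printing the assembled pattern like this is a cheap way to debug rules written in the "rules-as-arguments" style.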
Finally, there are subtle differences in regular expression syntax (tcLex uses Tcl's syntax). The first is the absence of trailing contexts in tcLex (flex syntax: r/s, where r is the regular expression and s is the trailing context). This shouldn't be a big problem, except when converting from flex to tcLex. The second is line sensitivity. Flex uses line-sensitive regexps whereas Tcl regexps are line-insensitive by default. Since Tcl 8.0 has no support for line sensitivity and Tcl 8.1 requires a specific syntax, tcLex provides its own portable line sensitivity through the -lines switch. Line sensitivity changes the meaning of the special characters '^', '$' and '.' within regexps. With line-insensitive regexps, '^' and '$' match the beginning and end of the string respectively, and '.' matches any character. With line-sensitive regexps, '^' and '$' match the beginning and end of each line, and '.' matches any character except newline.
To report the matched string, flex uses the special string variable yytext. TcLex uses a quite different method that is closer to the Tcl philosophy (i.e. the regexp command) and also more powerful in terms of feedback. With flex, parentheses are only used to override precedence in regexps. Tcl regexps, however, also use parentheses to report subexpressions. To keep this feature, tcLex needs variables to report matched substrings, the same way regexp does. Thus there is no predefined yytext variable, only per-rule user-defined variables. This allows for elegant constructs when several rules share the same action. Consider the following example, where several syntactically different but conceptually similar constructs have to be matched:
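The effect of line sensitivity on '^', '$' and '.' can be observed without tcLex, using the -line switch that the core regexp command offers from Tcl 8.1 on. This is only a sketch of the semantics; it says nothing about how tcLex implements its own -lines switch.

```tcl
set text "first\nsecond"

# Line-insensitive (Tcl default): ^ and $ anchor to the whole string,
# and . also matches the newline between "first" and "second".
puts [regexp {^second$} $text]        ;# prints 0
puts [regexp {t.s} $text]             ;# prints 1

# Line-sensitive: ^ and $ anchor to each line, . stops at newlines.
puts [regexp -line {^second$} $text]  ;# prints 1
puts [regexp -line {t.s} $text]       ;# prints 0
```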

    lexer lexCComments {
        {} "(/\\*)([^*]*)\\*/" {text style comment} -
        - "(//)(([^\\\\\n]|\\\\.)*)" {text style comment} {
            # C/C++ comments
            switch -- $style {
                /* {set lang C}
                // {set lang C++}
            }
            puts "$lang comment matched: $comment"
        }
    }

The two rules match C and C++ comments. C comments are any characters enclosed between "/*" and "*/". C++ comments are any characters after "//" up to the end of a line that does not end with '\' (note the quoting hell required for '\' to work). Here the same action serves two different rules, which report through the same variables. This greatly simplifies rule writing, since one action can be shared by several similar yet different rules. Note that the order of the variables may differ.
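Since tcLex takes its syntax from switch, the action shared by the two rules above is reminiscent of the fall-through that Tcl's own switch command provides, where a body of - means "use the body of the next pattern". The sketch below uses only core switch -regexp, not tcLex, and the proc name commentStyle is hypothetical.

```tcl
# Classify the start of a string; both comment styles share one body.
proc commentStyle {s} {
    switch -regexp -- $s {
        {^/\*} -
        {^//}  { return comment }
        default { return other }
    }
}

puts [commentStyle "/* hi */"]   ;# prints comment
puts [commentStyle "// hi"]      ;# prints comment
puts [commentStyle "x = 1;"]     ;# prints other
```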
There are some behavioral differences between flex and tcLex, but most can be overridden by using specific tcLex switches.
As previously said, flex is line-sensitive whereas tcLex is line-insensitive by default (like Tcl). To enable line sensitivity, use the -lines flag.
Second, flex uses a longest-preferred match scheme whereas tcLex uses a first-match scheme by default. Longest-preferred match means that every rule is tried and the one with the longest match is used. First match means that the first matching rule is used. This can lead to strange behavior when converting from flex to tcLex. Consider, for example, this simple lexer taken from the flex man page:

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
    %%
    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) );
                }
    {DIGIT}+"."{DIGIT}*    {
                printf( "A float: %s (%g)\n", yytext, atof( yytext ) );
                }

Converting this lexer straightforwardly to tcLex gives incorrect results:

    set DIGIT {[0-9]}
    set ID {[a-z][a-z0-9]*}
    lexer lx \
        {} ${DIGIT}+ {text} { puts "An integer: $text ([expr int($text)])" } \
        {} ${DIGIT}+\\.${DIGIT}* {text} { puts "A float: $text ([expr double($text)])" }

Since tcLex uses first-match by default, the string "1.5", which flex would see as a float (second rule), is seen as an integer by tcLex because the first rule matches before the second is tried. To avoid this behavior, either use the -longest flag or change the order of the rules. Longest-match can hurt performance when there are many rules, but reordering the rules must also be done carefully. What to do depends on the situation: simple lexers (like the one above) are better served by reordering, whereas more complex ones are preferably converted with the -longest flag to avoid obscure bugs. Note that the first-match scheme can also simplify rule definitions, because one can rely on their precedence order, and it can greatly speed up processing. The first-match scheme was chosen for tcLex because it is the way switch works.
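The difference between the two schemes can be reproduced in a few lines of core Tcl. The sketch below does not use tcLex and does not claim to mirror its implementation; the procs firstMatch and longestMatch and the rules list are hypothetical, written only to show why "1.5" is classified differently.

```tcl
# Rules as a flat list of pattern/label pairs, in flex order.
set rules {
    {^[0-9]+}          integer
    {^[0-9]+\.[0-9]*}  float
}

# First-match: stop at the first rule that matches.
proc firstMatch {rules input} {
    foreach {re label} $rules {
        if {[regexp -- $re $input m]} { return [list $label $m] }
    }
    return {}
}

# Longest-match: try every rule and keep the longest match
# (on a tie the earlier rule wins, as in flex).
proc longestMatch {rules input} {
    set best {}
    set bestLen -1
    foreach {re label} $rules {
        if {[regexp -- $re $input m] && [string length $m] > $bestLen} {
            set best [list $label $m]
            set bestLen [string length $m]
        }
    }
    return $best
}

puts [firstMatch $rules 1.5]     ;# prints integer 1
puts [longestMatch $rules 1.5]   ;# prints float 1.5
```

Swapping the two entries in rules makes firstMatch return the float as well, which is the "change the order of the rules" fix described above.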
The last behavioral difference lies in actions. With flex, actions often end with a return statement that returns a specific value to the calling context (typically a token returned to a yacc parser). TcLex lexers have a completely different behavior. Since lexers are seen as a mixture of Tcl switch and proc, a return statement has the same consequence as with these commands: stopping the processing. Thus it isn't possible yet to use tcLex lexers as tokenizers that return a value every time a rule succeeds. Future versions will certainly improve on this point and turn lexers into tokenizers. For now, the only way lexers can return a value during processing is through the incremental processing scheme (see Incremental processing). This choice was made to be consistent with proc and switch.
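The behavior of return that tcLex inherits from proc and switch can be seen with core Tcl alone. In this sketch (the proc firstToken is hypothetical and no tcLex is involved), return inside a switch body ends the whole enclosing proc, so only the first word is ever reported, which is why a flex-style "return a token per rule" action does not carry over.

```tcl
proc firstToken {words} {
    foreach w $words {
        switch -regexp -- $w {
            {^[0-9]+$} { return "integer $w" }
            {^[a-z]+$} { return "word $w" }
        }
    }
    return "nothing matched"
}

# The loop never reaches "ab" or "34": return stopped the processing.
puts [firstToken {12 ab 34}]   ;# prints integer 12
```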
Flex provides many functions to access and/or modify the input buffer, such as the C functions yymore(), yyless(), input() and unput(). TcLex provides similar but deliberately limited features in the form of the input and unput lexer subcommands. The reason for these limitations is that such features can be kludgy or can modify the input.
Feature | flex | tcLex |
---|---|---|
use | generates static C source files | dynamic creation of Tcl commands |
syntax | specific | Tcl |
name definitions | yes, specific syntax | via Tcl substitutions |
match scheme | longest-preferred match only | both longest-preferred and first match |
line sensitivity | line-sensitive only | both line-sensitive and line-insensitive |
regular expressions | similar to Tcl 8.0, plus trailing contexts and character classes | depend on Tcl version |
reentrancy | limited | yes |
subexpression reporting | no | yes |
tokenization | yes | no |
incremental processing | must be specially designed | transparent |
multiple input buffers | yes | no |
If you plan to use tcLex as a Tcl replacement for flex, it is highly probable that you will have to convert a flex lexer to tcLex one day. Here are several rules of thumb to save time and avoid strange errors:
* .|\n {c} {puts -nonewline $c}
Home page:
http://www.multimania.com/fbonnet/Tcl/tcLex/index.en.htm (English)
http://www.multimania.com/fbonnet/Tcl/tcLex/index.htm (French)
Mailing list:
Home page: http://www.eGroups.com/list/tclex
To subscribe: tclex-subscribe@egroups.com