OpenToken

current version: 5.0a

OpenToken may be obtained in several ways:

source, GNU tar bzip2: opentoken-5.0a.tar.bz2
source, zip: opentoken-5.0a.zip
Debian: binary package "libopentoken", source package "opentoken"
monotone: server www.ada-france.org, branch org.opentoken. Maintained by Ludovic Brenta

OpenToken is a facility for performing token analysis and parsing within the Ada language. It is designed to provide all the functionality of a traditional lexical analyzer/parser generator, such as lex/yacc. Thanks to inheritance and runtime polymorphism, however, it is implemented entirely in Ada as withed-in code: no precompilation step is required, and no messy tool-generated source code is created. The tradeoff is that the grammar is generated at runtime rather than ahead of time.

For example, here is an OpenToken LR specification of a typical arithmetic grammar, taken from Example 5.10, section 5.3, page 295 of the red dragon book (note that we have added an explicit EOF (end of file) symbol):

Grammar specification:

L -> E EOF
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> integer

Ada code:

Grammar : constant Production_List.Instance :=
 L <= E & EOF                      + Print_Value'Access and
 E <= E & Plus & T                 + Add_Integers       and
 E <= T                                                 and
 T <= T & Times & F                + Multiply_Integers  and
 T <= F                                                 and
 F <= Left_Paren & E & Right_Paren + Synthesize_Second  and
 F <= Int_Literal;

This grammar is processed at runtime into a state machine data structure, then used to parse input text. It uses dynamic memory allocation at runtime to hold a stack of the tokens and productions.

Here is a specification of an equivalent grammar, rewritten to allow a recursive-descent implementation:

Grammar specification:

L -> E  EOF
E -> T {+ T}
T -> F {* F}
F -> ( E )
F -> integer

Ada code:

E : Operation_List.Handle    := new Operation_List.Instance;
F : Integer_Selection.Handle :=
 (Left_Paren & E & Right_Paren + Build_Parens'Access or Int) + Build_Selection'Access;
T : Operation_List.Handle    := F ** Times * Times_Element'Access + Init_Times'Access;
L : Integer_Sequence.Handle  := E & EOF + Build_Print'Access;

E.all := Operation_List.Get
 (Element     => T,
  Separator   => Plus,
  Initialize  => Init_Plus'Access,
  Add_Element => Plus_Element'Access);

This grammar is used directly at runtime to parse input text. It does not use dynamic memory allocation during parsing (dynamic memory allocation is used while building the grammar in the statements above), but it does use more stack space than the LR version. Both parsers are reasonably efficient.
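
To make the shape of this grammar concrete, here is a minimal hand-written sketch of a recursive-descent evaluator for the same rules. It does not use the OpenToken packages. The names Recursive_Descent_Sketch, Evaluate, Parse_E, Parse_T, Parse_F, Accept_Char and Syntax_Error are hypothetical, chosen only for illustration, and lexing is reduced to single characters and unsigned decimal integers. Each nonterminal becomes one function, so the call stack grows with the nesting depth of the input.

```ada
with Ada.Text_IO;

--  Hand-written recursive descent evaluator for the grammar
--     L -> E EOF,  E -> T {+ T},  T -> F {* F},  F -> ( E ) | integer
--  This is an illustration only; it is NOT the OpenToken API, and every
--  name below is hypothetical.
procedure Recursive_Descent_Sketch is

   Syntax_Error : exception;

   function Evaluate (Input : String) return Integer is
      Pos    : Positive := Input'First;  --  index of the next unread character
      Result : Integer;

      procedure Skip_Blanks is
      begin
         while Pos <= Input'Last and then Input (Pos) = ' ' loop
            Pos := Pos + 1;
         end loop;
      end Skip_Blanks;

      --  Skip blanks; if the next character is C, consume it and return True.
      function Accept_Char (C : Character) return Boolean is
      begin
         Skip_Blanks;
         if Pos <= Input'Last and then Input (Pos) = C then
            Pos := Pos + 1;
            return True;
         end if;
         return False;
      end Accept_Char;

      function Parse_E return Integer;  --  forward declaration: F refers to E

      --  F -> ( E ) | integer
      function Parse_F return Integer is
         Value : Integer := 0;
         Start : Positive;
      begin
         if Accept_Char ('(') then
            Value := Parse_E;
            if not Accept_Char (')') then
               raise Syntax_Error with "expected ')'";
            end if;
            return Value;
         end if;
         --  Accept_Char has already skipped any leading blanks.
         Start := Pos;
         while Pos <= Input'Last and then Input (Pos) in '0' .. '9' loop
            Value := Value * 10
              + (Character'Pos (Input (Pos)) - Character'Pos ('0'));
            Pos := Pos + 1;
         end loop;
         if Pos = Start then
            raise Syntax_Error with "expected integer or '('";
         end if;
         return Value;
      end Parse_F;

      --  T -> F {* F}
      function Parse_T return Integer is
         Value : Integer := Parse_F;
      begin
         while Accept_Char ('*') loop
            Value := Value * Parse_F;
         end loop;
         return Value;
      end Parse_T;

      --  E -> T {+ T}
      function Parse_E return Integer is
         Value : Integer := Parse_T;
      begin
         while Accept_Char ('+') loop
            Value := Value + Parse_T;
         end loop;
         return Value;
      end Parse_E;

   begin
      --  L -> E EOF
      Result := Parse_E;
      Skip_Blanks;
      if Pos <= Input'Last then
         raise Syntax_Error with "expected end of input";
      end if;
      return Result;
   end Evaluate;

begin
   Ada.Text_IO.Put_Line (Integer'Image (Evaluate ("2 + 3 * (4 + 1)")));
   begin
      Ada.Text_IO.Put_Line (Integer'Image (Evaluate ("2 + * 3")));
   exception
      when Syntax_Error =>
         Ada.Text_IO.Put_Line ("syntax error");
   end;
end Recursive_Descent_Sketch;
```

With GNAT, saving this as recursive_descent_sketch.adb and building it with gnatmake should be enough. The first expression evaluates to 17; the second is rejected with a syntax error.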

OpenToken also allows implementing horribly inefficient recursive descent grammars that still work; it supports infinite lookahead with backtracking. This can be useful when just getting started with parsers; first get it working, then optimize it.
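
As a rough illustration of backtracking, the sketch below tries two alternatives that share a prefix; when the first fails partway through, the input position is rewound before the second is tried. This is hand-written code under assumed names (Backtracking_Sketch, Matches, Expect, Sequence, Mark), not OpenToken's own mechanism, and the toy grammar is invented for the example.

```ada
with Ada.Text_IO;

--  Backtracking sketch; NOT the OpenToken API. Hypothetical toy grammar:
--     S -> 'a' 'b' 'c'  |  'a' 'b' 'd'
--  Both alternatives start with "ab", so the parser tries the first and,
--  if it fails, rewinds the input and tries the second.
procedure Backtracking_Sketch is

   function Matches (Input : String) return Boolean is
      Pos : Positive := Input'First;  --  index of the next unread character

      --  Consume C if it is the next character.
      function Expect (C : Character) return Boolean is
      begin
         if Pos <= Input'Last and then Input (Pos) = C then
            Pos := Pos + 1;
            return True;
         end if;
         return False;
      end Expect;

      --  Match every character of Symbols in order, or consume nothing.
      function Sequence (Symbols : String) return Boolean is
         Mark : constant Positive := Pos;  --  remember where we started
      begin
         for I in Symbols'Range loop
            if not Expect (Symbols (I)) then
               Pos := Mark;  --  backtrack: give back the consumed input
               return False;
            end if;
         end loop;
         return True;
      end Sequence;

   begin
      return (Sequence ("abc") or else Sequence ("abd"))
        and then Pos > Input'Last;
   end Matches;

begin
   Ada.Text_IO.Put_Line (Boolean'Image (Matches ("abc")));
   Ada.Text_IO.Put_Line (Boolean'Image (Matches ("abd")));
   Ada.Text_IO.Put_Line (Boolean'Image (Matches ("abx")));
end Backtracking_Sketch;
```

The first two inputs are accepted and the third is rejected. The cost is that the shared prefix "ab" is scanned twice for the second input. That is the kind of inefficiency a later rewrite of the grammar can remove.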

The lexer phase of OpenToken parsers is implemented by classes of recognizers. Each recognizer runs in parallel, and the one matching the longest input lexeme wins. This is not as efficient as traditional lexers, for example those generated by lex, since those use a finite state machine. But it is easier to use, because the recognizers are already written, well documented, and can be independently tested.
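
The longest-match rule can be sketched in a few lines of plain Ada. The code below is an illustration under assumptions, not the OpenToken recognizer classes: Token_ID, Recognizer, the Match_* functions and Run_Length are hypothetical names. The recognizers are tried one after another rather than truly in parallel. Resolving ties in favor of the token declared first is a choice made for this sketch only.

```ada
with Ada.Text_IO;

--  Longest-match lexing sketch. This is NOT the OpenToken API; the token
--  names and recognizer functions are hypothetical.
procedure Longest_Match_Sketch is

   type Token_ID is (Keyword_If, Identifier, Int_Literal, Whitespace);

   --  A recognizer returns the number of characters it matches at the
   --  start of Text (0 means no match).
   type Recognizer is access function (Text : String) return Natural;

   --  Length of the leading run of characters in First .. Last.
   function Run_Length
     (Text : String; First, Last : Character) return Natural
   is
      Count : Natural := 0;
   begin
      for I in Text'Range loop
         exit when Text (I) not in First .. Last;
         Count := Count + 1;
      end loop;
      return Count;
   end Run_Length;

   function Match_If (Text : String) return Natural is
   begin
      if Text'Length >= 2
        and then Text (Text'First .. Text'First + 1) = "if"
      then
         return 2;
      end if;
      return 0;
   end Match_If;

   function Match_Identifier (Text : String) return Natural is
   begin
      return Run_Length (Text, 'a', 'z');
   end Match_Identifier;

   function Match_Integer (Text : String) return Natural is
   begin
      return Run_Length (Text, '0', '9');
   end Match_Integer;

   function Match_Whitespace (Text : String) return Natural is
   begin
      return Run_Length (Text, ' ', ' ');
   end Match_Whitespace;

   Recognizers : constant array (Token_ID) of Recognizer :=
     (Keyword_If  => Match_If'Access,
      Identifier  => Match_Identifier'Access,
      Int_Literal => Match_Integer'Access,
      Whitespace  => Match_Whitespace'Access);

   Input : constant String := "if iffy 42";
   Pos   : Positive := Input'First;

begin
   while Pos <= Input'Last loop
      declare
         Best_ID  : Token_ID := Token_ID'First;
         Best_Len : Natural  := 0;
         Len      : Natural;
      begin
         --  Offer the same remaining input to every recognizer and keep
         --  the longest match; on a tie the earlier Token_ID is kept.
         for ID in Token_ID loop
            Len := Recognizers (ID).all (Input (Pos .. Input'Last));
            if Len > Best_Len then
               Best_ID  := ID;
               Best_Len := Len;
            end if;
         end loop;

         exit when Best_Len = 0;  --  nothing matched; stop lexing

         if Best_ID /= Whitespace then
            Ada.Text_IO.Put_Line
              (Token_ID'Image (Best_ID) & " """
               & Input (Pos .. Pos + Best_Len - 1) & """");
         end if;
         Pos := Pos + Best_Len;
      end;
   end loop;
end Longest_Match_Sketch;
```

For the input "if iffy 42" this reports "if" as a keyword, "iffy" as an identifier (the identifier recognizer matches four characters, the keyword recognizer only two), and "42" as an integer literal.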

Some reusable parse tokens are also included in OpenToken, but they tend to be domain-specific.

Ada's type safety features make misbehaving lexers and parsers easier to debug. In addition, OpenToken includes a trace feature that outputs useful information while the grammar is being compiled and while user input is being parsed, which also aids debugging.

OpenToken is distributed under GPL version 3, with the GNAT modification that allows use in non-GPL projects.

OpenToken was originally written by Ted Dennison. Contributions were made by Christoph Grein and Stephen Leake.

OpenToken is currently maintained by Stephen Leake. Submit bugs directly to Stephen. Discussion about OpenToken is on the newsgroup comp.lang.ada.

OpenToken has been tested on the following operating systems/compilers:

Debian 5.0, GNAT 3.4.2
Debian 6.0, GNAT 4.4.3
Windows XP 32 bit, GNAT 6.2.1
Windows XP 32 bit, GNAT GPL-2009
Windows 7 64 bit, GNAT 7.1 preview

The simplest way to use the source distribution is to install it in the standard GNAT directories, where the compiler will find it by default. This assumes you have write permission in the GNAT directory tree. Then just put 'with "OpenToken";' in your project file. To install it on Windows, using a DOS shell, and assuming GNAT and 'make' are in your PATH:

cd opentoken/Build/windows_release
make -f Makefile.install install

On GNU/Linux, use 'linux_release' instead of 'windows_release'.

See developer instructions if you want to run the tests or work on OpenToken, and for hints in case the above does not work.

Ted's original OpenToken web page is no longer being maintained. The AdaPower mailing list and discussion board are no longer used.

There is a User's Guide that describes how to create a simple application using OpenToken. There are also several examples in the source; see the Examples directory. The Test directory also has useful examples, although oriented towards testing OpenToken itself.

History

Version 5.0a 9 Feb 2014

Fix Debian bug 703361; OpenToken.Token.Enumerated.Analyzer.Column function is off by 1 for all lines other than the first.
Improve library installation on Windows, Linux, Mac
API changes
- OpenToken.Production.Parser.LALR now takes a generic parameter First_State_Index
- OpenToken.Production.Parser.LALR.Generate now takes additional output control parameters
- OpenToken.Production.Parser.LALR: numerous bugs fixed, mostly around handling empty productions. Support generating parser tables for a generalized LALR parser.
- OpenToken.Recognizer.Bracketed_Comment now has a Value function that returns the comment text.
- OpenToken.Recognizer.Identifier.Get no longer has defaults for Start_Chars, Body_Chars
- OpenToken.Text_Feeder.Text_IO.Create now takes an argument
- OpenToken.Token.Analyzer takes more generic parameters, and imposes an order on Token_ID.
- OpenToken.Token.Enumerated takes more generic parameters
- OpenToken.Token.Enumerated.Input_Feeder deleted; use Set_Text_Feeder
- OpenToken.Token.Enumerated.Token_ID type: Non-reporting tokens must appear first

Version 4.0b 29 Jun 2010

Packaging change to work around Debian packaging mixup; also changed from gzip to bzip2.
Improved this web page.

Version 4.0a 7 Feb 2010

Lookahead and backtracking are now actually supported in recursive descent parsers. This required several changes:
- Could_Parse_To is gone; the same purpose is served by Parse (Actively => False).
- Raise_Parse_Error is gone; the message composed by the provided packages is already excellent.
- Token types derived from Enumerated.Instance that need to store the Lexeme and/or Recognizer passed in Create need to override the new procedure Copy. This is because Create is called only by Analyzer.Find_Next, not by Enumerated.Parse; it used to be called by both.
- Enumerated.Create profile is changed; ID is not needed, since New_Token is in out. When you fix Create, remember to consider adding Copy.
- Token.Sequence tokens each store a lookahead count, with a global default.
- User defined token types that do backtracking must call Analyzer.Mark_Push_Back and Analyzer.Push_Back to manage the lookahead queue; see Token.Selection for an example.
Fixed major bug in LALR parser generator related to which production gets the accept action. This bug made many small grammars unworkable; now they all work.
Other enhancements
- Syntax errors reported by LR and recursive descent parsers include the list of expected tokens. This requires the Match argument to recursive descent Parse procedures to be of mode 'access' instead of 'in out'.
- There is a dispatching Name function to allow tokens to identify themselves in the list of expected tokens.
- The examples have been improved to more clearly demonstrate the differences and similarities between LR parsing and recursive descent parsing with OpenToken.
- The OpenToken.Token.List_Mixin, .Sequence_Mixin, .Selection_Mixin packages have been enhanced to support backtracking; the non-mixin versions are instantiations of them.
- The OpenToken.Token.List_Mixin, .Sequence_Mixin, .Selection_Mixin now specify actions via procedure pointers at run-time, rather than via overloaded procedures. This significantly simplifies specifying recursive descent grammars.
- There are new examples of recursive descent parsing, showing that naive grammars can work, if inefficiently.
- OpenToken.Token.Enumerated.Integer_Literal, .Real_Literal, and .String_Literal are renamed to .Integer, .Real, and .String, and the Value component made publicly visible. This allows them to be used as valued tokens in recursive descent parsers.
- OpenToken.Production.Parser.LALR.Set_Trace is gone; use OpenToken.Trace_Parse instead.
- Language_Lexers.HTML_Lexer supports the <pre> tag; the contents are treated as a comment.

Version 3.1 August 4, 2009

maintainer transitioned to Stephen Leake
bugs fixed
- opentoken-production-parser-lalr.adb Add_Action
  Don't put duplicate action in twice. Tested in association_token_test.
- opentoken-production-parser-lrk_item.adb Closure
  Delete premature loop exit. Tested in name_token_test.
- opentoken-production-parser-lrk_item.adb Goto_Transitions
  Handle duplicate goto set. Tested in test_lr0_kernels.
- opentoken-recognizer-integer.adb Analyze
  Set verdict when Allow_Signs. Tested in recognizer_integer_test. Debian bug 536359.
- opentoken-recognizer-string.adb
  Set C_Style_Escape_Code_Map properly. Tested in string_test-run.adb.
- Language_Lexers/ada_lexer.ads, .adb
  Handle Character'('x'). Debian bug 498945.
significant changes
- Add trace feature; shows sequence of states followed by the parser. Very useful for debugging a grammar.
- Generate reports shift/reduce conflicts.
- Parse uses Gnu syntax to report syntax errors.
- Support line, column for error messages
- Allow setting the parser's text feeder.
- Add a new version of HTML_Lexer that is task safe.
- HTML_Lexer allows '"' and '>' in paragraphs; not clear why they were disallowed. It also allows more characters in tags.
- New procedure opentoken-token-enumerated-analyzer.ads Discard_Buffered_Text
  For calling parser in a loop while handling exceptions.
- New recognizer opentoken-recognizer-based_integer.adb. Recognizes Ada integer literal with optional base. See recognizer_based_integer_test.adb for examples.
- New token opentoken-token-enumerated-identifier.adb. Tested in test_token_identifier_real_string. Useful for user symbol names.
- New tokens opentoken-token-enumerated-real_literal.adb, opentoken-token-enumerated-string_literal.adb. Tested in test_token_identifier_real_string
- Makefiles build GNAT libraries, and install in GNAT tree.
New tests
- association_token_test
- name_token_test
- based_integer_test
- recognizer_integer_test
- string_token_test
- test_html_lexer (there is now a known good result for one input file)
- test_lr0_kernels
- test_token_identifier_real_string
Other changes
- change to unix line ending
- update to GPL 3
- delete all CVS logs
- Enforce style via -gnaty3abcefhiklM120nprtx, and -gnatyO ('overriding') in gnat gpl-2009, 6.2.1. This enforces capitalization of OpenToken, ID, HTML, etc.
- Ada 2005 syntax for some 'raise exception', 'is null'
- Add GNAT pragma Unreferenced where needed
- Merge all Makefiles into one, in a separate Build directory
- In opentoken-production-parser-lrk_item.adb Generate and opentoken-production-parser-lalr.adb LR0_Kernels; optimize computation of First_Derivations (was done twice).
- print Ada.Tags.Expanded_Name in productions
- generally improve readability of dump
- only one Parse_Error exception declared
- In OpenToken.Recognizer.Graphic_Character; add Redefine
- delete gnat 3.13 workarounds
- delete commented out code

Version 3.0b 13 August 2000

by Ted Dennison

This version introduces recursive descent parsing. This has the following advantages over table-driven parsers:

It's simpler to implement.
It provides many more opportunities for reuse.
Its parsers are debuggable.
There's no expensive parser-generation phase.

The disadvantages are:

Its parsers are most likely a bit slower.

A general list of the changes is below:

Renamed OpenToken.Token tree to OpenToken.Token.Enumerated.
Created a new (non-enumerated) base token type and base analyzer type in OpenToken.Token.
Made a Parse routine and a Could_Parse_To routine primitives of the base token type.
Created the following predefined nonterminal tokens (both as straight types, and as mixins).
- List
- Selection
- Sequence
Fixed a bug in the bracketed comment recognizer.
Implemented a (hopefully temporary) work-around for a bug in Gnat version 3.13p.
Fixed a bug in the string recognizer where it was mishandling octal and hex escape sequences.
Changed the analyzer and the text feeders to support analyzing binary files.
The HTML lexer has been improved to be a bit faster and more flexible.

Version 2.0 27 January 2000

This is the first version to include parsing capability. The existing packages underwent a major reorganization to accommodate the new functionality. As some of the restructuring that was done is incompatible with old code, the major revision has been bumped up to 2. A partial list of changes is below:

Renamed the top level of the hierarchy from Token to OpenToken.
Moved the analyzer underneath the new OpenToken.Token hierarchy.
Renamed the Token recognizers from Token.* to OpenToken.Recognizer.*
Changed the text feeder procedure pointer into a text feeder object. This allows full re-entrancy in analyzers, which the global text feeders previously thwarted.
Updated the SLOC counter to read a list of files to process from a file. It also handles files with errors in them a bit better.
Added LALR(1) parsing capability and numerous packages to support it. A structure is in place to build other parsers as well.
Created a package hierarchy to support parse tokens. The word "Token" in OpenToken now refers to objects of this type, rather than to token recognizers.
An HTML lexer has been added to the language lexers.
.Recognizer.Bracketed_Comment now works properly with single-character terminators.

Version 1.3.6

This version fixes a rare bug in the Ada-style based numeric recognizers. The SLOC counter can now successfully count all the source files in Gnat's adainclude directory.

Version 1.3.5

This version adds a simple Ada SLOC counting program to the examples. A bug in the Real token recognizer that caused constraint errors has been fixed. Bugs that caused constraint errors in the Ada-style based integer and real recognizers on long non-based numbers have also been fixed.

Version 1.3

This version adds the default token capability to the Analyzer package. This allows a more flexible (if somewhat inefficient) means of error handling to the analyzer. The default token can be used as an error token, or it can be made into a non-reportable token to ignore unknown elements entirely.

Identifier tokens were generalized a bit to allow user-defined character sets for the first and subsequent characters. This not only gives it the ability to handle syntaxes that don't exactly match Ada's, but it allows one to define identifiers for languages that aren't Latin-1 based. Also, the ability to turn off non-repeatable underscores was added.

Integer and Real tokens had an option added to support signed literals. This option is set on by default (which causes a minor backward incompatibility). Syntaxes that have addition or subtraction operators will need to turn this option off.

A test to verify proper handling of default parameters was added to the Test directory. A makefile was also added to the same directory to facilitate automatic compiling and running of the tests. This makefile will not work in a non-Gnat/NT environment without some modification.

New recognizers were added for enclosed comments (e.g. C's /* */ comments) and single character escape sequences. Also a "null" recognizer was added for use as a default token.

Version 1.2.1

This version adds the CSV field token recognizer that was inadvertently left out of 1.2. This recognizer was designed to match fields in comma-separated value (CSV) files, which is a somewhat standard file format for databases and spreadsheets. Also, the extraneous CVS directories in the zip version of the distribution were removed.

Version 1.2

The long-awaited string recognizer has been added. It is capable of recognizing both C and Ada-style strings. In addition, there are a great many submissions by Christoph Grein in this release. He contributed mostly complete lexical analyzers for both Java and Ada, along with all the extra token recognizers he needed to accomplish this feat. He didn't need as many extra recognizers as I would have thought he'd need. But even so, slightly less than 1/2 of the recognizers in this release were contributed by Chris (with a broken arm, no less!)

Version 1.1

The main code change to this version is a default text feeder function that has been added to the analyzer. It reads its input from Ada.Text_IO.Current_Input, so you can change the file to whatever you want fairly easily. The capability to create and use your own feeder function still exists, but it should not be necessary in most cases. If you already have code that does this, it should still compile and work properly.

The other addition is the first version of the OpenToken user's guide. All it contains right now is a user manual walking through the steps needed to make a simple token analyzer. Feedback and/or ideas on this are welcome.

Version 1.0

This is the very first publicly released version. This package is based on work I (Ted Dennison) did while working on the JPATS trainer for FlightSafety International. The germ of this idea came while I was trying to port a fairly ambitious, but fatally buggy Ada 83 token recognition package written for a previous simulator. But once I was done, I was rather surprised at the flexibility of the final product. Seeing the possible benefit to the community, and to the company through user-submitted enhancement and debugging, I suggested that this code be released as Open Source. They were open-minded enough to agree. Bravo!

Author: Stephen Leake. Last modified: Tue Jan 01 23:39:35 EST 2013