GNOME Speech aims to be a general interface to various text-to-speech engines for the GNOME desktop. It allows the simple speaking of text, as well as control over various speech parameters such as speech pitch, rate, and volume. It uses ORBit2 and Bonobo to facilitate the location and activation of, and communication with, the various speech drivers.
There are many different text-to-speech hardware and software products currently available. Some text-to-speech synthesizers are software libraries to which an application links and whose functions it calls to produce speech. Some text-to-speech engines are hardware devices, to which commands and text to be spoken are sent via a serial, USB, or parallel port. Still others are applications to which text and commands can be piped. In addition, although there are standard markup languages that specify how commands to change speech parameters can be embedded within text, not all engines support the same languages, and some don't support any markup language at all.
It is for these reasons that a standard API for communicating with various text-to-speech engines is needed. This is where GNOME Speech becomes useful. It hides the differences in the implementation, API, and markups used by the various engines by defining an API that accommodates all the standard features of most speech engines, and some of the more obscure features supported by some engines. GNOME Speech driver implementations proxy the standard API, which is defined in IDL, to the various commands and markup language of a particular engine.
This drastically reduces the development time required for applications that want to produce speech with a wide variety of engines. The application developer no longer needs to focus on the internals of individual speech engines, but can instead focus on the core purpose of the application and interface with multiple engines through the single GNOME Speech API. Other operating systems, including Microsoft Windows and Mac OS, provide a speech API that in many cases supports both text to speech and voice recognition. GNOME Speech aims to eventually provide a similar speech API for the GNOME desktop. The initial version of GNOME Speech supports only text to speech, but work is currently underway to define a new GNOME Speech API that will support both text to speech and voice recognition (see section 5).
GNOME Speech was originally designed as part of the requirements of the Gnopernicus project, which aims to provide a full-featured screen reader for GNOME. This project, which is under the general umbrella of the GNOME Accessibility Project, provides speech and Braille feedback about current applications and windows on the screen to blind and low-vision users. GNOME Speech could also be used in any number of other accessibility-related contexts, including assistive technologies which highlight and speak on-screen text for users with learning disabilities, and augmentative communication aids.
Source code for GNOME Speech drivers supporting the following engines is currently provided in CVS:
| Engine Name | Platforms Supported | Comments |
|---|---|---|
| eSpeak | Linux (other platforms?) | |
| Festival | Linux/Solaris | |
| FreeTTS | Linux/Solaris | Requires at least J2SDK 1.4.1 and java-access-bridge to build the driver |
| Speech Dispatcher | Linux | |
| IBM ViaVoice TTS | Linux only | No longer available on the web |
| Eloquence | Linux/Solaris | |
| DECTalk Software | Linux only | $50 download from Fonix |
| Cepstral | Linux/Solaris | $29 download from Cepstral |
This paper assumes at least a minimal understanding of the GLib object system, Bonobo, Bonobo-activation, and ORBit2. A list of useful resources for learning about these technologies and their applications follows:
GNOME Speech has the following design requirements:
For these reasons, the combination of Bonobo and Bonobo-activation was chosen as the IPC and object framework for GNOME Speech.
GNOME Speech drivers are standard Bonobo servers, so the standard Bonobo-activation calls are used to query for information about currently installed GNOME Speech drivers. Querying for support of the interface named GNOME_Speech_SynthesisDriver will return the list of all GNOME Speech drivers which are installed on the system. An application can also query for the interface named GNOME_Speech_SpeechCallback to get a list of GNOME Speech drivers which are capable of providing speech callback information.
Some things to consider before implementing a GNOME Speech driver:
At a minimum, a GNOME Speech driver must support two interfaces, the SynthesisDriver and Speaker interfaces.
The SynthesisDriver interface provides basic information about the text-to-speech engine and the GNOME speech driver, and allows creation of Speaker objects (instances of the text-to-speech engine). The interface is defined as follows:
```
interface SynthesisDriver : Bonobo::Unknown {
        readonly attribute string driverName;
        readonly attribute string synthesizerName;
        readonly attribute string driverVersion;
        readonly attribute string synthesizerVersion;

        boolean driverInit ();
        boolean isInitialized ();
        VoiceInfoList getVoices (in VoiceInfo voice_spec);
        VoiceInfoList getAllVoices ();
        Speaker createSpeaker (in VoiceInfo voice_spec);
};
```
The VoiceInfo structure allows a client to specify information about a voice, such as its name, language, or gender. The client can then perform queries of the driver to determine what voices it supports by filling in members of the VoiceInfo structure. The getVoices function should return all voices supported by the driver which meet all the requirements specified in the VoiceInfo structure passed to it. The getAllVoices function should return the VoiceInfo structures for all voices supported by the driver.
The createSpeaker function should return a Speaker object. This object is created using the first voice that meets the requirements specified in the provided VoiceInfo structure.
A GNOME Speech driver's implementation of the Speaker interface is the part of the driver which actually controls the text-to-speech engine. The interface is defined as follows:
```
interface Speaker : Bonobo::Unknown {
        ParameterList getSupportedParameters ();
        string getParameterValueDescription (in string name, in double value);
        double getParameterValue (in string name);
        boolean setParameterValue (in string name, in double value);
        long say (in string text);
        boolean stop ();
        boolean isSpeaking ();
        void wait ();
        boolean registerSpeechCallback (in SpeechCallback callback);
};
```
A ParameterList is a sequence of Parameter structures. The Parameter structure is defined as follows:
```
struct Parameter {
        string name;
        double min;
        double current;
        double max;
        boolean enumerated;
};
```
Every parameter has a unique name, and a minimum, current, and maximum value. These basic parameters allow for setting parameters with numeric values such as speaking rate in words per minute, or the baseline pitch of the voice in Hz. The getParameterValue function returns the current value of the parameter whose name is specified, and the setParameterValue function sets the current value of the parameter whose name is specified. (Note that if the new value is out of range, setParameterValue should return FALSE.)
GNOME Speech also defines a mechanism for describing parameters which are not necessarily numeric. The getParameterValue and setParameterValue functions are still used to get and set the values of these enumerated parameters. However, the getParameterValueDescription function can be used to retrieve a text description of the various values within the parameter's range.
While standard names for parameters are not strictly enforced, some recommendations are listed here:
| Parameter Name | Description |
|---|---|
| rate | Speaking rate in words per minute |
| pitch | Baseline speaking pitch in Hz |
| volume | Speaking volume (recommended range is 0 - 100) |
The say function causes the driver to speak the specified text. The driver should return a unique long identifying the particular string, to be used for future reference when handling speech callbacks. The driver should return immediately, not wait until speech has finished.
The stop function stops speech immediately and flushes anything in the text-to-speech engine's queue. The isSpeaking function returns true if the engine is currently speaking and false if not. The wait method returns only after any current speech has finished.
The SpeechCallback interface is actually not implemented by the GNOME Speech driver, but rather by the GNOME Speech client. This is the interface that GNOME Speech drivers use to communicate information about speech progress to their clients. The SpeechCallback interface defines only one function, notify, which takes the key identifying the string, the type of the callback, and possibly a text offset. GNOME Speech defines three types of callbacks, speech started, speech finished, and index. If a callback of type index is received, the key identifies the particular string being spoken, and the offset indicates the offset of the last character that has been spoken.
Support for speech callbacks can be the most difficult part of a GNOME Speech driver to implement. The following are some suggestions to make providing speech callbacks easier.
If the engine for which the driver is written does not support speech callbacks, the driver implementer should at least do the following:
To provide support for callbacks, a driver's implementation of the Speaker interface must provide at least the following:
An application wanting to produce speech using GNOME Speech should first obtain a list of GNOME Speech drivers which are installed on the system. If no callbacks are desired, then the application need only request a list of Bonobo servers implementing the GNOME_Speech_SynthesisDriver interface. If callbacks are required, then the application should request a list of Bonobo servers that implement GNOME_Speech_SpeechCallback. Bonobo-activation is used to obtain this list.
Once the application has a list of available speech drivers, it uses Bonobo-activation to activate one of them. The object that is returned by the bonobo_activation_activate call is an object which implements the GNOME_Speech_SynthesisDriver interface.
Before calling any functions on the object, the application should call the driverInit function. This function returns true if the driver was successfully initialized, false otherwise. If the driverInit function returns false, then the application should not attempt to call any other functions on the object.
The application can call the getDriverName, getDriverVersion, getSynthesizerName, and getSynthesizerVersion functions to determine the name and version of the GNOME Speech driver and the underlying text-to-speech engine.
The application can call createSpeaker, which creates and returns an object implementing the GNOME_Speech_Speaker interface. This interface can be used to speak text and set various speech characteristics such as speaking rate and pitch.
In order for an application to receive notifications about speech progress and status, it must contain an object that implements the GNOME_Speech_SpeechCallback interface. Once a speaker is created, the application should register its callback object with the speaker using the registerSpeechCallback function.
Work is underway to totally rewrite the GNOME Speech API in preparation for a GNOME Speech 1.0 release. The major improvements planned for 1.0 include:
Work is also underway in prototyping a system based on D-Bus rather than Bonobo. Under this system, D-Bus would replace Bonobo as the underlying IPC mechanism. This would better facilitate interoperability with KDE.