WFST - Weighted Finite State Transducers Applied to Spoken Language Processing

Team

Project Leader: Diamantino Caseiro

Summary

Finite-state models (FSMs) and, in particular, weighted finite-state transducers (WFSTs) have proven quite successful in many fields of written and spoken language processing, notably machine translation, large vocabulary continuous speech recognition, and speech synthesis. An interesting feature of FSMs is that they can be automatically built or "learned" from training data using corpus-based techniques. Compared to more traditional knowledge-based approaches, these techniques are attractive for their potential for much lower development costs. Another interesting property of FSMs is their suitability for implementing or approximating knowledge-based techniques. Different knowledge sources can hence be represented as FSMs, allowing a priori knowledge to be integrated with inductive techniques in a natural and formally elegant way. This makes the FSM framework an adequate one for language processing.
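
To make the integration idea concrete, the following minimal sketch (illustrative only, not code from the project) shows a weighted transducer over the tropical semiring, where weights are negative log probabilities and composition adds them, together with epsilon-free composition, the operation that combines two knowledge sources into a single machine:

```python
from collections import defaultdict

class WFST:
    """Minimal weighted transducer over the tropical semiring.
    Arcs: state -> list of (input, output, weight, next_state)."""
    def __init__(self, start, finals):
        self.start = start
        self.finals = finals            # {state: final_weight}
        self.arcs = defaultdict(list)

    def add_arc(self, src, ilabel, olabel, weight, dst):
        self.arcs[src].append((ilabel, olabel, weight, dst))

def compose(a, b):
    """Epsilon-free composition: pair states of a and b, matching
    a's output labels against b's input labels and adding weights."""
    c = WFST((a.start, b.start), {})
    stack, seen = [c.start], {c.start}
    while stack:
        s1, s2 = stack.pop()
        if s1 in a.finals and s2 in b.finals:
            c.finals[(s1, s2)] = a.finals[s1] + b.finals[s2]
        for i1, o1, w1, d1 in a.arcs[s1]:
            for i2, o2, w2, d2 in b.arcs[s2]:
                if o1 == i2:                      # labels must match
                    c.add_arc((s1, s2), i1, o2, w1 + w2, (d1, d2))
                    if (d1, d2) not in seen:
                        seen.add((d1, d2))
                        stack.append((d1, d2))
    return c

# Toy example: a one-arc lexicon mapping letter "a" to phone "ax",
# scored by a one-arc unigram model over phones.
lex = WFST(0, {1: 0.0}); lex.add_arc(0, "a", "ax", 0.5, 1)
lm = WFST(0, {1: 0.0}); lm.add_arc(0, "ax", "ax", 1.2, 1)
both = compose(lex, lm)   # maps "a" -> "ax" with weight 0.5 + 1.2
```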

The main goal of this project is the application of this framework to speech recognition and synthesis, which will constitute the themes of the project's two major tasks.

The group has already acquired some experience in modeling the various components of recognition systems using WFSTs, having developed specialized transducer composition algorithms for on-the-fly integration of the lexicon and the language model. Preliminary experiments with the explicit integration of phonological rules have also produced encouraging results, but much remains to be done to achieve higher recognition rates, especially for spontaneous speech, in terms of both pronunciation modeling and language modeling.
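
As a simplified illustration of the on-the-fly idea (a toy sketch; the project's specialized composition algorithms additionally have to deal with epsilon transitions and composition filters), the arcs of the composed machine can be generated lazily, only when the decoder expands a state, instead of materializing the full composed transducer in advance:

```python
# Each component transducer is represented as a plain dict:
# state -> list of (input, output, weight, next_state).

def lazy_compose_arcs(a_arcs, b_arcs):
    """Return a function that yields the outgoing arcs of a composed
    state (s1, s2) on demand, matching output against input labels."""
    def arcs(state):
        s1, s2 = state
        for i1, o1, w1, d1 in a_arcs.get(s1, []):
            for i2, o2, w2, d2 in b_arcs.get(s2, []):
                if o1 == i2:
                    yield (i1, o2, w1 + w2, (d1, d2))
    return arcs

# Toy lexicon (letters -> phones) and unigram "language model".
lex = {0: [("a", "ax", 0.5, 1)]}
lm = {0: [("ax", "ax", 1.2, 1)]}
arcs = lazy_compose_arcs(lex, lm)
print(list(arcs((0, 0))))   # [('a', 'ax', 1.7, (1, 1))], built on demand
```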

The group's experience with the application of WFSTs to the different modules of text-to-speech (TTS) synthesis is much more recent. We are currently exploring their potential, namely for grapheme-to-phone conversion and for variable-duration segment selection in concatenative synthesis. We plan to continue this preliminary work and extend it to other modules.
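
To give a flavor of what a grapheme-to-phone rule looks like in finite-state terms, the toy sketch below simulates the two-state transducer that a single context-dependent rule compiles to ('c' is read as /s/ before 'e' or 'i' and as /k/ otherwise); the rule, alphabet, and words are hypothetical simplifications, not the project's actual rule set:

```python
def g2p(word):
    """Simulate a two-state transducer for one context-dependent rule:
    the `pending_c` flag plays the role of the transducer state that
    defers the decision for 'c' until its right context is seen."""
    ident = {"a": "a", "e": "e", "i": "i", "o": "o", "s": "s"}
    phones, pending_c = [], False
    for ch in word:
        if pending_c:
            phones.append("s" if ch in ("e", "i") else "k")
            pending_c = False
        if ch == "c":
            pending_c = True           # decision deferred to next symbol
        elif ch in ident:
            phones.append(ident[ch])
    if pending_c:
        phones.append("k")             # word-final 'c' defaults to /k/
    return phones

print(g2p("cesa"))   # ['s', 'e', 's', 'a']
print(g2p("caso"))   # ['k', 'a', 's', 'o']
```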

A third task will deal with more exploratory themes in which the group currently has no experience, with the goal of integrating different knowledge sources. Potential themes are the use of transducers for topic indexing, or for integrating translation models to aid the speech recognition of documents originally written in one language and dictated by a human translator in another.

The long-term goal of this project is a unified framework for spoken language processing that would encompass not only speech recognition and synthesis but also speech understanding and speech generation from concepts, and would be able to include all the components of a speech-to-speech translation system. This goal is admittedly very ambitious, but we believe that significant progress towards it can be made by learning how to represent and integrate new knowledge sources.

Demos

A demonstration of tightly integrated speech-to-text translation is available. The translation module is implemented as a single WFST that is used as the language model in the speech recognizer. This architecture produces sentences in the target language directly from source language speech.
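
Schematically (a toy sketch with made-up symbols and weights, not the demo's actual models), the tight integration amounts to composing the recognizer's pronunciation lexicon with a weighted translation transducer, so that the best path through the composed machine already carries target-language words:

```python
# Toy data: a lexicon arc (phones -> source word) and weighted
# translation arcs (source word -> target words, weights as -log probs).
lexicon = {("k", "a", "z", "a"): "casa"}
translation = {"casa": [("house", 0.4), ("home", 1.1)]}

def decode(phones):
    """Follow the lexicon arc, then the translation arcs, keeping the
    lowest-cost target word (min, as in the tropical semiring)."""
    word = lexicon[tuple(phones)]
    return min(translation[word], key=lambda wt: wt[1])[0]

print(decode(["k", "a", "z", "a"]))   # -> 'house'
```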

A demonstration of large vocabulary translation is also available. The output of the WFST-based speech recognition module is translated by a WFST-based machine translation module trained on the European Parliament domain.

Recent demos:

Broadcast News translation from Portuguese to Spanish and English

Broadcast News translation from South American Spanish to Portuguese