Speech Recognition Applied to Telecommunications (1997-2000)
- T1 - Recognition of spelled names
- T2 - Word and topic spotting
- T3 - Robust recognition of digits and natural numbers
- T4 - Large vocabulary speech recognition
- T5 - Speaker recognition
- T6 - Automatic spoken language identification
- Start: April 1997
- Duration: 3 years
INESC (Instituto de Engenharia de Sistemas e Computadores, Lisbon) - Speech Processing Group
IT (Instituto de Telecomunicações, Coimbra) - Signal and Image Processing Group
- Luís Vieira de Sá
- Fernando Perdigão
- Eduardo Sá Marta
- Undergraduate students:
- Rui António Rodrigues Francisco
- Mário José Santiago Batista
- José Miguel Donário Veríssimo
- Raquel Maria Tavares Marques
- Elisabete Cordeiro
- João Duque
- Jorge Rato
FEUP (Faculdade de Engenharia da Universidade do Porto) - Speech Recognition Laboratory
- Carlos Espain de Oliveira
- Vítor Pera
- Luís Filipe Moreira
- Undergraduate students:
- Cláudia Alexandra Bartolo Araújo
- Jonatas Miguel Suzano Afonso
- José Fernando Sacramento Dias
- Mário Jorge Miranda
- Nuno Miguel Martins Domingos
- Rui André de Carvalho
Although on an informal basis, the project has also profited from useful exchanges with researchers from the University of Aveiro (INESC): Francisco Vaz and António Teixeira.
The goal of this project is to gather expertise at a national level in the area of speech recognition, particularly regarding its applications in the telecommunications domain.
Although research on speech recognition for European Portuguese has reached an international level through participation in worldwide conferences and European projects, the effort invested in it is far from comparable with that invested in languages of greater technological impact (namely English, but also French, German and Japanese). Given the wide gap that must be bridged to achieve comparable levels and the relatively small size of the few national research teams, it is of the utmost importance to combine efforts in a common project, sharing corpora, software tools, methodologies, prototypes, etc.
We have selected 7 priority topics in this area, which correspond to 6 different work-packages:
- Spectral pre-processing based on auditory models
- Speaker independent phoneme recognition
- Word spotting
- Automatic segmentation and labelling
- Large vocabulary speech recognition
- Speaker recognition
- Automatic spoken language identification
By exploring these 7 domains, the project aims to lay the groundwork for building speech recognition systems that are robust enough to be used in many telecommunications applications. Although the time scale of the project does not allow the implementation of all such application demonstrators, an open workshop is scheduled for the end of the project to provide a forum for broader technical discussion and for disseminating results among operators and service providers.
The evaluators of the above proposal have made several recommendations: to eliminate the task of speaker recognition and to significantly reduce the effort in cochlear models and speaker-independent phoneme recognition. They also criticized the proposed work in the area of keyword spotting (not fully updated) and recommended the coordination of efforts in segmentation and labelling and in large vocabulary speech recognition with another project within INESC (Lisbon).
These recommendations, together with the fact that almost 2 years elapsed between the submission of the proposal and the effective start of the project, led us to reformulate the tasks and the corresponding effort so that they remained up to date. The teams had also changed considerably, with several members leaving (mainly scholarship students) and new members joining, which further contributed to the need for this reformulation.
Regarding the speaker-independent phoneme recognition task, the recommendation was taken into account by merging the proposed work into the task of recognition of spelled names, which is the main application of the proposed work and is especially important nowadays in the context of directory information services.
Since most of the work originally planned on keyword spotting was carried out by the IT team in the context of digit recognition in the period between project submission and approval, this task was also changed to encompass keyword spotting in general instead of just digit spotting, and the original effort was considerably reduced. This change sought to take into account the recommendation of using other keyword spotting techniques not mentioned in the original proposal. At a later stage of the project, the goals of this task became more ambitious, encompassing topic detection as well. This extension was motivated by the fact that the state of the art in keyword spotting can now be considered advanced enough and that the two themes are closely related.
Regarding the task on "automatic segmentation and labelling", which was originally proposed for isolated digit recognition, the goal was extended to encompass connected digits and natural numbers. Since robustness is still an issue in the recognition of this type of small vocabulary, we have tried to deal with this problem by using, among other techniques, spectral pre-processing based on auditory models. The new, more general task is entitled "robust recognition of digits and natural numbers".
Since the task of speaker recognition was the most interesting one for the FEUP team, the task was not removed, but the initial effort was considerably reduced, namely in terms of participation of other teams.
The other recommendation concerned cooperation with other national teams working on speech recognition, namely the neural networks team at INESC, Lisbon (which participates in a PRAXIS XXI project on continuous speech recognition) and the team of the University of Aveiro (also INESC). This recommendation was taken into account by inviting members of these two teams to the most technically oriented meetings of the REC project. We gratefully acknowledge the contributions of the following researchers: João Paulo Neto, Ciro Martins and Hugo Meinedo (INESC / IST), Francisco Vaz and António Teixeira (Univ. Aveiro / INESC). For more information, see also cooperation with other projects.
After restructuring, the planned effort was defined as in the table below:
|TASK ||INESC ||IT ||FEUP ||TOTAL
|T1 - Recognition of spelled names ||6 ||20* ||- ||26
|T2 - Word and topic spotting ||6 ||11* ||8 ||25
|T3 - Robust recognition of digits and natural numbers ||20* ||20 ||- ||40
|T4 - Large vocabulary speech recognition ||24* ||2 ||2 ||28
|T5 - Speaker recognition ||- ||- ||15* ||15
|T6 - Automatic spoken language identification ||24* ||2 ||5 ||31
|TOTAL ||80 ||55 ||30 ||165
The team responsible for each task is indicated with * in the above table. The duration of each task is the duration of the project (3 years), except for task T6, which ends at the end of the second year.
- Aceleração do Algoritmo de Backpropagation, Vítor Pera and Carlos Espain, III Congresso Brasileiro de Redes Neurais, Florianopolis, Brazil, July 1997.
- Properties of Auditory Model Representations, Fernando Perdigão and Luís V. Sá, EUROSPEECH'97 - European Conference on Speech Technology, Greece, September 1997.
- Impact of "Ascending Sequence" AI (Auditory Primary Cortex) Cells on Stop Consonant Perception, Eduardo S. Marta and Luís V. Sá, EUROSPEECH'97 - European Conference on Speech Technology, Greece, September 1997.
- Recognition of Non-Native Accents, Carlos Teixeira, Isabel Trancoso and António Serralheiro, EUROSPEECH'97 - European Conference on Speech Technology, Greece, September 1997.
- A Vector Quantizer Architecture for an Automatic Speech Recognizer, A. J. Araujo, V. Pera, C. Espain, and J. Matos, XIII Conference on Design of Circuits and Integrated Systems, Madrid, Spain, February 1998.
- Language Identification Using Minimum Linguistic Information, Diamantino Caseiro and Isabel Trancoso, RECPAD'98 - 10th Portuguese Conference on Pattern Recognition, Lisbon, March 1998.
- Auditory Models as Front-Ends for Speech Recognition, Fernando Perdigão and Luís V. Sá, NATO Advanced Study Institute on Computational Hearing, Il Ciocco, Italy, July 1998.
- Auditory Cells with Frequency Resolution Sharper than Critical Bands Play a Role in Stop Consonant Perception: Evidence from Across-Language Recognition Experiments, Eduardo S. Marta and Luís V. Sá, NATO Advanced Study Institute on Computational Hearing, Il Ciocco, Italy, July 1998.
- Modelo Computacional da Cóclea Humana, Fernando Perdigão and Luís V. Sá, ACÚSTICA'98 - Congresso Ibérico de Acústica, Lisbon, September 1998.
- Language Identification of Spoken European Languages, Diamantino Caseiro and Isabel Trancoso, EUSIPCO'98 - European Signal Processing Conference, Rhodes, Greece, September 1998.
- Spoken Language Identification using the Speechdat Corpus, Diamantino Caseiro and Isabel Trancoso, ICSLP'98 - International Conference on Spoken Language Processing, Sydney, Australia, December 1998.
- Vector Quantizer Acceleration for an Automatic Speech Recognizer Application, A. J. Araujo, V. C. Pera and M. N. Souza, ICSLP'98 - International Conference on Spoken Language Processing, Sydney, Australia, December 1998.
- A Noise Suppression Technique using an Auditory Model, Fernando Perdigão and Luís V. Sá, CONFTELE'99 - 2nd Conference on Telecommunications, Sesimbra, April 1999.
- Digit Recognition Using the SPEECHDAT Corpus, Frederico Rodrigues and Isabel Trancoso, CONFTELE'99 - 2nd Conference on Telecommunications, Sesimbra, April 1999.
- Auditory Features for Human Communication of Stop Consonants under Full-Band and Low-Pass Conditions, Eduardo S. Marta and Luís V. Sá, EUROSPEECH'99 - European Conference on Speech Technology, Budapest, Hungary, September 1999.
- Developing a Voiced Information Retrieval System for the Portuguese Language Capable to Handle both Brazilian and Portuguese Spoken Versions, M. Souza, E. Caprini, C. Machado, M. Ludolf, L. Calôba, J. Seixas, F. Resende, S. Netto, D. Freitas, J. Teixeira, C. Espain, V. Pera and F. Moreira, EUROSPEECH'99 - European Conference on Speech Technology, Budapest, Hungary, September 1999.
- Auditory Features Underlying Cross-Language Human Capabilities in Stop Consonant Discrimination, Eduardo S. Marta and Luís V. Sá, MIST - Multi-lingual Interoperability in Speech Technology, Leusden, The Netherlands, September 1999.
- Anotação Fonética Automática de Corpora de Fala Transcritos Ortograficamente, Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, and Luís Oliveira, PROPOR'99 - IV Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, Évora, September 1999.
- Reconhecimento de Dígitos e Números Naturais, Frederico Rodrigues and Isabel Trancoso, PROPOR'99 - IV Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, Évora, September 1999.
- The Generalized Quickpropagation Algorithm, Carlos Espain and Vítor Pera, Signal and Image Processing 99, Bahamas, October 1999.
- Reconhecimento Computacional da Fala, Carlos Espain, Cadernos do CEFAT, Porto, 1999.
- Text-Independent Speaker Verification Using String Codebooks, Filipe Moreira and Carlos Espain, Proc. COST 250 MCM, Porto, 1999.
- Designing a SC-HMM System as a Tool for Future Research in Multi-stream Continuous Speech Recognition, Vítor Pera, Proc. COST 249 Meeting, Coimbra, 2000.
- Designing a SC-HMM System for Continuous Speech Recognition, Vítor Pera and Carlos Espain, Proc. REC Workshop, Lisbon, May 2000.
- Topic Detection in Spoken Documents, Rui Amaral and Isabel Trancoso, accepted for publication in ECDL'2000 - 4th European Conference on Research and Advanced Technology for Digital Libraries, Lisbon, September 2000.
- A Decoder for Finite-State Structured Search Spaces, Diamantino Caseiro and Isabel Trancoso, accepted for publication in ASR'2000 - ISCA Tutorial and Research Workshop on Automatic Speech Recognition: Challenges for the new Millenium, Paris, France, September 2000.
- An Overview of the REC Project - Speech Recognition Applied to Telecommunications, Isabel Trancoso, Fernando Perdigão, Carlos Espain, Diamantino Caseiro, Rui Amaral, Frederico Rodrigues, António Serralheiro, Eduardo Sá Marta, Vítor Pera and Luís Moreira, accepted for publication in PROPOR'2000 - V Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, São Paulo, Brazil, November 2000.
Besides the above mentioned papers, some oral presentations have taken place in which project REC, given its wide scope, has deserved a special mention:
- Reconhecimento de fala em Português Europeu, Isabel Trancoso, invited talk at PROPOR'98 - III Encontro para o Processamento Computacional do Português, Porto Alegre, Brazil, November 1998.
- Processamento de fala em Português, Isabel Trancoso, National Seminar "Dar voz à Sociedade de Informação", Lisbon, February 1999.
Please note: although the thesis titles have been translated into English, all the theses mentioned below were written in Portuguese.
During the first year of the project, two "Mestrado" theses were presented in the scope of two of the tasks:
- "Automatic annotation of telephone speech", by Rui Amaral (at the time member of the IT team, Coimbra), in February 1998. The thesis' aim was the automatic segmentation and labelling of isolated digit utterances, falling therefore in the scope of task T3.
- "Automatic spoken language identification", by Diamantino Caseiro (INESC), in April 1998. The work of task T6 was mostly done in the scope of this thesis.
During the second year of the project, two "Doutoramento" theses were presented in the scope of two of the tasks:
- "Models of the peripheral auditory system for automatic speech recognition", by Fernando Perdigão (IT), in June 1998 (task T3).
- "Recognition of Non-Native Speech", by Carlos Teixeira (INESC), in March 1999 (task T2).
During the third year of the project, two "Mestrado" theses were presented in the scope of two of the tasks:
- "Large vocabulary speech recognition for European Portuguese", by Manuel João Silva (INESC), in February 2000 (task T4).
- "Robust recognition of digits and natural numbers", by Frederico Rodrigues (INESC), in July 2000 (task T3).
Work on several other "Doutoramento" and "Mestrado" theses is currently in progress in the scope of some of the project tasks:
- Doctoral thesis of Eduardo Sá Marta (IT) on speaker independent phoneme recognition (task T1).
- Doctoral thesis of Vítor Pera (FEUP) on multi-streaming recognition techniques (task T4).
- Doctoral thesis of Rui Amaral (INESC) on spoken language topic detection, started in 1998 (task T2).
Although only indirectly related to the project theme - recognition of speech over the telephone network - the Doctoral thesis of Carlos Ribeiro (INESC), entitled "Phonetic vocoding - Speech coding based on phonetically classified segments", presented in February 2000, should also be mentioned here. In fact, all of its work on automatic phonetic segmentation was closely related to the REC project.
The Master's thesis of Ricardo Rodrigues (INESC) on recognition of spoken and spelled proper names, which started in 1997 (tasks T1 and T4), was interrupted by the student in March 2000, in order to pursue his professional career.
Cooperation with other Projects
The cooperation with the projects below (either of national or international scope) has significantly contributed to the progress of REC:
- High quality speech recognition in Portuguese (INESC)
Although this project was mainly oriented towards dictation applications, the cooperation with this team was particularly important for REC, including many fruitful discussions on acoustic modelling, pronunciation variants, language modelling, etc.
- VODIS - Voice Operated Driver Information Systems (INESC)
The most relevant contribution to the REC project was in terms of robustness, in particular in recognizing speech from non-native speakers.
- DIXI+ - A Portuguese Text-to-Speech Synthesizer For Alternative and Augmentative Communication (INESC)
The cooperation with this project has been very intensive, namely in terms of sharing tools such as the grapheme-to-phone conversion tool for automatically generating pronunciation lexica and the automatic aligner. The latter is crucial for generating automatically segmented spoken corpora, which are then used for training better acoustic models in a bootstrap process.
- COST 249 - Continuous speech recognition over the telephone line (IT)
- PAPO - Processamento Automático do Português (FEUP)
Cooperation project between Portuguese and Brazilian research teams, encompassing both speech recognition and synthesis in the two language variants.
- Portuguese-Spanish Integrated Action E 12/98 (FEUP)
Cooperation with the University of Vigo (Spain) in the context of Man-Machine Dialogue Systems. A decision methodology developed at this university was applied to speaker recognition, for merging results from several speaker recognizers.