End: 30/06/99 (extended by 6 months beyond the initially planned 2 years)
INESC (Instituto de Engenharia de Sistemas e Computadores, Lisbon) - Speech Processing Group
CLUL (Centro de Linguística da Universidade de Lisboa)
FLUL (Faculdade de Letras da Universidade de Lisboa)
FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa)
The INESC-CLUL team subcontracted several people for the task of orthographic annotation; their work was directly supervised by Dr. Isabel Mascarenhas:
The purpose of this project is the collection of a spoken dialogue corpus, with several levels of labelling: orthographic, phonetic, prosodic, syntactic and semantic. The corpus should be sufficiently representative in terms of number of speakers, and it should focus on a selected theme in order to limit a priori the vocabulary used. This type of corpus is essential to research in spontaneous speech processing, which is characterised by a number of phenomena that seriously hinder its automatic understanding - hesitations, restarts, ill-formed sentences, etc. It is also essential for the study of dialogues, in particular of their structuring and integration with speech recognition. The project does not envisage studying all these problems but rather creating a linguistic infrastructure which enables this study in future projects by interdisciplinary research teams, such as the one involved in its creation. It is therefore important that, besides including the transliteration of the entire corpus, with an indication of all the para-linguistic phenomena, the corpus also includes labelling at other levels - phonetic, prosodic, syntactic and semantic. Although there are automatic tools for certain types of labelling, they are considerably less robust for spontaneous speech than for read speech, which means that most of this work is manual and therefore demands human resources well above the scope of this program. Hence, only a subset of the corpus will include all the types of labelling.
The project starts with a design phase in which the topic will be chosen, and the number of speakers and other parameters whose variability we wish to study will be specified. This phase will be followed by the collection phase and the successive labelling stages, with some overlap between them. The project ends with the preparation and packaging of the data files for CD-ROM pressing, in order to allow their wide dissemination among the community of Portuguese language researchers.
The main achievement of the CORAL project was the production of a linguistic resource that did not exist for European Portuguese at the time of its proposal - a spoken dialogue corpus, with several levels of labelling, which is sufficiently significant in terms of number of speakers (32, grouped into 8 quartets, amounting to 64 dialogues), and which is focused on a pre-selected theme in order to restrict a priori the scope of the vocabulary (the well-known map task).
This type of corpus is, in fact, essential for the progress of research in processing spontaneous speech, which is characterized by several phenomena that seriously affect the task of automatic speech understanding. This type of corpus is also important for the study of dialogue, particularly of its structure and relationship with speech understanding in the scope of spoken human-machine interfaces. We think that such a linguistic resource will allow the study of the above-mentioned problems in projects to be defined at a later stage.
A systematic exploitation of this corpus, ranging from testing the adequacy of the segmentation/labelling criteria to a more detailed study of the mapping between the several analysis levels, is clearly beyond the objectives of the proposal.
The corpus is presently available on 5 CD-ROMs, amounting to 1.6 GB if only the signal files are counted, at a sampling frequency of 16 kHz. It can also be made available in WAV format. All dialogues have been annotated orthographically. Only a relatively small subset has been annotated at the other levels. The only multi-level annotated dialogue included on the CD-ROMs is the pilot dialogue. For further information about the corpus and its availability, please contact Isabel Trancoso.
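As a rough guide to what the size and sampling frequency quoted above imply, the total amount of recorded signal can be estimated as follows. This is only a sketch under stated assumptions: the size is read as 1.6 decimal gigabytes, and the sample width (16-bit linear PCM) and the number of channels are not given in this description, so both are hypothetical values chosen for illustration.

```python
def audio_hours(total_bytes, sample_rate_hz, bytes_per_sample, channels):
    """Return the duration, in hours, of uncompressed PCM audio of a given size."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


TOTAL_BYTES = 1.6e9       # signal files only; assumes decimal gigabytes
SAMPLE_RATE_HZ = 16_000   # sampling frequency stated for the corpus
BYTES_PER_SAMPLE = 2      # assumption: 16-bit linear PCM (not stated in the text)

# The channel layout is not stated either, so both cases are shown.
for channels in (1, 2):
    hours = audio_hours(TOTAL_BYTES, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, channels)
    print(f"{channels} channel(s): about {hours:.1f} hours of signal")
```

Under these assumptions the signal files would correspond to roughly 14 hours of single-channel audio, or about 7 hours if each file holds two channels. The actual figure depends on the real file format of the corpus.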
Example of orthographic labelling (pilot dialogue - see the maps and general description in the oral presentation of the project mentioned above). Given their length, annotation files at other levels were not included on this page.