SPEECHDAT
The SPEECHDAT corpus collection for European Portuguese was divided
into 2 phases: collection of 1000 telephone calls (preparatory MLAP
Project SPEECHDAT-M); and collection of 4000 telephone calls (Language
Engineering Project SPEECHDAT-II). The project incorporates databases
for all official languages of the EU and some major dialectal
variants. The work was done by INESC under a subcontract with Portugal
Telecom.
Goal: a realistic corpus for training and assessing recognizers of
isolated and continuous speech utterances (whole-word or subword
approaches), which can be used for developing voice-driven
teleservices.
- Linguistic contents
In the second phase, each speaker is asked to answer 7 spontaneous
questions, some of them related to demographic information (e.g. date
and place of birth) and to read a prompt sheet with 33 items. 4000
different prompt sheets were produced. The material for each speaker
comprises (see example of prompt sheet):
- 3 application words (chosen from a vocabulary of 30)
- 1 sequence of isolated digits
- 4 connected digit strings (prompt sheet number, telephone number, credit card number, PIN code)
- 1 word spotting phrase
- 1 isolated digit, 1 natural number and 1 currency amount
- 3 spelled words (1 spontaneous (forename) + 2 read)
- 5 directory assistance names (2 spontaneous (forename and place of birth) + 3 read)
- 2 questions (predominantly yes and no, but also fuzzy answers)
- 3 dates (1 spontaneous (date of birth) + 2 read)
- 2 time phrases (1 spontaneous (time of day) + 1 read)
- 4 phonetically rich words (chosen from a set of 4000)
- 9 phonetically rich sentences (chosen from a set of 3600)
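The counts in the list above can be checked against the 7 spontaneous questions and 33 read items mentioned earlier. The following is a minimal sketch of that arithmetic; the dictionary name and the split of each line into (spontaneous, read) are illustrative, and it assumes that the 2 questions count as spontaneous answers, which the list does not state explicitly.

```python
# Item counts per speaker, copied from the list above as (spontaneous, read).
# Assumption: the 2 questions (yes/no/fuzzy answers) are spontaneous items.
ITEMS = {
    "application words": (0, 3),
    "sequence of isolated digits": (0, 1),
    "connected digits": (0, 4),
    "word spotting phrase": (0, 1),
    "isolated digit, natural number, currency amount": (0, 3),
    "spelled words": (1, 2),
    "directory assistance names": (2, 3),
    "questions": (2, 0),
    "dates": (1, 2),
    "time phrases": (1, 1),
    "phonetically rich words": (0, 4),
    "phonetically rich sentences": (0, 9),
}

spontaneous = sum(s for s, _ in ITEMS.values())
read = sum(r for _, r in ITEMS.values())
total = spontaneous + read

print(f"spontaneous={spontaneous} read={read} total={total}")
```

Under that assumption the totals agree with the text: 7 spontaneous answers plus 33 read prompt-sheet items, 40 recorded items per speaker.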
- Number and type of speakers
Speakers are selected from among employees of Portugal Telecom and
their relatives and friends, achieving broad regional coverage.
Each of the 3 main age groups considered (16-30, 31-45 and 46-60)
accounts for more than 20% of the speakers. Gender distribution is
close to ideal (47% male and 53% female).
- Data collection
The design of the collection platform (PC with 2 Dialogic boards) and
the data collection itself are the responsibility of INESCTEL.
PCM A-law format was adopted.
- Annotation
Each speech file has an ASCII SAM label file with
information about the calling session, recording
conditions, speaker sex, age and accent, signal file, recording date
and time, assessment codes, and the label file body itself, which
includes the prompting script and the orthographic transcription.
A pronunciation lexicon with citation phonemic transcriptions for each
word is also produced.
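The label files described above are plain ASCII, so they can be read with a few lines of code. The sketch below assumes each line has the form `KEY: value`, which is the general shape of SAM label files; the function name, the mnemonics and the values in the sample are illustrative only and are not taken from the actual corpus.

```python
def parse_label(text: str) -> dict[str, list[str]]:
    """Collect the `KEY: value` lines of a label file into a dictionary.

    A key may occur more than once, so each key maps to a list of values.
    """
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as times stay intact.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # skip blank or malformed lines
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Hypothetical sample: mnemonics and values are made up for illustration.
SAMPLE = """\
SEX: F
AGE: 34
RET: 14:32:05
LBO: 0,,,vinte e cinco
"""

label = parse_label(SAMPLE)
print(label["SEX"], label["RET"])
```

Keeping a list per key is a deliberate choice in this sketch, so that repeated entries are not silently overwritten.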
- Packaging
The corpus material for the first phase is stored on 3 CD-ROMs (one for
the phonetically rich sentences), using compressed signal files. The
first 1000 speakers of the second phase are also stored on 3 CD-ROMs,
but with uncompressed signal files.
Webpage of the SPEECHDAT-II project with recordings of the Portuguese database