BD-PUBLICO
The BD-PUBLICO database (Base de Dados em Português eUropeu,
vocaBulário Largo, Independente do orador e fala
COntínua) was collected by INESC within the framework of a
European project (SPRACH) and a national project (PRAXIS XXI
Program), in collaboration with Instituto Superior
Técnico (IST) and the PÚBLICO newspaper. The corpus
was designed to support the development of large-vocabulary,
speaker-independent continuous speech recognition systems.
- Linguistic contents
The text material for the read sentences was extracted from the
Portuguese newspaper PÚBLICO, consisting of 6 months of news,
totalling 10M words and 156k different forms.
The corpus is divided into three sets:
- Training set: 80 sentences plus 3 calibration sentences for each speaker.
- Development set: 40 sentences plus 15 speaker-adaptation
sentences per speaker.
- Evaluation set: 40 sentences plus 15 speaker-adaptation sentences
and 3 calibration sentences for each speaker.
Two vocabulary sizes are covered: 5K and 20K (later recording phase).
- Number and type of speakers
Speakers were selected from among undergraduate and graduate students
at IST. Ages ranged from 19 to 28, and a broad coverage of accents
was obtained.
A total of 120 speakers were recorded: 100 for the training set (50
male and 50 female) and 20 (10 male and 10 female) divided
equally between the 5K-word sets (evaluation / development). Each recording
session resulted in approximately 15 minutes of speech.
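As a rough cross-check of the figures above, the following Python sketch tallies the read sentences per set and the approximate total speech duration. It rests on two assumptions that the description does not state explicitly: that each speaker took part in exactly one recording session, and that the 20 speakers of the 5K sets split into 10 for development and 10 for evaluation. The class and variable names are illustrative only, and the 20K recording phase is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusSet:
    """Per-speaker sentence counts for one set, as listed in the description."""
    name: str
    speakers: int
    read_sentences: int
    adaptation_sentences: int
    calibration_sentences: int

    @property
    def sentences_per_speaker(self) -> int:
        return (self.read_sentences
                + self.adaptation_sentences
                + self.calibration_sentences)

    @property
    def total_sentences(self) -> int:
        return self.speakers * self.sentences_per_speaker


# Assumption: the 20 development/evaluation speakers split 10 / 10.
SETS = [
    CorpusSet("training", 100, 80, 0, 3),
    CorpusSet("development (5K)", 10, 40, 15, 0),
    CorpusSet("evaluation (5K)", 10, 40, 15, 3),
]

# Assumption: one session of about 15 minutes per speaker.
MINUTES_PER_SESSION = 15

TOTAL_SPEAKERS = sum(s.speakers for s in SETS)
TOTAL_SENTENCES = sum(s.total_sentences for s in SETS)
APPROX_HOURS = TOTAL_SPEAKERS * MINUTES_PER_SESSION / 60

if __name__ == "__main__":
    for s in SETS:
        print(f"{s.name}: {s.speakers} speakers x "
              f"{s.sentences_per_speaker} sentences = {s.total_sentences}")
    print(f"speakers: {TOTAL_SPEAKERS}, sentences: {TOTAL_SENTENCES}, "
          f"approx. hours of speech: {APPROX_HOURS:.0f}")
```

Under these assumptions the three sets come to 8300, 550 and 580 sentences respectively, and the 120 sessions to roughly 30 hours of speech.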
- Data collection
The recordings were made in a soundproof room at INESC (Lisbon) using
a high-quality microphone, directly to disk at a 16 kHz sampling
frequency.
- Annotation
A pronunciation lexicon with citation phonemic transcriptions for each
word was produced by hand-correcting the automatically generated
transcriptions.
- Packaging
The corpus material amounts to more than 2 GB and was packed onto 4 CD-ROMs.