BDFALA
The BDFALA corpus was jointly developed by INESC and
CLUL in the framework of the national project sponsored by JNICT
(Program Lusitânia). Its goal was to enlarge the EUROM.1 corpus, mainly
to improve speech synthesis systems.
- Linguistic contents
Six types of corpus material were collected:
- ~4600 isolated words
- 350 sentences for prosodic studies
- 18 phonetically-complete paragraphs
- 60 read paragraphs extracted from television debates
- ~3000 logatomes
- 600 phonetically rich sentences
- Number and type of speakers
The 8 speakers were selected to achieve a balance in terms of sex and
age group and, as far as possible, from among the speakers of the
EUROM.1 corpus. The last two corpus types (logatomes and phonetically
rich sentences) were spoken by only one male and one female speaker. A
subset was also read by two young speakers (one male and one female),
12-14 years old, who were also recorded in EUROM.1.
- Data collection
Data collection took place in a sound-proof room. Two recording modes
were adopted. In the first mode (isolated words and logatomes), the
material was read from paper and recorded directly to DAT; the speech
material was then semi-automatically segmented and validated a
posteriori. In the second mode (remaining sentences and paragraphs), a
self-monitoring program was used, which recorded directly to disk. The
recordings were duly calibrated in both cases. The sampling frequency
was 16 kHz.
- Annotation
For each spoken item, the corresponding orthographic script is saved
in a separate ASCII file. A pronunciation lexicon with citation-form
phonemic transcriptions for each word is also included. These
transcriptions were produced automatically and hand-corrected a
posteriori.
- Packaging
The corpus material amounts to around 2.4 GB and is stored on 4 CD-ROMs.
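
As a rough orientation only, the stated size and the 16 kHz sampling
frequency can be turned into an upper bound on the amount of recorded
speech. The sketch below is not part of the corpus documentation. It
assumes that the size is in gigabytes (10^9 bytes), that the audio is
uncompressed 16-bit mono PCM, and that all of the space is taken up by
audio; none of these is stated above. The function name
estimated_hours is hypothetical.

```python
def estimated_hours(total_bytes, sample_rate_hz=16_000,
                    bytes_per_sample=2, channels=1):
    """Estimate audio duration in hours from a raw data size.

    Assumptions (not confirmed by the corpus description):
    uncompressed PCM, 16-bit samples (2 bytes), one channel.
    """
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


# About 2.4 GB, read here as 2.4 * 10^9 bytes (assumption).
corpus_bytes = 2.4e9
print(f"{estimated_hours(corpus_bytes):.1f} hours")
```

Under these assumptions the estimate is about 20.8 hours. The true
figure is lower to the extent that the discs also hold annotation
files, the lexicon and file headers.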