EUROM.1
The EUROM.1 corpus for European Portuguese was collected jointly by
INESC and CLUL in the framework of the SAM_A European project. SAM_A
extended an earlier project, SAM (Speech Assessment Methods), during
which planning first began for a multilingual resource to serve the
Spoken Language Engineering needs of the European Union.
Although it is mainly used for recognition and synthesis research,
this corpus has also been used in our group for phonetic coding
research.
- Linguistic contents
For each of the 11 languages covered by this project, 4 types of
corpus material were collected:
- CVC material (totalling 121 different logatomes) in isolation
and in context (5 carrier phrases)
- 100 numbers selected from the range 0-9999
- 40 short passages each containing 5 thematically connected
sentences (half of the passages were freely translated from the
English version of EUROM.1; most of the remaining ones were adapted
from Portuguese books and newspapers)
- 50 filler sentences to compensate for the phoneme-frequency imbalance in the passages
- Number and type of speakers
The corpus was divided into 3 subsets:
- Many Talker Corpus (30 male + 30 female speakers): 100 numbers, 3
passages, 5 sentences
- Few Talker Corpus (5 male + 5 female, selected from MANY):
5 x CVCs, 5 x 100 numbers, 15 passages and 25 sentences.
- Very Few Talker Corpus (1 male + 1 female selected from
FEW): CVC in context.
The speakers were selected to cover a wide range of age groups and
normal voice types. One main accent group was selected (Lisbon area),
together with a small number of speakers from other accent regions.
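The subset structure listed above can be sketched in a few lines of Python. This is only an illustration, not part of the corpus distribution: the names SUBSETS, items_per_speaker and total_items are hypothetical, and the sketch assumes that the quantities listed for each subset are read by every speaker in that subset. A passage is counted as one item, and the CVC material is left out because no per-speaker item count is given for it.

```python
# Reading load for the two subsets whose item counts are listed.
# ASSUMPTION: the listed quantities apply to each speaker of the subset.
# CVC material is omitted: its per-speaker item count is not stated.
SUBSETS = {
    "MANY": {"speakers": 30 + 30, "numbers": 100, "passages": 3, "sentences": 5},
    "FEW": {"speakers": 5 + 5, "numbers": 5 * 100, "passages": 15, "sentences": 25},
}


def items_per_speaker(subset: str) -> int:
    """Prompted items per speaker (a passage counts as one item)."""
    s = SUBSETS[subset]
    return s["numbers"] + s["passages"] + s["sentences"]


def total_items(subset: str) -> int:
    """Prompted items over all speakers of the subset."""
    return SUBSETS[subset]["speakers"] * items_per_speaker(subset)


for name in SUBSETS:
    print(name, items_per_speaker(name), total_items(name))
```

Under these assumptions, each Few Talker speaker reads five times as many items as a Many Talker speaker, which is why the two subsets yield totals of the same order despite the sixfold difference in speaker count.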
- Data collection
The recordings were made in an anechoic chamber using a high-quality
microphone, and were stored both directly to disc (through an A/D
board) and on DAT tape. The EUROPEC program was used to prompt the
speakers, displaying the items to be read on the computer screen. The
sampling frequency was 20 kHz. Calibration also followed the SAM
recommendations, and the recordings were carefully monitored.
- Annotation
The label files were produced in the format defined by the SAM
project. Besides the orthographic transcription, they include
information about the signal file and the recording session, among
other items.
- Packaging
The corpus is distributed on 5 CD-ROMs and totals 2.6 GB.
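As a rough back-of-the-envelope check, the total size can be related to the 20 kHz sampling frequency given under Data collection. The sketch below is only an upper-bound estimate under assumptions that are not stated in this description: 16-bit single-channel samples, a total of 2.6 x 10^9 bytes, and no space taken by label files or headers. The function name max_hours is hypothetical.

```python
SAMPLE_RATE_HZ = 20_000   # from the Data collection section
BYTES_PER_SAMPLE = 2      # ASSUMPTION: 16-bit samples (not stated in the text)
CHANNELS = 1              # ASSUMPTION: single-channel recordings
CORPUS_BYTES = 2.6e9      # ASSUMPTION: total size read as 2.6 x 10^9 bytes


def max_hours(total_bytes: float, rate_hz: int, width: int, channels: int) -> float:
    """Upper bound on audio duration if all bytes were signal data."""
    bytes_per_second = rate_hz * width * channels
    return total_bytes / bytes_per_second / 3600


hours = max_hours(CORPUS_BYTES, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
print(f"at most about {hours:.1f} hours of signal")
```

With these assumptions the bound comes to roughly 18 hours; a different sample width or channel count would scale the figure accordingly.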