The EUROM.1 corpus for European Portuguese was collected jointly by INESC and CLUL within the SAM_A European project. SAM_A extended an earlier project, SAM (Speech Assessment Methods), in which planning first began for a multilingual resource serving the Spoken Language Engineering needs of the European Union. Although the corpus is mainly used for recognition and synthesis research, our group has also used it for phonetic coding research.
For each of the 11 languages covered by this project, four types of corpus material were collected:
The corpus was structured into 3 target corpora subsets:
The speakers were selected to cover a wide range of age groups and normal voice types. One main accent group was selected (Lisbon area), together with a small number of speakers from other accent regions.
The recordings were made in an anechoic chamber using a high-quality microphone, and were stored both directly to disc (through an A/D board) and on DAT tape. The EUROPEC program was used to prompt the speakers, displaying each item to be read on the computer screen. The sampling frequency was 20 kHz. Calibration also followed the SAM recommendations, and the recordings were carefully monitored.
The label files follow the format defined by the SAM project. Besides the orthographic transcription, they include information about the signal file and the recording session, among other items.
The corpus is distributed on 5 CD-ROMs and totals 2.6 GB.
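As a rough sanity check on these figures, the sketch below estimates how much speech the quoted total size could hold at the 20 kHz sampling frequency. It assumes 16-bit, single-channel samples and reads the total as 2.6 gigabytes (10^9 bytes); neither assumption is stated in this description, and the function name is purely illustrative. Because the discs also hold label files, the result is only an upper bound.

```python
def estimated_hours(total_bytes, sample_rate_hz=20_000,
                    bytes_per_sample=2, channels=1):
    """Upper-bound estimate of audio duration in hours.

    bytes_per_sample=2 (16-bit) and channels=1 (mono) are assumptions,
    not values given in the corpus description.
    """
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    total_bytes = 2.6e9  # quoted corpus size, read as 2.6 * 10^9 bytes
    print(f"at most about {estimated_hours(total_bytes):.1f} hours of speech")
```

Under these assumptions the corpus would correspond to at most roughly 18 hours of recorded signal; a different sample width or channel count would scale the estimate proportionally.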