RTP, as the Portuguese data provider in this project, was responsible for collecting the data at its premises. INESC ID was responsible for defining the recording schedule, helping to train the annotators, verifying the annotations and packaging the data. 4VDO was responsible for defining and setting up the recording conditions. The orthographic transcription was carried out jointly by RTP and INESC ID using the Transcriber tool, following the LDC Hub4 (Broadcast Speech) transcription conventions.
The corpus has 3 main parts:
The Speech Recognition Corpus was collected from November 2000 through January 2001. It includes 122 programs of different types and schedules, amounting to 76 hours of audio data. The training data of the speech recognition corpus was recorded during October and November of 2000 (61 hours). The development data was recorded during one week in December (8 hours) and the evaluation data during one week in January (6 hours).
The orthographic transcriptions of this corpus were first automatically produced and later manually verified.
The Topic Detection Corpus contains data from 133 TV broadcasts of the 8 o'clock evening news program. It comprises close to 300 hours of recordings, made daily over a period of 9 months starting in February 2001.
For the Topic Detection Corpus, only the automatic orthographic transcriptions are available, together with the manual segmentation and indexing of the stories made by the RTP staff in charge of the daily program indexing. Each show was manually segmented into stories, and each story was manually classified according to a thematic, geographic and onomastic (names of persons, companies and institutions) thesaurus. Commercial breaks were annotated as non-news data. The thesaurus is currently structured into 21 thematic areas, each of which is hierarchically subdivided. The structure of this thesaurus follows rules which are generally adopted within the EBU (European Broadcasting Union).
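As an illustration only, the story-level annotation described above could be represented along the following lines. This is a minimal sketch: the class names, field names, time values and descriptor labels are all hypothetical and are not taken from the corpus distribution, whose actual file format is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Story:
    """One manually segmented stretch of a show (hypothetical layout)."""
    start_s: float                      # start time in seconds
    end_s: float                        # end time in seconds
    is_news: bool = True                # False for commercial breaks (non-news data)
    thematic: List[str] = field(default_factory=list)    # thematic descriptors
    geographic: List[str] = field(default_factory=list)  # geographic descriptors
    onomastic: List[str] = field(default_factory=list)   # persons, companies, institutions


@dataclass
class Show:
    """One broadcast of the evening news program (hypothetical layout)."""
    broadcast_date: str
    stories: List[Story] = field(default_factory=list)

    def news_stories(self) -> List[Story]:
        # Keep only segments annotated as news.
        return [s for s in self.stories if s.is_news]

    def news_duration_s(self) -> float:
        # Total duration of the news segments, in seconds.
        return sum(s.end_s - s.start_s for s in self.news_stories())


# Placeholder values for illustration; not real corpus data.
show = Show(
    broadcast_date="2001-02-01",
    stories=[
        Story(0.0, 120.0, thematic=["ExampleTheme"], geographic=["ExamplePlace"]),
        Story(120.0, 300.0, is_news=False),  # commercial break
        Story(300.0, 390.0, thematic=["OtherTheme"], onomastic=["ExampleName"]),
    ],
)
```

Keeping the three descriptor families in separate fields mirrors the thematic, geographic and onomastic split of the thesaurus, and the `is_news` flag allows commercial breaks to be filtered out before any topic-level processing.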