ALERT

The ALERT corpus was collected within the European project of the same name, with the goal of gathering material for training and evaluating several components of the ALERT media watch system for European Portuguese.

RTP, as the Portuguese data provider in this project, was responsible for collecting the data at their premises. INESC ID was responsible for defining the recording schedule, helping to train the annotators, verifying the annotations, and packaging the data. 4VDO was responsible for defining and setting up the recording conditions. The orthographic transcription was done jointly by RTP and INESC ID with the Transcriber tool, following the LDC Hub4 (Broadcast Speech) transcription conventions.

The corpus has 3 main parts:

Prior to the collection of the SRC and TDC corpora, we collected a relatively small Pilot Corpus, which was used to discuss and set up the collection process and to decide on the most appropriate kinds of programs to collect. This corpus was recorded during one week in April 2000 and amounts to 5.5 hours. For the Pilot Corpus, the audio was recorded at 44.1 kHz with 16 bits/sample. The final corpus was recorded at 32 kHz. Both were later downsampled to 16 kHz. The Pilot Corpus is the only corpus for which we also collected video data (MPEG-1). Manually corrected orthographic transcriptions and topic labels were added to this corpus.
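As an illustration of the sampling rates mentioned above, the following minimal sketch computes the rational resampling factors needed to go from 44.1 kHz or 32 kHz to 16 kHz, and estimates the size of uncompressed 16-bit PCM audio. It is not the procedure used in the project, which is not documented here. The function names are hypothetical, and the single-channel (mono) default is an assumption, since the number of channels is not stated.

```python
from fractions import Fraction

TARGET_RATE_HZ = 16_000  # final sampling rate stated in the corpus description


def resample_factors(source_rate_hz, target_rate_hz=TARGET_RATE_HZ):
    """Return (up, down) such that source_rate * up / down == target_rate."""
    ratio = Fraction(target_rate_hz, source_rate_hz)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


def pcm_size_bytes(duration_s, rate_hz, bits_per_sample=16, channels=1):
    """Estimate the size of uncompressed PCM audio.

    channels=1 (mono) is an assumption; the source does not state it.
    """
    samples = round(duration_s * rate_hz)
    return samples * channels * (bits_per_sample // 8)


if __name__ == "__main__":
    print(resample_factors(44_100))  # Pilot Corpus rate -> (160, 441)
    print(resample_factors(32_000))  # final corpus rate -> (1, 2)

    pilot_seconds = 5.5 * 3600  # 5.5 hours of Pilot Corpus audio
    print(pcm_size_bytes(pilot_seconds, TARGET_RATE_HZ))
```

The factors show why the two cases differ in practice: 32 kHz to 16 kHz is a plain decimation by 2, whereas 44.1 kHz to 16 kHz needs upsampling by 160 followed by downsampling by 441, which is typically done with a polyphase resampler.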