BD-PÚBLICO Corpus

The BD-PÚBLICO database (Base de Dados em Português eUropeu, vocaBulário Largo, Independente do orador e fala COntínua) was collected by INESC within the framework of a European project (SPRACH) and a national project (PRAXIS XXI Program), in collaboration with Instituto Superior Técnico (IST) and the PÚBLICO newspaper. The corpus was aimed at the development of large-vocabulary, speaker-independent continuous speech recognition systems.

Linguistic Contents

The text material for the read sentences was extracted from 6 months of news from the Portuguese newspaper PÚBLICO, totalling 10M words and 156k distinct word forms.

The corpus is divided into 3 sets:

  • Training set: 80 sentences plus 3 calibration sentences for each speaker.
  • Development set: 40 sentences plus 15 speaker-adaptation sentences per speaker.
  • Evaluation set: 40 sentences plus 15 speaker-adaptation sentences and 3 calibration sentences for each speaker.

Two vocabulary sizes are covered: 5K and 20K (later recording phase).

Number and Type of Speakers

Speakers were selected from among undergraduate and graduate students at IST. Ages ranged from 19 to 28, and a broad coverage of accents was obtained.

A total of 120 speakers were recorded: 100 for the training set (50 male and 50 female) and 20 (10 male and 10 female) divided equally between the 5K-word evaluation and development sets. Each recording session resulted in approximately 15 minutes of speech.
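
As a rough cross-check of the figures above, the nominal size of each set can be tallied from the per-speaker sentence counts and the speaker counts. This is only a sketch: it assumes every speaker read the full set of sentences listed for their set, that the 20 non-training speakers split 10/10 between development and evaluation, and that each speaker took part in exactly one session of about 15 minutes. The names used are illustrative and are not part of the corpus distribution.

```python
# Per-speaker sentence counts, as listed under "Linguistic Contents".
SENTENCES_PER_SPEAKER = {
    "training": 80 + 3,          # 80 sentences + 3 calibration
    "development": 40 + 15,      # 40 sentences + 15 speaker-adaptation
    "evaluation": 40 + 15 + 3,   # 40 + 15 speaker-adaptation + 3 calibration
}

# Speaker counts: 100 for training; the remaining 20 are assumed to be
# split 10/10 between development and evaluation ("divided equally").
SPEAKERS = {"training": 100, "development": 10, "evaluation": 10}

# Approximate session length; one session per speaker is an assumption.
MINUTES_PER_SESSION = 15


def nominal_totals():
    """Return the nominal number of read sentences in each set."""
    return {name: SENTENCES_PER_SPEAKER[name] * SPEAKERS[name]
            for name in SPEAKERS}


def nominal_hours():
    """Return the approximate total amount of recorded speech, in hours."""
    return sum(SPEAKERS.values()) * MINUTES_PER_SESSION / 60


if __name__ == "__main__":
    for name, total in nominal_totals().items():
        print(f"{name}: {total} sentences")
    print(f"approx. {nominal_hours():.0f} hours of speech")
```

Under these assumptions the tally gives 8300 training, 550 development and 580 evaluation sentences, and roughly 30 hours of speech. The actual contents of the distribution may differ.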

Data Collection

The recordings were made in a soundproof room at INESC (Lisbon) using a high-quality microphone, directly to disk at a 16 kHz sampling frequency.

Annotation

A pronunciation lexicon with citation phonemic transcriptions for each word was produced by hand-correcting the automatically generated transcriptions.

Packaging

The corpus material amounts to more than 2 GB and was packed into 4 CD-ROMs.