The BREF corpus was designed to provide enough read speech data for the development and evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a large corpus of continuous speech for the acquisition of acoustic-phonetic knowledge of spoken French. All the recorded texts were selected from extracts of the French newspaper Le Monde so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments. The entire BREF corpus contains over 100 hours of speech material from 120 speakers.
The BREF-80 sub-corpus consists of two ISO 9660 CD-ROMs, BREF80-1 and BREF80-2, containing speaker-independent training data from 80 speakers. Together the two CDs contain 5330 sentences, an average of about 67 sentences per speaker. Although this data represents only a small portion of the entire BREF corpus, the sentences were selected to cover most of the BREF training prompts, so as to preserve a wide range of phonetic contexts with a minimum amount of speech data. The BREF-80 sub-corpus on these CDs was thus selected specifically for training speaker-independent, vocabulary-independent speech recognizers.