The CORAL corpus was collected in the framework of a national project sponsored by the PRAXIS XXI program, by a consortium formed by INESC, CLUL, FLUL (Faculdade de Letras da Universidade de Lisboa), and FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa). The purpose of this project is the collection of a spoken dialogue corpus in European Portuguese, with several levels of labelling: orthographic, phonetic, phonological, syntactic and semantic.
- Linguistic Contents:
56 dialogues about a predetermined subject: maps. One of the participants (the giver) has a map with some landmarks and a route drawn between them; the other (the follower) also has landmarks but no route, and consequently must reconstruct it. In order to elicit conversation, there are small differences between the two maps: one of the landmarks is duplicated in one map and single in the other; some landmarks are only present in one of the maps; and some have slightly different names in the two maps (e.g. curvas perigosas vs. troço sinuoso). In the 16 different maps, the names of the landmarks were chosen to allow the study of some connected speech phenomena:
o Sequences with /l/ favouring or not its velarization (e.g. sala malva, sal amargo)
o Sequences with /s/ in word final position followed by another coronal fricative (e.g. barcos salva-vidas)
o Sequences of plosives formed across word boundaries (e.g. clube de tiro)
o Sequences of obstruents formed within and across word boundaries (e.g. bairros degradados)
The last three items were designed to allow a more comprehensive study of consonant clusters formed within and across word boundaries and should, therefore, be jointly investigated.
- Number and Type of Speakers:
The original 32 speakers were divided into 8 quartets and, in each quartet, organized to take part in 8 dialogues. The available database contains 7 quartets, corresponding to 28 speakers. Given the reduced number of speakers, they were chosen to achieve an adequate balance of sexes, but were restricted in terms of age (undergraduate or graduate students) and accent (Lisbon area). Speakers were chosen in pairs who know each other, so that half of the conversations take place between "friends" and half between people who do not know each other.
- Data Collection:
The recordings take place in a soundproof room, with no visual contact between the speakers. They wear close-talking microphones and the recordings are made in stereo directly to DAT and later down-sampled to 16 kHz per channel. Recording levels are adjusted beforehand, and no monitoring is done once the dialogues start.
Only orthographic transcription was done for the whole corpus. A pilot recording was annotated at several levels.
Four files per dialogue are provided:
a) two RAW files: the audio files
b) two TRS files: the manual transcriptions. TRS is an XML-based format that standard transcription software such as Transcriber can open. Annotations in the TRS files are at word level. They are fine-grained transcriptions that include disfluencies. The characters in the text files are encoded in ISO-8859-1 (Latin-1).
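Since the TRS files are XML encoded in ISO-8859-1, they can also be read without Transcriber. The following is a minimal sketch in Python, not a description of how the corpus was produced. It assumes the usual Transcriber layout (Turn elements carrying speaker, startTime and endTime attributes, with Sync markers inside); the sample document, the speaker id "spk1" and the function name read_turns are hypothetical, and the CORAL files may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TRS document (NOT taken from the corpus), encoded
# as ISO-8859-1 bytes in the same way as the distributed files.
SAMPLE_TRS = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE Trans SYSTEM "trans-14.dtd">\n'
    '<Trans>\n'
    '<Episode>\n'
    '<Section type="report" startTime="0" endTime="4.2">\n'
    '<Turn speaker="spk1" startTime="0" endTime="4.2">\n'
    '<Sync time="0"/>\n'
    'segue pelo troço sinuoso\n'
    '<Sync time="2.5"/>\n'
    'até às curvas perigosas\n'
    '</Turn>\n'
    '</Section>\n'
    '</Episode>\n'
    '</Trans>\n'
).encode("iso-8859-1")


def read_turns(trs_bytes):
    """Return one (speaker, start, end, text) tuple per <Turn> element."""
    # Parsing bytes lets the XML declaration select the ISO-8859-1 decoder.
    root = ET.fromstring(trs_bytes)
    turns = []
    for turn in root.iter("Turn"):
        # itertext() yields the text following each <Sync/> marker;
        # split/join collapses the line breaks into single spaces.
        text = " ".join(" ".join(turn.itertext()).split())
        turns.append((
            turn.get("speaker"),
            float(turn.get("startTime")),
            float(turn.get("endTime")),
            text,
        ))
    return turns


for speaker, start, end, text in read_turns(SAMPLE_TRS):
    print(speaker, start, end, text)
```

For a file on disk, the same function can be given the raw bytes, for example the result of pathlib.Path(name).read_bytes(), so that decoding is left to the XML parser rather than done by hand.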
The corpus consists of 112 TRS and corresponding WAV files, and contains about 57K word tokens. The disk size is about 1.5 MB for the TRS files and 1.2 GB for the WAV files.
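The figures given in this description can be cross-checked with a little arithmetic. The sketch below is in Python and uses only numbers stated above, except for the last step: the average duration per audio file assumes 16-bit mono samples at 16 kHz and 1 GB = 10^9 bytes, neither of which is stated in the description, so that value is only a rough estimate.

```python
# Figures taken from the description above.
QUARTETS_AVAILABLE = 7
SPEAKERS_PER_QUARTET = 4
DIALOGUES_PER_QUARTET = 8
TRS_FILES_PER_DIALOGUE = 2
WORD_TOKENS = 57_000          # "about 57K word tokens"
AUDIO_BYTES = 1.2e9           # "1.2 GB", ASSUMING 1 GB = 10**9 bytes
SAMPLE_RATE_HZ = 16_000

# ASSUMPTION (not stated in the description): 16-bit samples, one channel per file.
BYTES_PER_SAMPLE = 2

speakers = QUARTETS_AVAILABLE * SPEAKERS_PER_QUARTET        # 28
dialogues = QUARTETS_AVAILABLE * DIALOGUES_PER_QUARTET      # 56
trs_files = dialogues * TRS_FILES_PER_DIALOGUE              # 112

tokens_per_dialogue = WORD_TOKENS / dialogues

# One audio file per TRS file, as the description says they correspond.
bytes_per_audio_file = AUDIO_BYTES / trs_files
minutes_per_audio_file = (
    bytes_per_audio_file / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE) / 60
)

print(f"speakers: {speakers}, dialogues: {dialogues}, TRS files: {trs_files}")
print(f"average tokens per dialogue: {tokens_per_dialogue:.0f}")
print(f"estimated minutes per audio file: {minutes_per_audio_file:.1f}")
```

The speaker, dialogue and file counts agree with the 28 speakers, 56 dialogues and 112 files quoted in the description; the duration estimate depends entirely on the sample-format assumption.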