C-ORAL-ROM - Integrated reference corpora for spoken romance languages. Multi-media edition; tools of analysis; standard linguistic measurements for validation in HLT
The C-ORAL-ROM resource is a multilingual corpus of spontaneous (1) speech, of around 1,200,000 words, for the main Romance languages (IST 2000-26228). The resource comprises three components: the multimedia corpus, the speech software and the appendix.
The corpus consists of four comparable recording collections of Italian, French, Portuguese and Spanish spontaneous speech sessions (around 300,000 words for each language). The collections are provided, respectively, by:
* Università di Firenze (Dipartimento di Italianistica, LABLITA);
* Université de Provence (Description Linguistique Informatisée sur Corpus);
* Fundação da Universidade de Lisboa/Centro de Linguística da Universidade de Lisboa;
* Universidad Autónoma de Madrid (Departamento de Lingüística, Lenguas Modernas, Lógica y F. de la Ciencia, Laboratorio de Lingüística Informática).
The C-ORAL-ROM corpus provides the acoustic source of each session together with the following main annotations:
* The orthographic transcription, in CHAT format, enriched with the tagging of terminal and non-terminal prosodic breaks;
* Session metadata;
* The text-to-speech synchronization, in WIN PITCH CORPUS format, based on the alignment of each transcribed utterance.
The multimedia corpus comes with the speech software Win Pitch Corpus (© Pitch France. Minimal configuration: Pentium III, 1 GHz, 252 MB RAM, Sound Blaster or compatible sound card, running under Windows 2000 or XP only; GDPLUS.dll must be installed in the same directory as the program). (2) A series of appendices is also provided, containing: a) the purely textual corpus in .TXT and .XML format; b) the PoS tagging of all transcripts and the corresponding frequency lists of lemma forms, in .TXT files; c) a set of linguistic measurements extracted from the main corpus annotations, in Excel files; d) the specifications and validation of the resource; e) corpus metadata.
1. DVDs 1 to 8 contain the multimedia corpus edition (DVDs 1-2 French; DVDs 3-4 Italian; DVDs 5-6 Portuguese; DVDs 7-8 Spanish). All collections have the same folder structure, which directly mirrors the C-ORAL-ROM corpus design (see below). For each session, the following files are delivered in its folder:
* the uncompressed .WAV files (Windows PCM: 22,050 Hz; 16 bit);
* the .TXT file of the transcripts;
* the .XML file defining the text-to-speech alignment in WIN PITCH CORPUS format, and its .DTD.
2. The CD contains the speech software and the Appendix:
* The speech software Win Pitch Corpus (10 licenses);
* The C-ORAL-ROM transcription files in .TXT and .XML format;
* The C-ORAL-ROM transcription files with PoS tagging, in .TXT files;
* The frequency list of lemmas for each language collection, in .TXT files;
* Measurements of spoken language variability, in Excel files;
* The corpus specifications:
  c) Dialogue representation format;
  g) PoS tagging and lemma formats;
* Resource validation reports;
* Multimedia sample files.
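The audio format stated for the session recordings (uncompressed Windows PCM, 22,050 Hz, 16 bit) can be checked programmatically after copying the files from the DVDs. The following is a minimal sketch using only the Python standard library; the function name `check_wav_format` and the returned dictionary are illustrative assumptions and not part of the resource or of Win Pitch Corpus.

```python
import wave

EXPECTED_RATE_HZ = 22050   # sampling rate stated in the resource description
EXPECTED_SAMPLE_WIDTH = 2  # 16 bit = 2 bytes per sample


def check_wav_format(path):
    """Return (ok, details) for one uncompressed PCM .WAV file.

    `ok` is True when the header matches the stated 22,050 Hz / 16 bit format.
    The number of channels is reported but not checked, because the
    description does not state it.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    details = {
        "rate_hz": rate,
        "bits": width * 8,
        "channels": channels,
        "duration_s": frames / rate if rate else 0.0,
    }
    ok = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH
    return ok, details
```

Calling `check_wav_format` on each session file and collecting the paths for which `ok` is false gives a quick integrity report before any alignment or acoustic analysis is attempted.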
The resource aims to represent the variety of speech acts performed in everyday language and to enable the induction of prosodic and syntactic structures in the four Romance languages, from both a quantitative and a qualitative point of view. The resource has been designed for prosodic modeling, test-bed procedures in HLT and corpus-based studies of spontaneous speech. C-ORAL-ROM offers added value at the following levels:
* Corpus design
* Dialogue representation
* Prosodic annotation
* PoS tagging
* Multimedia storage
* Speech analysis
The corpus design of the C-ORAL-ROM resource aims to make it possible for a large variety of speech act types and natural prosodic contours to occur, these being the most characteristic linguistic features of spontaneous speech. To this end, the main variation parameters of the spoken domain (channel variation, dialogue structure, sociological domain of use, and semantic domain of application) are represented in a corpus design schema that covers a wide range of semantic and pragmatic domains of application.
The four language collections are considered comparable insofar as they fit the corpus design schema. More specifically, each language collection in the C-ORAL-ROM corpus is consistent with the following average structure (check the documentation for deviations):
INFORMAL/150,000 words from at least 64 texts of 1,500 words each and 10 texts of 4,500 words each
INFORMAL/Family-Private context/124,500 words
INFORMAL/Family-Private context/Monologues/42,000 words
INFORMAL/Family-Private context/Dialogues-Conversations/82,500 words
INFORMAL/Public context/25,500 words
INFORMAL/Public context/Monologues/6,000 words
INFORMAL/Public context/Dialogues-Conversations/19,500 words
FORMAL/150,000 words
FORMAL/Formal in natural context/2 or 3 samples averaging 3,000 words for each of the following typical domains of use, 65,000 words in total
FORMAL/Formal in natural context/political speech
FORMAL/Formal in natural context/political debate
FORMAL/Formal in natural context/preaching
FORMAL/Formal in natural context/teaching
FORMAL/Formal in natural context/professional explanation
FORMAL/Formal in natural context/conference
FORMAL/Formal in natural context/business
FORMAL/Formal in natural context/law (through media allowed)
FORMAL/Media context/2 or 3 samples averaging 3,000 words for each of the following typical domains of use, 60,000 words in total
FORMAL/Media context/news (small sample)
FORMAL/Media context/meteo (small sample)
FORMAL/Media context/scientific press
FORMAL/Media context/sport talk shows
FORMAL/Media context/political debate
FORMAL/Media context/talk shows thematic discussions
FORMAL/Media context/talk shows culture
FORMAL/Media context/talk shows science
FORMAL/Telephone/25,000 words (3)
FORMAL/Telephone/phone to call services or man-machine interaction (10,000 words) (4)
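The word targets in the design schema can be cross-checked with a little arithmetic: the sub-branches should add up to 150,000 words for each of INFORMAL and FORMAL, and to about 300,000 words per language collection. The sketch below encodes the figures listed above in a nested dictionary; the names `DESIGN` and `total_words` are illustrative assumptions, and the individual domains of use are left out because the schema gives no word counts for them.

```python
# Word targets per language collection, copied from the design schema.
DESIGN = {
    "INFORMAL": {
        "Family-Private context": {
            "Monologues": 42_000,
            "Dialogues-Conversations": 82_500,
        },
        "Public context": {
            "Monologues": 6_000,
            "Dialogues-Conversations": 19_500,
        },
    },
    "FORMAL": {
        "Formal in natural context": 65_000,
        "Media context": 60_000,
        "Telephone": 25_000,
    },
}


def total_words(node):
    """Sum the word targets of a schema branch (a dict) or leaf (an int)."""
    if isinstance(node, int):
        return node
    return sum(total_words(child) for child in node.values())


if __name__ == "__main__":
    informal = DESIGN["INFORMAL"]
    print(total_words(informal["Family-Private context"]))  # 124500
    print(total_words(informal["Public context"]))          # 25500
    print(total_words(DESIGN["INFORMAL"]))                  # 150000
    print(total_words(DESIGN["FORMAL"]))                    # 150000
    print(total_words(DESIGN))                              # 300000
    print(4 * total_words(DESIGN))                          # 1200000
```

The last figure matches the overall size of around 1,200,000 words given for the four collections together.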
For each session, a rich set of metadata is delivered in CHAT format, ensuring multitask exploitation of the resource for linguistics and human language technologies. The metadata contain essential information regarding the speakers, the recording situation, the topic, the acoustic quality and the source of the collected data.
Corpora are orthographically transcribed in a standard textual format (CHAT format; MacWhinney, 1994), with annotation of speakers' turns. The textual string is divided into utterances. The main non-linguistic and paralinguistic acoustic events in the speech flow are reported in the transcripts.
The four Romance collections are completely tagged with respect to prosodic breaks. Terminal and non-terminal breaks are distinguished through perceptual judgments and marked in the transcripts. The level of inter-annotator agreement on prosodic tag assignment has been validated by an external institution.
The multimedia storage ensures a natural and meaningful text/sound correspondence for prosodic modeling, test-bed procedures and corpus-based studies of spontaneous speech.
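Because terminal breaks delimit utterances and non-terminal breaks delimit smaller prosodic units inside them, a tagged transcript line can be segmented with simple string operations. The sketch below assumes, purely for illustration, that a terminal break is written `//` and a non-terminal break `/`; the actual symbols and any further break types are defined in the corpus specifications and should be checked there. The function name `split_utterances` and the sample sentence are hypothetical.

```python
TERMINAL = "//"     # assumed marker for a terminal prosodic break
NON_TERMINAL = "/"  # assumed marker for a non-terminal prosodic break


def split_utterances(text):
    """Split one transcribed turn into utterances.

    Each utterance is returned as a list of prosodic units, obtained by
    cutting first at terminal breaks and then at non-terminal breaks.
    """
    utterances = []
    for chunk in text.split(TERMINAL):
        units = [unit.strip() for unit in chunk.split(NON_TERMINAL)]
        units = [unit for unit in units if unit]  # drop empty fragments
        if units:
            utterances.append(units)
    return utterances


if __name__ == "__main__":
    turn = "I went there / with my brother // it was late //"
    for units in split_utterances(turn):
        print(units)
```

Counting the inner lists gives the number of utterances in a turn, and counting their elements gives the number of prosodic units per utterance, which is the kind of measurement the break tagging is meant to support.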
Win Pitch Corpus is a software program for computer-aided alignment of large corpora. It provides a method for easy and precise selection of alignment units, ranging from syllables to whole sentences, in a hierarchical storage system for aligned data. The method relies on the user's ability to visually link a moving target with the perception of the corresponding speech sound, played back at a rate reduced by at least 30%.
Segments derived from alignment can be defined on 8 independent layers, with automatic generation of the corresponding database, which can be saved directly in both XML and Excel formats. Besides text-to-speech alignment, Win Pitch Corpus, which is Unicode compliant, has numerous features allowing easy and efficient acoustic analysis of speech, such as real-time fundamental frequency tracking, spectrographic display and re-synthesis after editing of prosodic parameters.
For more information: http://www.elda.org/en/proj/coralrom.html
(1) As defined in C-ORAL-ROM: comprising both formal and informal speech.
(2) ELDA does not take responsibility for software products supplied with the distributed resources. Pitch France is fully responsible for this software.
(3) Text length not defined (preferably an upper limit of 1,500 words; no lower limit).
(4) Field not present in the Portuguese corpus. The texts in this field are not delivered aligned to the acoustic source.