PortMedia French and Italian corpus


Corpus PortMedia français et italien

http://catalog.elra.info/product_info.php?products_id=1224,

http://catalog.elra.info/product_info.php?products_id=1225

ID: ELRA-S0371

The PortMedia French and Italian corpus was produced by ELDA, using the same paradigm and specifications as the MEDIA speech database (ELRA-S0272) but in a different domain.

The corpus was built with a ‘Wizard of Oz’ (WoZ) system, which simulates a natural-language human-machine dialogue. The scenario covers tourist information and reservation (ticket reservation for the 2010 Festival d’Avignon for French and hotel reservation for Italian).

The corpus contains 700 transcribed dialogues from about 140 French speakers and 604 transcribed dialogues from about 150 Italian speakers (several dialogues per speaker).

The database is formatted following the SpeechDat conventions and it includes the following items:
• 700 recorded sessions for French and 604 sessions for Italian. The signals are stored as stereo wave files. Each of the two speech channels is sampled at 8 kHz with 16-bit quantization and stored as signed integers with the least significant byte first (“lohi” or Intel format).
• Manual transcription of each session in HTML format. Label files were created with the free transcription tool Transcriber (TRS files).
• A manual semantic annotation of the corpus. It has been produced with Semantizer, which is also provided with the data.
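The audio format described above (stereo, 8 kHz, 16-bit signed, least significant byte first) can be read with Python's standard library alone. The following is a minimal sketch, not part of the distribution: the function names `split_stereo_pcm16` and `read_session` are hypothetical, and the catalogue entry does not say which channel carries which party of the dialogue, so the channels are simply returned in file order.

```python
import struct
import wave


def split_stereo_pcm16(frames: bytes) -> tuple[list[int], list[int]]:
    """Split interleaved 16-bit little-endian stereo PCM into two channels."""
    if len(frames) % 4 != 0:
        raise ValueError("expected whole stereo frames of 4 bytes each")
    # "<" selects little-endian ("lohi") byte order, "h" a signed 16-bit integer.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    # Samples are interleaved: channel 0, channel 1, channel 0, channel 1, ...
    return list(samples[0::2]), list(samples[1::2])


def read_session(path: str) -> tuple[int, list[int], list[int]]:
    """Return (sample_rate, channel_0, channel_1) for one stereo wave file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 2 channels of 16-bit samples")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    first, second = split_stereo_pcm16(frames)
    return rate, first, second
```

Calling `read_session` on a session file should report a sample rate of 8000 if the file matches the description; the explicit checks on channel count and sample width make a mismatch fail early rather than produce misaligned channels.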




Distribution

Availability

Available - Restricted Use

Start date: 23/07/2014

Licence

Licence | Restrictions | ELRA membership | User nature | Fee
ELRA END USER | Academic - Non Commercial Use | Non-member | Commercial | 25,000.00
ELRA EVALUATION | Evaluation Use | Non-member | Commercial | 6,500.00
ELRA VAR | Commercial Use | Member | Academic | 20,000.00
ELRA EVALUATION | Evaluation Use | Member | Academic | 1,000.00
ELRA VAR | Commercial Use | Non-member | Commercial | 25,000.00
ELRA EVALUATION | Evaluation Use | Member | Commercial | 1,000.00
ELRA END USER | Academic - Non Commercial Use | Member | Commercial | 20,000.00
ELRA END USER | Academic - Non Commercial Use | Member | Academic | 300.00
ELRA VAR | Commercial Use | Member | Commercial | 20,000.00
ELRA END USER | Academic - Non Commercial Use | Non-member | Academic | 2,000.00
ELRA VAR | Commercial Use | Non-member | Academic | 25,000.00
ELRA EVALUATION | Evaluation Use | Non-member | Academic | 6,500.00

Contact Person

Mapelli Valérie

audio

Monolingual audio corpus

Languages

Italian, French

Linguality

Linguality type: Monolingual

Size

no size available

Domains

tourism

Annotation

Other

Metadata

Created: 12/05/2005

Version

Version: 1.0

Last Updated: 23/07/2014

Usage

Actual Use - NLP Applications

Details: Tourism
