Phonetic Corpus of Estonian Spontaneous Speech v.1.0.0

Eesti keele spontaanse kõne foneetiline korpus v.1.0.0

http://www.keel.ut.ee/foneetikakorpus/

ID:

http://hdl.handle.net/11297/1-00-0000-0000-0000-0003-1

doi:10.15155/TY.000D

The aim of the corpus is to compile a large number of high-quality recordings of spontaneous Estonian and to segment them phonetically at several levels. The project started in autumn 2006.

The corpus totals approximately 60 hours of speech from 100 speakers of different age groups and with different dialectal and social backgrounds. Speakers are invited to participate in person, and they are aware of the purpose of the recordings.

Most of the recordings are made in a recording studio; some are made during fieldwork. The signal of each speaker is recorded on a separate channel. The speakers are placed about 3 meters apart to minimize the effect of overlaps. Head-set microphones are used for the fieldwork recordings. Recordings are saved as uncompressed PCM WAV files. Background information about the recordings is collected in a text file.
Segmentation and annotation files are saved as Praat TextGrid files and have the same filenames as the recordings they segment.
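
Because each TextGrid file shares its filename with the recording it segments, recordings and annotations can be matched by filename stem. The following is a minimal sketch of that matching in Python. It assumes a flat directory and the extensions `.wav` and `.TextGrid`; the function name `pair_recordings` and the directory layout are hypothetical and are not part of the corpus distribution.

```python
from pathlib import Path


def pair_recordings(corpus_dir):
    """Match each .wav file with the .TextGrid file that shares its stem.

    Returns (pairs, missing): pairs maps wav path -> TextGrid path,
    missing lists wav files that have no matching TextGrid.
    The extensions and the flat layout are assumptions for illustration.
    """
    root = Path(corpus_dir)
    pairs, missing = {}, []
    for wav in sorted(root.glob("*.wav")):
        grid = wav.with_suffix(".TextGrid")
        if grid.exists():
            pairs[wav] = grid
        else:
            missing.append(wav)
    return pairs, missing
```

Calling `pair_recordings("path/to/corpus")` returns the matched pairs together with any recordings that lack an annotation file, which can serve as a quick completeness check after download.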

Segmentation and annotation
Segmentation and annotation are done with the Praat program (www.praat.org). Recordings are segmented manually at several levels (an automatic segmentation program has also been developed and tested).
The following tiers are used:
-Words (in orthographic spelling),
-Phonemes (SAMPA adjusted for Estonian is used for transcription),
-Syllables (short – long, open – closed),
-Prosodic feet,
-Intonation phrases or inter-pausal units,
-Changes in voice quality (e.g. creaky voice).
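
To see which tiers a given annotation file contains, the tier names can be read straight from the TextGrid text. The sketch below assumes the file is stored in Praat's long text format, where each tier carries a `name = "..."` line, and that it has already been read into a string (for example with `Path(p).read_text(encoding="utf-8")`). The tier names in `SAMPLE` are hypothetical; the actual tier names used in the corpus are not stated here.

```python
import re

# In Praat's long text format every tier has a line of the form: name = "..."
TIER_NAME = re.compile(r'^\s*name\s*=\s*"(.*)"\s*$', re.MULTILINE)


def tier_names(textgrid_text):
    """Return the tier names of a long-format TextGrid, in file order."""
    return TIER_NAME.findall(textgrid_text)


# Hypothetical two-tier TextGrid, used only to illustrate the format.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "tere"
    item [2]:
        class = "IntervalTier"
        name = "phonemes"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "t"
'''

print(tier_names(SAMPLE))  # ['words', 'phonemes']
```

This reads only tier names. For interval boundaries and labels, a dedicated TextGrid parser or Praat itself is the safer choice.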

Distribution

Availability

Available - Restricted Use

Licence

CLARIN RES

Execution location: hidden

Distribution Access/Medium: Accessible Through Interface

Contact Person

Pärtel Lippus

text
audio

Monolingual text corpus

Languages

Estonian

Linguality

Linguality type: Monolingual

Size

450 000 Words

Character encoding

UTF-8

Time Coverage

2006-2015

Monolingual audio corpus

Languages

Estonian

Linguality

Linguality type: Monolingual

Size

60 Hours

Modalities

Spoken Language

Annotation

Segmentation

Annotated elements: Discourse Markers

Segmentation level: Word

Format: orthography

Annotation Mode: Manual

Segmentation

Segmentation level: Word

Segmentation

Segmentation level: Phoneme

Format: SAMPA

Annotation Mode: Manual

Content

Speech items: Free Speech

Noise Level: Low

Setting

Naturality: Spontaneous

Conversational type: Dialogue

Audience: No

Interactivity: Interactive

Audio Formats

wav

Recording quality: High

Quantization: 16 bits

Number of tracks: 1

Sampling rate: 44100 Hz

Signal encoding: LinearPCM
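
The audio parameters listed above (16-bit quantization, 44100 Hz sampling rate, linear PCM) can be checked against a file header with Python's standard `wave` module. This is a minimal sketch; the function name `check_wav_format` is hypothetical, and the channel count is not checked here. `wave.open` raises `wave.Error` for WAV files whose encoding it does not support, such as compressed formats.

```python
import wave


def check_wav_format(path, expected_rate=44100, expected_bits=16):
    """Compare a WAV header with the expected sampling rate and bit depth.

    Returns a list of human-readable mismatches; an empty list means
    the file matches. Default values follow the catalog entry.
    """
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    if rate != expected_rate:
        problems.append(f"sampling rate {rate} Hz, expected {expected_rate} Hz")
    if bits != expected_bits:
        problems.append(f"quantization {bits} bits, expected {expected_bits} bits")
    return problems
```

An empty result from `check_wav_format("recording.wav")` indicates that the header matches the catalog values; any other result lists what differs.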

Resource Creation

Resource Creator

Metadata

Created: 09/01/2013

Last Updated: 02/12/2016

Metadata Creator

Neeme Kahusk

Krista Liin

Version

Version: 1.0.0
