Estonian Dialect Corpus – META-SHARE

Last view: 2026-07-03


Estonian Dialect Corpus


Eesti murdekorpus

http://www.murre.ut.ee/estonian-dialect-corpus/

ID:

http://hdl.handle.net/11297/1-00-0000-0000-0000-0002-A

doi:10.15155/TY.0007

The dialect corpus consists of:

1) Dialect recordings. The corpus is based on dialect recordings made mainly in the 1960s and 1970s; the earliest date from 1938. They are traditional dialect recordings in which the interview is conducted at the informant's home.

2) Phonetically transcribed texts. The traditional Finno-Ugric phonetic transcription is used. The texts are available as Word and PDF files (as of 1 May 2011, the corpus contained about 1,284,000 text words).

3) Dialect texts in simplified transcription. All of the phonetically transcribed texts have been converted one-to-one into the simplified transcription (.txt), so the texts can be opened in any program and used for preliminary analyses.

4) Morphologically tagged texts, which have been loaded into a MySQL database. All word classes and morphological forms are tagged.

5) A database containing information about informants and recordings.

6) Syntactically parsed texts (about 40,000 text words).

In the corpus, every phonetically transcribed text is accompanied by a recording, a file in simplified transcription and a description; more than half of the texts are also accompanied by a morphologically tagged file.
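Because the simplified-transcription texts are plain .txt files, a preliminary analysis such as a word-frequency count needs only a general-purpose language. The following is a minimal Python sketch under stated assumptions, not part of the corpus tooling: the helper names, the UTF-8 encoding, the whitespace tokenization and the sample sentence are all hypothetical, and the actual file layout of the corpus may call for different handling.

```python
from collections import Counter
from pathlib import Path

# Punctuation stripped from token edges; an assumption, since the actual
# simplified transcription may use some of these marks meaningfully.
EDGE_PUNCTUATION = ".,;:!?()[]\"'"


def word_frequencies(text: str) -> Counter:
    """Count whitespace-separated tokens, ignoring case and edge punctuation."""
    counts = Counter()
    for raw in text.split():
        token = raw.strip(EDGE_PUNCTUATION).lower()
        if token:
            counts[token] += 1
    return counts


def file_word_frequencies(path: str, encoding: str = "utf-8") -> Counter:
    """Count tokens in one text file (the encoding is an assumption)."""
    return word_frequencies(Path(path).read_text(encoding=encoding))


if __name__ == "__main__":
    # Invented sample sentence, not taken from the corpus.
    sample = "ma olin siis väike, ma olin kodus"
    print(word_frequencies(sample).most_common(2))
```

For a real file, `file_word_frequencies` would be called with the path to a downloaded .txt text; the tokenization rule should be checked against the transcription conventions before any counts are interpreted.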

Some data from other Finnic languages spoken around Estonia have also been added. The aim is to incorporate at least Votic, Ingrian and Livonian data into the corpus.


korpus


Distribution

Availability

Available - Restricted Use

Licence

CLARIN ACA

Execution location: hidden

User Nature: Academic

Distribution Access/Medium: Accessible Through Interface

Contact Person

Liina Lindström

text
audio

Monolingual text corpus

Languages

Estonian

Linguality

Linguality type: Monolingual

Size

1,284,000 Words

Modalities

Spoken Language

Monolingual audio corpus

Languages

Estonian

Linguality

Linguality type: Monolingual

Size

113 Hours

Modalities

Spoken Language

Content

Speech items: Free Speech

Metadata

Created: 09/01/2013

Last Updated: 22/05/2015

Revision: 6

Metadata Creator
