Estonian gap tests

The Estonian gap tests corpus is a collection of sentences in which one word is marked as a "gap", each accompanied by a list of candidate words. The corpus can be used as a benchmark for evaluating language models. It covers both frequent and infrequent gap-words and includes candidate lists generated in different ways. The sentences originate from the Estonian Reference Corpus (http://www.cl.ut.ee/korpused/segakorpus/). The corpus has been tokenized using the Estnltk toolkit (https://github.com/estnltk/estnltk).

The archive contains sentence files with the extension ".gaps" and candidate files with the extension ".var". Each sentence file contains one sentence per line. Each line starts with an integer that indicates the gap-word's offset in the sentence. Offsets are zero-based: the first word in the sentence has position zero. Based on the frequency of the gap-word, we generated four kinds of sentence files:

File name            Gap-word frequency
--------------------------------------------------------------
test.all.gaps        any frequency
test.freq.gaps       frequent word form
test.inf_freq.gaps   infrequent word form, frequent word type (lemma)
test.inf_inf.gaps    infrequent word form, infrequent word type (lemma)
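
The sentence-file layout described above can be read with a few lines of Python. This is a minimal sketch and not part of the corpus distribution. It assumes that the leading integer is separated from the sentence by whitespace and that tokens are whitespace-separated; both should be checked against the actual files. The function names parse_gap_line and split_at_gap and the sample sentence are made up for illustration.

```python
from typing import List, Tuple


def parse_gap_line(line: str) -> Tuple[int, List[str]]:
    """Split one line of a .gaps file into (gap offset, sentence tokens).

    Assumed layout (not confirmed by the corpus description):
    "<offset> <token> <token> ...", all separated by whitespace.
    """
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"expected '<offset> <sentence>', got: {line!r}")
    offset = int(parts[0])
    tokens = parts[1].split()
    # The offset is zero-based, so it must index an existing token.
    if not 0 <= offset < len(tokens):
        raise ValueError(f"offset {offset} outside sentence of {len(tokens)} tokens")
    return offset, tokens


def split_at_gap(offset: int, tokens: List[str]) -> Tuple[List[str], str, List[str]]:
    """Return (left context, gap word, right context)."""
    return tokens[:offset], tokens[offset], tokens[offset + 1:]


# Made-up example line: offset 2 points at the third token.
offset, tokens = parse_gap_line("2 Ma lähen kooli .")
left, gap_word, right = split_at_gap(offset, tokens)
print(left, gap_word, right)
```

Splitting the sentence into left context, gap word, and right context keeps the parsing step independent of whichever language model is evaluated later.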

Each sentence file has multiple related candidate files. In a candidate file, each line contains a list of 200 candidate words for the sentence on the same line of the related sentence file.
Candidate files were generated using the same frequency ranges as the sentence files. We provide four kinds of candidate files:

File suffix     Explanation
--------------------------------------------------------------------------------------
*.pos.var       candidates with the same part of speech as a gap-word
*.syn.var       candidates generated with a morphological generator based on the base form of a gap-word
*.w2v.var       candidate words from word2vec's most similar query
*.random.var    random words
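
Because line i of a candidate file belongs to line i of its sentence file, a benchmark run can walk both files in parallel. The sketch below is one possible way to do this, not a prescribed evaluation protocol. It assumes whitespace-separated candidates and the "<offset> <sentence>" line layout. The description does not say whether the gap word itself appears in the candidate list, so the sketch removes it from the candidates before ranking. The names gap_word_ranks, Scorer, and toy_score are hypothetical, and toy_score is only a placeholder for a real language model.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical scoring interface: (left context, word, right context) -> score.
# A higher score means the model finds the word more plausible in the gap.
Scorer = Callable[[Sequence[str], str, Sequence[str]], float]


def gap_word_ranks(gap_lines: Iterable[str],
                   cand_lines: Iterable[str],
                   score: Scorer) -> List[int]:
    """Return the 1-based rank of the true gap word for each sentence."""
    ranks = []
    # Line i of the candidate file belongs to line i of the sentence file.
    for gap_line, cand_line in zip(gap_lines, cand_lines):
        # Assumed layout: "<offset> <token> <token> ..."
        offset_text, sentence = gap_line.strip().split(None, 1)
        tokens = sentence.split()
        offset = int(offset_text)
        left, gap_word, right = tokens[:offset], tokens[offset], tokens[offset + 1:]

        # Assumed layout: whitespace-separated candidates. Drop duplicates and
        # the gap word itself in case it appears in the list.
        rivals = set(cand_line.split()) - {gap_word}

        gap_score = score(left, gap_word, right)
        # Rank = 1 + number of candidates scored strictly higher
        # (ties are resolved in favour of the gap word).
        better = sum(1 for word in rivals if score(left, word, right) > gap_score)
        ranks.append(1 + better)
    return ranks


def toy_score(left: Sequence[str], word: str, right: Sequence[str]) -> float:
    """Placeholder scorer that prefers longer words; replace with a real model."""
    return float(len(word))


# Made-up one-line "files": gap word "bb", candidates "x", "yyy", "bb".
ranks = gap_word_ranks(["1 a bb c"], ["x yyy bb"], toy_score)
accuracy = sum(r == 1 for r in ranks) / len(ranks)
print(ranks, accuracy)
```

Open file objects can be passed directly as gap_lines and cand_lines, for example a sentence file such as test.all.gaps together with one of its related *.var files. The file encoding is not stated in the description and would need to be checked when opening the files.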
