LX-Rare Word Similarity Dataset – META-SHARE

Last view: 2026-06-25

LX-Rare Word Similarity Dataset

The LX-Rare Word Similarity Dataset was created from the Stanford Rare Word (RW) Similarity dataset (Luong et al., 2013). The list contains 2 034 words (1 017 word pairs). All the words were extracted from Wikipedia and from WordNet (Miller, 1995), a lexical database in which concepts are grouped into sets of synonyms.
The list was constructed as follows: a) first, a list of rare words was selected from Wikipedia; b) then, each rare word was paired with a related word picked from WordNet. Rare words are those that have between 5 000 and 10 000 occurrences in Wikipedia.
The result is a set of word pairs in which one word is rare and the other, which may or may not be rare, is related to the first by some WordNet relation: it can be a hyponym, hyperonym, meronym, holonym or attribute of the former.
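
The two-step procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy frequency counts and the toy relation table are all hypothetical, the frequency band is assumed to be inclusive at both ends, and a plain dictionary stands in for a real WordNet lookup.

```python
import random

# Frequency band for "rare" words, taken from the description above.
# Treating both bounds as inclusive is an assumption.
RARE_MIN, RARE_MAX = 5_000, 10_000


def select_rare_words(freqs):
    """Step a): return the words whose frequency falls in the rare band."""
    return sorted(w for w, n in freqs.items() if RARE_MIN <= n <= RARE_MAX)


def build_pairs(freqs, relations, seed=0):
    """Step b): pair each rare word with one related word.

    `relations` maps word -> {relation name: [related words]}. It is a toy
    stand-in for a WordNet lookup, not the real resource.
    """
    rng = random.Random(seed)
    pairs = []
    for word in select_rare_words(freqs):
        candidates = [
            (rel, other)
            for rel, others in sorted(relations.get(word, {}).items())
            for other in others
        ]
        if not candidates:
            continue  # no related word available, so skip this rare word
        rel, other = rng.choice(candidates)
        pairs.append((word, other, rel))
    return pairs


# Made-up inputs, for illustration only (not taken from the dataset).
toy_freqs = {"casa": 250_000, "alpendre": 7_200, "bigorna": 5_400, "xisto": 900}
toy_relations = {
    "alpendre": {"hyperonym": ["estrutura"], "holonym": ["casa"]},
    "bigorna": {"hyperonym": ["ferramenta"]},
}

pairs = build_pairs(toy_freqs, toy_relations)
for rare, related, relation in pairs:
    print(rare, related, relation)
```

In this sketch the second word of a pair is not filtered by frequency, which mirrors the statement that the related word can be rare or not.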

Distribution

Availability

Available - Restricted Use

Contact Person

António Branco

text

Monolingual text corpus

Languages

Portuguese

Linguality

Linguality type: Monolingual

Size

2,034 Words

Modalities

Written Language

Metadata

Created: 30/01/2017

Last Updated: 30/01/2017

Metadata Language: English (en)

Version

Version: 1.0

Last Updated: 30/01/2017
