Bulgarian Part-of-Speech Corpus

BulPosCor

http://dcl.bas.bg/poscor/en/

ID: 803

The Bulgarian Part-of-Speech Corpus (BulPosCor) is derived from the Brown Corpus of Bulgarian; it was automatically annotated with PoS tags and then manually disambiguated. The corpus for annotation was built by selecting portions of 150+ words from each sample in the Brown Corpus of Bulgarian. The automatic grammatical annotation used the Bulgarian Grammar Dictionary, which contains about 85 000 words and over 1.5 million word forms specified with grammatical characteristics.

Disambiguation was performed by human experts, who assigned the correct PoS tag to each ambiguous token out of its two or more possible tags. A number of annotation principles were outlined to ensure a uniform approach to the annotation. The result is a PoS-disambiguated corpus of 217 210 tokens: 172 482 single words, 42 058 punctuation marks and 2 670 numbers.
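The disambiguation step described above can be pictured with a minimal sketch. The `Token` class, the `disambiguate` helper and the tag names used with them are hypothetical illustrations, not the corpus's actual data format or tagset; the size constants are the figures reported in the description.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; not the actual BulPosCor format."""
    form: str                      # surface form as it appears in the text
    candidates: FrozenSet[str]     # tags proposed by the dictionary lookup
    chosen: Optional[str] = None   # tag selected by the annotator, if any

    @property
    def is_ambiguous(self) -> bool:
        # A token needs manual disambiguation when it has 2+ candidate tags.
        return len(self.candidates) > 1


def disambiguate(token: Token, tag: str) -> Token:
    """Return a copy of the token with one of its candidate tags selected."""
    if tag not in token.candidates:
        raise ValueError(f"{tag!r} is not a candidate tag for {token.form!r}")
    return Token(token.form, token.candidates, tag)


# Size breakdown as reported in the description.
SINGLE_WORDS = 172_482
PUNCTUATION_MARKS = 42_058
NUMBERS = 2_670
TOTAL_TOKENS = SINGLE_WORDS + PUNCTUATION_MARKS + NUMBERS
```

The three reported categories add up to the stated corpus size of 217 210 tokens.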

The chief intended application of the Bulgarian Tagged Corpora is to serve as a test and/or training dataset for PoS disambiguation.
The tagged corpus also enables efficient online search for language patterns and forms.
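As an illustration of the test-dataset use, the sketch below scores a tagger's output against manually disambiguated tags by per-token accuracy. The function name and the tag labels in the usage example are assumptions made for illustration; the record does not specify an evaluation procedure or the corpus tagset.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of tokens whose predicted tag matches the gold (manual) tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

For example, `tagging_accuracy(["N", "V", "N", "P"], ["N", "V", "V", "P"])` gives 0.75, since three of the four hypothetical tags agree.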

Distribution

Availability

Available - Restricted Use

Start date: 20/11/2011

Licence

Other

Restrictions: Academic - Non Commercial Use

Execution location: hidden

Distribution Access/Medium: Accessible Through Interface

IPR Holder

Institute for Bulgarian Language

Contact Person

Ivelina Stoyanova

text

Monolingual text corpus

Languages

Bulgarian

Linguality

Linguality type: Monolingual

Size

217,000 Tokens

Character encoding

UTF-8

Modalities

Written Language

Annotation

Segmentation

Segmentation level: Sentence

Morphosyntactic Annotation - PoS Tagging

Segmentation level: Word

Lemmatization

Segmentation level: Word

Resource Creation

Resource Creator

Institute for Bulgarian Language

Funding Project

Bulgarian National Corpus project

URL: http://ibl.bas.bg/en...

Funding Type: National Funds

Funding Country: Bulgaria

Project duration: 17/12/2009 - 17/06/2013

Central and South-East European Resources (CESAR)

URL: http://cesar.nytud.hu/

Funding Types: EU Funds, Own Funds

Project duration: 01/02/2011 - 30/01/2013

Metadata

Created: 20/11/2011

Last Updated: 31/01/2013

Version

Version: 1.0

Last Updated: 20/11/2011

Validation

Validated

Usage

Foreseen Use

NLP Applications

Human Use

Actual Use - NLP Applications

Actual Use - Human Use

Documentation

Koeva, Svetla, Svetlozara Leseva, Ivelina Stoyanova, Ekaterina Tarpomanova, Maria Todorova. Bulgarian Tagged Corpora. – In: Proceedings of the Fifth International Conference Formal Approaches to South Slavic and Balkan Languages, 18-20 October 2006, Sofia, Bulgaria, 2006, pp. 78-86.

Koeva, Svetla, Diana Blagoeva, Sia Kolkovska. Levels of annotation in the Bulgarian National Corpus. – Prace Filologiczne, 2012, LXIII, pp. 147-153. ISSN: 0138-0567.
