Bulgarian National Corpus



BulNC

http://ibl.bas.bg/en/BGNC_en.htm

http://search.dcl.bas.bg/

ID:

801

The Bulgarian National Corpus (BulNC) is a large, representative, publicly available corpus. It is designed as a uniform framework for texts that differ in modality (written and spoken), period, and number of languages (monolingual and parallel).
The core of the corpus incorporates several electronic corpora developed between 2001 and 2009, and it has been substantially expanded in the years since. The corpus reflects the state of the Bulgarian language (mainly in its written form) from 1945 until the present.
The enlargement of the BulNC has involved not only the collection of Bulgarian texts, but also the compilation of parallel corpora with Bulgarian as a pivot language. Every text in another language must have a Bulgarian counterpart in the Bulgarian part of the corpus.
Currently, the corpus core consists of over 1.2 billion words and about 240,000 texts. So far 47 foreign languages have been included, totalling about 4.2 billion words. Thus, the overall size of the corpus exceeds 5.4 billion words.
All texts are supplied with an extensive metadata description compliant with established standards. The corpus is supplied with three levels of annotation:
• A detailed metadata description: each text is supplied with editorial (author's name, text title, source, etc.) and classificatory metadata (general category, domain, genre).
• Monolingual annotation: tokenisation, sentence splitting, POS tagging, lemmatisation, word sense annotation.
• Multilingual annotation: alignment at different levels, currently sentence and clause level.
The tagset used in the annotation of the BulNC is available as the Bulgarian tagset.
The Bulgarian part and the Bulgarian-English parallel corpus are tokenised, sentence-split, POS tagged and lemmatised; the Bulgarian part is also word sense annotated. For the time being, the corpora for the other languages are tokenised, sentence-split and aligned.
A dedicated corpus search system allows complex queries to be performed. A set of tools was developed for extracting the metadata and compiling the corpus description from the markup formats. The metadata are as detailed as possible in order to support text classification, corpus evaluation, derivation of subcorpora based on a set of criteria (e.g. publishing year, domain), and similar tasks.
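The derivation of subcorpora from metadata criteria can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `TextMetadata` class, its field names, the `select_subcorpus` helper, and the placeholder records are hypothetical and do not describe the actual BulNC tools or data.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TextMetadata:
    # Hypothetical field names mirroring the metadata kinds named in the description.
    title: str
    author: str
    year: int      # publishing year
    domain: str
    genre: str


def select_subcorpus(
    records: Iterable[TextMetadata],
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    domain: Optional[str] = None,
) -> List[TextMetadata]:
    """Return the records that satisfy every criterion that was supplied."""
    selected = []
    for record in records:
        if year_from is not None and record.year < year_from:
            continue
        if year_to is not None and record.year > year_to:
            continue
        if domain is not None and record.domain != domain:
            continue
        selected.append(record)
    return selected


# Placeholder records, invented purely for illustration.
sample = [
    TextMetadata("Text A", "Author A", 1987, "fiction", "novel"),
    TextMetadata("Text B", "Author B", 2005, "science", "article"),
    TextMetadata("Text C", "Author C", 2011, "science", "textbook"),
]

recent_science = select_subcorpus(sample, year_from=2000, domain="science")
print([r.title for r in recent_science])
```

Criteria left as `None` are simply not applied, so the same helper covers a filter on year alone, on domain alone, or on both.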
The Bulgarian National Corpus Collocation service (http://dcl.bas.bg/collocations/?cmd=collocations&word=%D0%BD%D0%B5%D1%82) gives access to the Bulgarian National Corpus. The service is built on the free-of-charge NoSketchEngine, a corpus processing system that combines Manatee and Bonito. It is a RESTful web service that supports complex queries over HTTP. Example: http://dcl.bas.bg/collocations/?cmd=collocations&word=нет
user: bulnc
pass: bulnc
The query returns the collocations of the given word in the NoSketchEngine format.
The service also accepts all additional arguments supported by NoSketchEngine, each of which has a default value.
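As a minimal sketch of how such a query URL can be assembled, the snippet below percent-encodes the Cyrillic word as UTF-8 and appends it to the endpoint shown in the example. The helper name `collocations_url` and its `extra` keyword arguments are hypothetical; the names of the additional NoSketchEngine arguments are not listed in this record, so none are assumed here. The sketch only builds the URL and sends no request, because the record does not say how the user name and password are to be supplied.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the example URL in the description.
BASE_URL = "http://dcl.bas.bg/collocations/"


def collocations_url(word, **extra):
    """Build a collocation query URL for `word`.

    `extra` is a placeholder for additional NoSketchEngine arguments;
    any key passed here is the caller's assumption, not a documented name.
    """
    params = {"cmd": "collocations", "word": word}
    params.update(extra)
    # urlencode percent-encodes non-ASCII text as UTF-8 by default.
    return BASE_URL + "?" + urlencode(params)


url = collocations_url("нет")
print(url)

# Decoding the query string recovers the original word.
print(parse_qs(urlsplit(url).query)["word"][0])
```

The printed URL is the percent-encoded form of the example query; a browser usually displays the same address with the Cyrillic word decoded.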


Distribution

Availability

Available - Restricted Use

Start date: 01/02/2008

Licence

Other

Restrictions: Academic - Non Commercial Use

Execution location: hidden

Distribution Access/Medium: Web Executable

Other

Restrictions: Academic - Non Commercial Use

Execution location: hidden

Distribution Access/Medium: Accessible Through Interface

IPR Holder

Institute for Bulgarian Language

Contact Person

Ivelina Stoyanova

text

Monolingual text corpus

Languages

Bulgarian

Linguality

Linguality type: Monolingual

Size

1,202,209,147 Tokens

Character encoding

UTF-8

Modalities

Written Language

Annotation

Segmentation

Segmentation level: Paragraph, Sentence, Word

Lemmatization

Segmentation level: Word

Semantic Annotation

Segmentation level: Word

Morphosyntactic Annotation - Pos Tagging

Segmentation level: Word

Semantic Annotation - Word Senses

Segmentation level: Word

Resource Creation

Resource Creator

Institute for Bulgarian Language

Funding Project

Bulgarian National Corpus project

URL: http://ibl.bas.bg/en...

Funding Type: National Funds

Funding Country: Bulgaria

Project duration: 17/12/2009 - 17/06/2013

Central and South-East European Resources (CESAR)

URL: http://cesar.nytud.hu/

Funding Types: Eu Funds, Own Funds

Project duration: 01/02/2011 - 30/01/2013

Metadata

Created: 20/11/2011

Last Updated: 01/02/2013

Version

Version: 5.0

Last Updated: 20/01/2013

Validation

Validated

Usage

Access tools

http://dcl.bas.bg/Bu...

Foreseen Use

Nlp Applications

Human Use

Actual Use - Nlp Applications

Actual Use - Human Use

Documentation

Tool Documentation: Help Functions, Manual, Online

Koeva, Svetla, Diana Blagoeva, Sia Kolkovska. Levels of annotation in the Bulgarian National Corpus. – Prace Filologiczne, 2012, LXIII, pp. 147-153. ISSN: 0138-0567.

Blagoeva, Diana, Sia Kolkovska, Nadezhda Kostova, Cvetelina Georgieva. The Bulgarian National Corpus and its application in Bulgarian academic lexicography. – Prace Filologiczne, 2012, LXIII, pp. 37-49. ISSN: 0138-0567.

Koeva, Svetla, Ivelina Stoyanova, Svetlozara Leseva, Tsvetana Dimitrova, Rositsa Dekova, Ekaterina Tarpomanova. The Bulgarian National Corpus: Theory and practice in corpus design. – Journal of Language Modelling, 2012, 1 (1), pp. 65-110. ISSN: 2299-8470.

Koeva, Svetla, Angel Genov. Bulgarian language processing chain. In Proceedings of Integration of Multilingual Resources and Tools in Web Applications. Proceedings of a Workshop in conjunction with GSCL 2011, University of Hamburg, 2011.
