The Bulgarian-English Sentence- and Clause-Aligned Corpus

BulEnAC

http://dcl.bas.bg/en/clauseAlignedCorpus_en.html

ID: 828

The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English Parallel Corpus, which is part of the Bulgarian National Corpus (BulNC). The Bulgarian-English Parallel Corpus has been processed at several levels: tokenisation, sentence splitting, and lemmatisation. The Bulgarian part was processed with the Bulgarian language processing chain, and the English part with Apache OpenNLP and Stanford CoreNLP.

The BulEnAC consists of 176,397 tokens for Bulgarian and 190,468 for English (366,865 tokens altogether). It contains 30,385 sentences: 14,667 Bulgarian sentences (12.02 words per sentence on average) and 15,718 English sentences (12.11 words per sentence on average). The average number of clauses per sentence is 1.67 in the Bulgarian part and 1.85 in the English part.
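The totals and averages quoted above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: it assumes that "words per sentence" means tokens divided by sentences, and the variable names are illustrative, not part of the corpus distribution.

```python
# Figures quoted in the corpus description (Bulgarian = "bg", English = "en").
tokens = {"bg": 176_397, "en": 190_468}
sentences = {"bg": 14_667, "en": 15_718}

# Corpus-wide totals.
total_tokens = sum(tokens.values())
total_sentences = sum(sentences.values())

# Assumption: average sentence length = tokens / sentences for each language.
tokens_per_sentence = {lang: tokens[lang] / sentences[lang] for lang in tokens}

print("total tokens:", total_tokens)
print("total sentences:", total_sentences)
for lang, avg in tokens_per_sentence.items():
    print(f"{lang}: {avg:.3f} tokens per sentence")
```

Under that assumption the sums reproduce the stated 366,865 tokens and 30,385 sentences, and the computed averages agree with the quoted 12.02 and 12.11 to within 0.01.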

The texts are distributed over five broad categories, called 'styles': administrative, fiction, science, journalism, and subtitles. The corpus is represented in XML format and carries several layers of linguistic annotation: monolingual annotation for both Bulgarian and English (sentence splitting, tokenisation, lemmatisation, POS and grammatical tagging) and parallel annotation (sentence and clause alignment).

Distribution

Availability

Available - Restricted Use

Licence

Other, Under Negotiation

Restrictions: Other

User Nature: Academic

Download location: hidden

Distribution Access/Medium: Downloadable

Execution location: hidden

Contact Person

Svetla Koeva

text

Bilingual text corpus

Languages

English (190,468 Tokens)

Language Script: Latn

Bulgarian (176,397 Tokens)

Language Script: Cyrl

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

30,385 Sentences

Modalities

Written Language

Annotation

Segmentation

Segmentation level: Clause, Paragraph, Sentence, Word

Semantic Annotation - Word Senses

Segmentation level: Word

Morphosyntactic Annotation - POS Tagging

Segmentation level: Word

Alignment

Segmentation level: Clause, Sentence

Resource Creation

Resource Creator

Institute for Bulgarian Language

Funding Project

Bulgarian National Corpus project (BulNC)

Funding Type: National Funds

Central and South-East European Resources (CESAR)

URL: http://cesar.nytud.hu

Funding Type: EU Funds

Project duration: 01/02/2011 - 30/01/2013

Metadata

Created: 30/01/2013

Last Updated: 01/02/2013

Version

Version: 1.0

Documentation

Tool Documentation: Online

Koeva, Svetla, Borislav Rizov, Ekaterina Tarpomanova, Tsvetana Dimitrova, Rositsa Dekova, Ivelina Stoyanova, Svetlozara Leseva, Hristina Kukova, Angel Genov. Bulgarian-English Sentence- and Clause-Aligned Corpus. In: Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), Lisbon: Edições Colibri, 2012, pp. 51-62.
