Polish Coreference Corpus

http://zil.ipipan.waw.pl/PolishCoreferenceCorpus

ID: 438

The Polish Coreference Corpus (PL: Polski Korpus Koreferencyjny) is a result of the "Computer-based methods for coreference resolution in Polish texts" project. It contains short fragments (250-350 segments each) of texts randomly selected (preserving the original text type balance) from the full version of the National Corpus of Polish. These fragments are manually annotated with identity coreferential chains and quasi-identity relations. The corpus is supplied in two XML-based formats: MMAX and TEI. It contains automatic morphosyntactic annotation; the TEI version also includes automatic named entity and shallow parsing annotations.

Distribution

Availability

Available - Unrestricted Use

Licence

CC-BY

Fee: free of charge

Download location: hidden

Distribution Access/Medium: Downloadable

Contact Person

Mateusz Kopeć

text

Monolingual text corpus

Languages

Polish

Linguality

Linguality type: Monolingual

Size

503,985 Tokens

Character encoding

UTF-8

Modalities

Spoken Language, Written Language

Annotation

Semantic Annotation - Named Entities

StandOff: True

Segmentation level: Word Group

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (named entities (person names, organizations, locations compatible with NKJP hierarchy) detected by Nerf)

Annotation Tools:

Nerf, a named entity recognizer for Polish

Start date: 01/01/2012

Segmentation

StandOff: True

Segmentation level: Sentence, Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Mixed (When required for the purpose of coreference annotation, sentence and word segmentation output by Pantera was corrected manually)

Annotation Tools:

Pantera, a Brill tagger for Polish

Start date: 01/01/2012

Lemmatization

StandOff: False

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (lemma variants (all available interpretations) output by Morfeusz, then disambiguated by Pantera tagger)

Annotation Tools:

Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish
Pantera, a Brill tagger for Polish

Start date: 01/01/2012

Morphosyntactic Annotation - Pos Tagging

Tagset: NKJP tagset (Polish)

StandOff: True

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (MSD and POS tag variants (all available morphosyntactic interpretations) output by Morfeusz, then disambiguated by Pantera tagger)

Annotation Tools:

Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish
Pantera, a Brill tagger for Polish

Start date: 01/01/2012

Semantic Annotation - Entity Mentions

StandOff: True

Segmentation level: Word Group

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Mixed (manual annotation with automatic preannotation)

Start date: 01/01/2012

Discourse Annotation - Coreference

StandOff: True

Segmentation level: Other

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Mixed (manual annotation with automatic preannotation)

Start date: 01/01/2012

Structural Annotation

StandOff: True

Segmentation level: Word

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (syntactic words (word-like compounds) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details)

Annotation Tools:

Spejd, a shallow parser of Polish

Start date: 01/01/2012

Syntactic Annotation - Shallow Parsing

StandOff: True

Segmentation level: Word Group

Format: text/xml

Standard practices conformance: TEI

Annotation Mode: Automatic (syntactic groups (phrase-like constructs) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details)

Annotation Tools:

Spejd, a shallow parser of Polish

Start date: 01/01/2012

Creation

Creation mode: Mixed

Original Sources

http://nkjp.pl

Creation Tools

Spejd, a shallow parser of Polish
Nerf, a named entity recognizer for Polish
Pantera, a Brill tagger for Polish
Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish

Resource Creation

Creation started: 01/05/2011

Funding Project

Computer-based methods for coreference resolution in Polish texts (CORE)

URL: http://zil.ipipan.wa...

Funding Type: National Funds

Funder: National Science Centre (100%)

Funding Country: Poland

Project duration: 18/04/2011 - 17/04/2014

Metadata

Created: 08/01/2013

Last Updated: 22/01/2013

Metadata Creator

Mateusz Kopeć

Version

Version: 0.5

Last Updated: 08/01/2013
