Polish Sejm Corpus

PSC

ID: 401

The Polish Sejm Corpus contains annotated utterances of members of the Polish Sejm from terms of office 1-6 (1991-2011). Corpus files contain information about text segmentation (paragraphs, sentences, tokens), disambiguated morphosyntactic description (lemma, POS tag, MSD tag), syntactic description (syntactic words and groups) and named entities (person names, locations, organizations).

The data is a valuable source of linguistic information: it is a large (100 M segments) collection of quasi-spoken content and forms the basis for the audio/video recordings of sessions, which started in 2011 and are planned to be added to the corpus successively.

  • Nerf
  • scripts developed internally
  • Pantera
  • scripts developed internally
  • Morfeusz SGJP
  • Pantera
  • Morfeusz SGJP
  • Morfeusz SGJP
  • Spejd
  • Spejd
  • scripts developed internally
  • Sprawozdanie Stenograficzne. Kancelaria Sejmu Rzeczypospolitej Polskiej, ul. Wiejska 4/6/8, 00-902, Warszawa, Poland. Wydawnictwo Sejmowe, 1991-2011. ISSN 08672768. http://www.sejm.gov.pl
  • Spejd, a shallow parser of Polish
  • Pantera, a Brill tagger for Polish
  • Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish
  • Nerf, a named entity recognizer for Polish