Quaero Old Press Extended Named Entity corpus
Corpus Quaero de presse ancienne étendu en entités nommées
ID: ELRA-W0073
The Quaero Old Press Extended Named Entity corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages.
The corpus is fully manually annotated according to the Quaero extended and structured named entity definition, which differentiates entity "types" and "components". The training part of the corpus is composed of 231 pages and contains 1,297,742 words, 114,599 types and 136,113 components. The test corpus is composed of 64 pages and contains 363,455 words, 33,083 types and 40,432 components.
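As a quick consistency check on the split statistics above, the following minimal sketch sums the training and test figures and computes the share of each split. The counts are copied from the description; the dictionary layout and the function names (`totals`, `share`) are illustrative assumptions and are not part of the distributed corpus or its tools.

```python
# Split statistics copied from the corpus description.
# The structure and names below are hypothetical, for illustration only.
SPLITS = {
    "train": {"pages": 231, "words": 1_297_742, "types": 114_599, "components": 136_113},
    "test": {"pages": 64, "words": 363_455, "types": 33_083, "components": 40_432},
}


def totals(splits):
    """Sum every statistic over all splits."""
    keys = next(iter(splits.values())).keys()
    return {key: sum(stats[key] for stats in splits.values()) for key in keys}


def share(splits, split, key):
    """Fraction of the combined total of `key` that falls in `split`."""
    return splits[split][key] / totals(splits)[key]


if __name__ == "__main__":
    print(totals(SPLITS))
    for name in SPLITS:
        print(f"{name}: {share(SPLITS, name, 'pages'):.1%} of pages, "
              f"{share(SPLITS, name, 'words'):.1%} of words")
```

The page counts sum to 295, matching the total given for the three titles, and the two parts together contain 1,661,197 words, 147,682 types and 176,545 components.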
The Quaero Old Press Extended Named Entity Corpus consists of:
- 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France) (images and OCR output),
- 295 extracted pages in text format along with the corresponding images,
- the fully annotated text corpus, amounting to about 1.3 million words,
- a sub-corpus serving as a mini-reference corpus for quality evaluation purposes,
- tools developed for the extraction of text and images, for annotation and for evaluation,
- guidelines.