ROCO Romanian journalistic corpus

ROCO

http://catalog.elra.info/product_info.php?products_id=1249

ID: ELRA-W0085

ROCO is a Romanian journalistic corpus of approximately 7.1 million tokens and 231,626 types. It is rich in proper names, numerals and named entities.
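As a rough illustration of what these two figures imply, the sketch below computes the type/token ratio and the average number of occurrences per type. It uses only the numbers quoted above; the token count is approximate, so the results are approximate too, and the function name is chosen here for illustration.

```python
# Figures quoted in the corpus description; the token count is approximate.
TOKENS = 7_100_000
TYPES = 231_626


def type_token_ratio(types: int, tokens: int) -> float:
    """Return the number of distinct word forms (types) per running token."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return types / tokens


ttr = type_token_ratio(TYPES, TOKENS)
tokens_per_type = TOKENS / TYPES

print(f"type/token ratio: {ttr:.4f}")
print(f"average occurrences per type: {tokens_per_type:.1f}")
```

With these inputs the ratio is roughly 0.033, that is, each type occurs about 31 times on average.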

The corpus carries morphosyntactic annotations (MSDs, morphosyntactic descriptions) assigned automatically by the TTL tagger, which implements the tiered tagging methodology and has an estimated accuracy of 98%. About 20% of the MSD annotations have been manually checked, validated and, where necessary, corrected. The MSDs follow the Multext-East specifications, which define 614 different MSDs for Romanian; the tagset was slightly modified for this corpus by adding new tags for named entities.
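Multext-East MSDs are positional strings: the first character encodes the part-of-speech category and the remaining characters encode attribute values. The sketch below decodes only that first position, assuming the standard Multext-East category letters. The named-entity tags added for ROCO are not described in this record, so unrecognised codes are reported as "Unknown" rather than guessed. The function names and the sample tags are illustrative, not taken from the corpus.

```python
# Standard Multext-East category codes (position 0 of an MSD).
# Assumption: ROCO keeps these letters; its extra named-entity tags
# are not documented here and therefore fall through to "Unknown".
MSD_CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "D": "Determiner", "T": "Article", "R": "Adverb", "S": "Adposition",
    "C": "Conjunction", "M": "Numeral", "Q": "Particle",
    "I": "Interjection", "Y": "Abbreviation", "X": "Residual",
}


def msd_category(msd: str) -> str:
    """Return the part-of-speech category named by the first MSD character."""
    if not msd:
        raise ValueError("empty MSD")
    return MSD_CATEGORIES.get(msd[0], "Unknown")


def msd_attributes(msd: str) -> list[str]:
    """Return the positional attribute codes that follow the category."""
    return list(msd[1:])


# Illustrative tags, not drawn from ROCO itself.
for tag in ("Ncfsrn", "Vmip3s", "?xyz"):
    print(tag, msd_category(tag), msd_attributes(tag))
```

Interpreting the attribute positions requires the per-category tables in the Multext-East specification, which this sketch deliberately leaves out.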

The corpus was first segmented, then PoS-tagged and lemmatized with the TTL processing chain. It is encoded in XML, and each file includes a metadata header (cesHeader).
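A file of this kind could be read along the following lines. This is a minimal sketch only: the record confirms that each file is XML with a cesHeader, but the token element (`w`) and its `lemma` and `ana` attributes, as well as the sample document, are assumptions made for illustration and may not match the actual ROCO markup.

```python
import xml.etree.ElementTree as ET

# Invented sample: only the presence of a cesHeader is taken from the
# corpus description; every other element and attribute is hypothetical.
SAMPLE = """<cesDoc>
  <cesHeader><note>sample metadata</note></cesHeader>
  <text><body><s>
    <w lemma="casa" ana="Ncfsrn">casa</w>
    <w lemma="fi" ana="Vmip3s">este</w>
  </s></body></text>
</cesDoc>"""


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_header_and_tokens(xml_text: str):
    """Return the cesHeader element and a list of (form, lemma, msd) tuples."""
    root = ET.fromstring(xml_text)
    header = None
    tokens = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "cesHeader" and header is None:
            header = elem
        elif name == "w":  # hypothetical token element
            tokens.append((elem.text or "", elem.get("lemma"), elem.get("ana")))
    return header, tokens


header, tokens = split_header_and_tokens(SAMPLE)
print("header found:", header is not None)
print(tokens)
```

Matching on local names keeps the sketch independent of whatever XML namespace the real files declare.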

Distribution

Availability

Available - Restricted Use

Start date: 30/11/2015

Licence

Licence        Restrictions                   ELRA membership  User nature  Fee
ELRA END USER  Academic - Non Commercial Use  Non-member       Commercial   5,000.00
ELRA VAR       Commercial Use                 Member           Commercial   3,000.00
ELRA END USER  Academic - Non Commercial Use  Member           Commercial   3,000.00
ELRA VAR       Commercial Use                 Member           Academic     3,000.00
ELRA END USER  Academic - Non Commercial Use  Member           Academic     0.00
ELRA VAR       Commercial Use                 Non-member       Commercial   5,000.00
ELRA VAR       Commercial Use                 Non-member       Academic     5,000.00
ELRA END USER  Academic - Non Commercial Use  Non-member       Academic     0.00

Contact Person

Mapelli Valérie

text

Monolingual text corpus

Languages

Romanian

Linguality

Linguality type: Monolingual

Size

no size available

Metadata

Created: 12/05/2005

Version

Version: 1.0

Last Updated: 30/11/2015
