"Le Monde Diplomatique" Arabic tagged corpus
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04).
Each text comes with three files:
- raw text in Arabic,
- vowelised text in Arabic,
- one XML file containing the morphological annotation of the text.
Each word in a text carries several pieces of information, such as its size, its rank in the text, and the number of the paragraph in which it occurs. Each word corresponds to one node in the XML file. Each node contains the following positional features of the word in the text:
- Paragraph number in the text, i.e. paragraph where the word can be found,
- Sentence number in the paragraph,
- Sentence number in the text,
- Rank of the word in the text,
- Rank of the first character of the word in the text,
- Word size.
Annotation information for each word is added as sub-nodes:
- Word as it appears in the non-vowelised text,
- Vowelised word,
- Word lemma,
- Grammatical category of the word.
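The node layout described above can be illustrated with a minimal Python sketch. The element and attribute names used here (`word`, `paragraph`, `char_offset`, `raw`, `vowelised`, `lemma`, `category`, and so on) are hypothetical, as is the two-word sample; the actual XML schema of the corpus is not given in this description and may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Invented sample for illustration only. Tag and attribute names are
# assumptions, not the real schema of the corpus.
SAMPLE_XML = """\
<text>
  <word paragraph="1" sentence_in_paragraph="1" sentence_in_text="1"
        rank="1" char_offset="0" size="3">
    <raw>كتب</raw>
    <vowelised>كَتَبَ</vowelised>
    <lemma>كَتَبَ</lemma>
    <category>verb</category>
  </word>
  <word paragraph="1" sentence_in_paragraph="1" sentence_in_text="1"
        rank="2" char_offset="4" size="5">
    <raw>الولد</raw>
    <vowelised>الوَلَدُ</vowelised>
    <lemma>وَلَد</lemma>
    <category>noun</category>
  </word>
</text>
"""


@dataclass
class Word:
    # Positional features (assumed to be stored as attributes of the node)
    paragraph: int
    sentence_in_paragraph: int
    sentence_in_text: int
    rank: int
    char_offset: int
    size: int
    # Annotation sub-nodes
    raw: str
    vowelised: str
    lemma: str
    category: str


def parse_words(xml_text: str) -> list[Word]:
    """Read every (hypothetical) <word> node into a Word record."""
    root = ET.fromstring(xml_text)
    words = []
    for node in root.iter("word"):
        words.append(
            Word(
                paragraph=int(node.get("paragraph")),
                sentence_in_paragraph=int(node.get("sentence_in_paragraph")),
                sentence_in_text=int(node.get("sentence_in_text")),
                rank=int(node.get("rank")),
                char_offset=int(node.get("char_offset")),
                size=int(node.get("size")),
                raw=node.findtext("raw"),
                vowelised=node.findtext("vowelised"),
                lemma=node.findtext("lemma"),
                category=node.findtext("category"),
            )
        )
    return words


if __name__ == "__main__":
    for w in parse_words(SAMPLE_XML):
        print(w.rank, w.raw, w.vowelised, w.lemma, w.category)
```

Keeping the positional features and the annotation fields in one record makes it straightforward to filter by grammatical category or to map a word back to its place in the raw text, once the names are adjusted to the real schema.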