Helsinki Corpus of Swahili 2.0 Annotated Version


Helsinki Swahili -korpus 2.0, annotoitu versio

HCS 2.0 Annotated


The Helsinki Corpus of Swahili 2.0 Annotated Version, containing about 25 million words, will be made available in Korp. The precise number of words is difficult to count from the annotated texts, because about 8 percent of the ‘words’ are multiword expressions and about 15 percent of the tokens are non-alphanumeric codes, such as diacritics, punctuation marks and XML codes. On the basis of these figures, the corpus is estimated to contain about 25 million individual words.
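
The reasoning behind such an estimate can be illustrated with a small calculation. This is a minimal sketch, not the procedure actually used for the corpus: it assumes that non-alphanumeric tokens are simply excluded and that each multiword expression stands for a fixed average number of individual words. The function name, the token total and the average expression length are hypothetical.

```python
def estimate_word_count(total_tokens, non_alnum_share=0.15,
                        mwe_share=0.08, avg_mwe_length=2.0):
    """Rough estimate of individual words from a raw token count.

    The two shares come from the description above; avg_mwe_length
    is an assumption, since the source does not state it.
    """
    # Exclude punctuation, diacritic codes, XML codes and similar tokens.
    word_tokens = total_tokens * (1.0 - non_alnum_share)
    # Ordinary single words.
    single = word_tokens * (1.0 - mwe_share)
    # Multiword expressions, each expanded to its assumed average length.
    multi = word_tokens * mwe_share * avg_mwe_length
    return round(single + multi)


# Hypothetical token total, chosen only for illustration.
print(estimate_word_count(1_000_000))
```

With a different assumed average length for multiword expressions the result changes accordingly, which is one reason the figure can only be given approximately.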

Each token in the corpus carries several kinds of linguistic information. The corpus was annotated using Salama Tagger.

Preparation of the material

Most of the corpus material was retrieved from the Web. This method was used increasingly once texts became available on the Web. Only texts from news media and open government pages were retrieved. Some types of texts, such as books, were scanned and proofread. Part of the oldest news material, from the 1980s before scanners were in use, was typed manually.

The corpus material has gone through a series of formatting and correction routines.

1. Converting the text into ASCII format, as required by the tagger. Web texts use a wide variety of codes for diacritics, and these had to be normalized.
2. Proofreading and correcting the text with a speller.
3. Analyzing the proofread text to find remaining typos and possible new words.
4. Constructing a correction program that automatically corrects those typos that can be safely corrected. More than 8000 such mistake types were identified.
5. New words found in the corpus were added to the parser.
6. Texts were corrected using the constructed correction program.
7. Metadata in text files were formalized.
8. Texts were converted into sentence-per-line format.
9. Text within each file was randomly shuffled to mix the sentence order.
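
Steps 1, 6, 8 and 9 above can be sketched as a small pipeline. This is an illustrative sketch only, not the tooling used for the corpus: the mapping tables, the function names and the naive sentence splitter are all assumptions, and the single typo entry is a hypothetical stand-in for the more than 8000 mistake types mentioned in step 4.

```python
import random
import re

# Hypothetical table mapping diacritic codes found in Web texts to ASCII.
DIACRITIC_CODES = {"&eacute;": "e", "&#233;": "e", "\u00e9": "e"}

# Hypothetical table of typos considered safe to correct automatically.
SAFE_CORRECTIONS = {"serekali": "serikali"}


def normalize_diacritics(text):
    """Replace each known diacritic code with its ASCII form (step 1)."""
    for code, ascii_form in DIACRITIC_CODES.items():
        text = text.replace(code, ascii_form)
    return text


def apply_corrections(text):
    """Correct whole words listed in the typo table (step 6)."""
    def fix(match):
        word = match.group(0)
        return SAFE_CORRECTIONS.get(word, word)
    return re.sub(r"[A-Za-z']+", fix, text)


def to_sentences(text):
    """Naively split after ., ! or ? followed by whitespace (step 8)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def shuffle_sentences(sentences, seed=None):
    """Return the sentences in random order (step 9)."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def prepare(text, seed=None):
    """Run the sketched steps in order and return shuffled sentences."""
    text = normalize_diacritics(text)
    text = apply_corrections(text)
    return shuffle_sentences(to_sentences(text), seed)


for sentence in prepare("Hii ni serekali. Caf&eacute; imefungwa!", seed=1):
    print(sentence)  # one sentence per line
```

Passing a seed makes the shuffle reproducible, which is convenient for testing; the source does not say whether the original shuffling was seeded.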

The result of these routines constitutes the Helsinki Corpus of Swahili 2.0 Not Annotated Version.

The result of these routines was annotated with Salama Tagger, thus producing the Korp format of the corpus.

Metadata were added to each file.

Structure of the corpus

HCS 2.0 contains the following types of material:

Old material

1. Books
2. News

New material

1. Bunge
2. News

The old material dates from before 2003. Much of it is also in the Helsinki Corpus of Swahili 1.0. There are two major differences, however. The earlier corpus included only sections of books, whereas the new corpus includes whole texts. In addition, the text sections in the old corpus are in their original order, whereas in the new corpus the sentences are randomly shuffled.

Most of the material consists of news texts. The section ‘Bunge’ contains Hansards of the Tanzanian Parliament from the years 2004, 2005 and 2006. Metadata at the beginning of each file give more information. The file names also give hints about the contents of the files.

A word in the annotated corpus normally carries the following types of information:

1. token
2. stem
3. part-of-speech
4. morphological description
5. gloss in English
6. syntactic tag
7. rest of verb description

The last point concerns only verbs.
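
The seven information types listed above can be pictured as a simple record. The sketch below is hypothetical: the source does not specify the file layout, so the tab-separated format, the field names and the sample values are assumptions and do not reproduce actual Salama Tagger output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedToken:
    token: str
    stem: str
    pos: str                         # part-of-speech
    morphology: str                  # morphological description
    gloss: str                       # gloss in English
    syntax: str                      # syntactic tag
    verb_rest: Optional[str] = None  # rest of verb description, verbs only


def parse_token_line(line, sep="\t"):
    """Parse one token line, assuming one field per column."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 6:
        raise ValueError("expected at least 6 fields, got %d" % len(fields))
    # The seventh column is optional and is present only for verbs.
    verb_rest = fields[6] if len(fields) > 6 and fields[6] else None
    return AnnotatedToken(*fields[:6], verb_rest)


# Illustrative values only; the tags are invented for this example.
sample = "kitabu\tkitabu\tN\t7/8-SG\tbook\t@SUBJ"
record = parse_token_line(sample)
print(record.token, record.gloss, record.verb_rest)
```

Keeping the verb-only field optional lets the same record type cover verbs and all other words.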
