SOMALI CORPUS: A FRAMEWORK FOR LINGUISTIC ANNOTATION [Abstract ID: 0801-09]
The development of IT resources for languages focuses mainly on well-described languages with long-standing written traditions and large numbers of speakers. One of the main challenges for languages with more recent written traditions is the lack of sufficient data for successful statistical approaches. This descriptive paper presents the current state of construction of the Red sea Cultural Foundation’s Somali Corpus (RCF-SC), built in collaboration with the Oriental University of Naples (Italy), and the development of a series of computer programs for analyzing the corpus data for various purposes. The core of RCF-SC is unique in Somali-speaking countries and aims to be, for Somali, a resource equivalent in quality to the British National Corpus. The first edition of the corpus, containing 5 million words tagged and grammatically annotated, is online at www.somalicorpus.com.