20th International Conference of Ethiopian Studies (ICES20)
Mekelle University, Ethiopia

"Regional and Global Ethiopia - Interconnections and Identities"
1-5 October, 2018

WEB CORPORA FOR FOUR MAJOR ETHIOPIAN LANGUAGES [Abstract ID: 0801-06]

DERIB Ado, Addis Ababa University, Ethiopia
FEDA Negesse, Addis Ababa University, Ethiopia
SHIMELIS Mazengia, Addis Ababa University
GIRMA Mengistu, Addis Ababa University
AHMED Yusuf Hirad, Jigjiga University
Janne Bondi JOHANNESSEN, University of Oslo

This paper describes web text corpora for the four major languages of Ethiopia: Amharic (17,000,000 words), Oromo (4,000,000 words), Somali (72,000,000 words), and Tigrinya (2,000,000 words). The corpora were developed through a collaboration between two projects. The first is "Linguistic Capacity Building, tools for the inclusive development of Ethiopia", a joint project between four Norwegian and Ethiopian universities; the second is the Czech-Norwegian HaBiT project. The technical development of the corpora, including harvesting the web texts for the four languages, was carried out entirely by the Centre for Natural Language Processing at Masaryk University, Czech Republic. The linguistic work, which included revising 350-450 seed bigrams for language detection, quality checking, and evaluating the corpora, was done by the Department of Linguistics at Addis Ababa University (AAU). The corpora are presented in the HaBiT system (Kala et al. 2017). The search system for the four corpora offers simple and advanced concordances down to the character level, reports the frequency per million of search items, generates word lists, supports advanced search using regular expressions, and provides word sketches and thesauruses. The Amharic corpus was POS-tagged using the tagset developed by Demeke & Getachew (2006). The main challenges in developing the corpora were extensive quotations from other languages, such as Ge’ez in Amharic texts, and a lack of balance between domains, as the online content in the four languages is skewed towards religion and politics. The raw text of all four web-text corpora is available for download and will also be made available in the Glossa corpus management system (Johannessen et al. 2008).
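Because the four corpora differ greatly in size, raw hit counts are not directly comparable across them, which is why a per-million figure is useful. The following is a minimal sketch of that normalization, not the search system's actual implementation. The corpus sizes are the rounded figures given in the abstract; the function name `frequency_per_million` and the hit counts are hypothetical and chosen only for illustration.

```python
# Approximate corpus sizes in words, as stated in the abstract.
CORPUS_SIZES = {
    "Amharic": 17_000_000,
    "Oromo": 4_000_000,
    "Somali": 72_000_000,
    "Tigrinya": 2_000_000,
}


def frequency_per_million(hits: int, corpus_size: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


# Hypothetical raw hit counts for one search item (illustrative only,
# not taken from the corpora).
example_hits = {"Amharic": 3_400, "Somali": 7_200}

for language, hits in example_hits.items():
    fpm = frequency_per_million(hits, CORPUS_SIZES[language])
    print(f"{language}: {hits} hits, {fpm:.1f} per million")
```

In this made-up example the Somali count is the larger raw number, yet the item is relatively more frequent in the Amharic corpus once corpus size is taken into account.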