Field and river

20th International Conference of Ethiopian Studies (ICES20)
Mekelle University, Ethiopia

"Regional and Global Ethiopia - Interconnections and Identities"
1-5 October, 2018


ASMELASH Teka Hadgu, L3S Research Center

Publicly available linguistic resources for computational linguistics research on Ethiopic languages are scarce. This paper attempts to bridge that gap by building computational linguistic resources for Ethiopic from the Web. The study gathered a large-scale corpus by scraping heterogeneous web pages, including Bibles, news media articles and blog posts, as well as popular social media sites such as Twitter for social feeds. Preliminary experiments were performed on two tasks: (i) language identification for Amharic, Tigrinya and Ge'ez and (ii) learning word embeddings for Amharic and Tigrinya. A state-of-the-art result was achieved on the language identification task. The contributions of this work are threefold: a raw corpus for Ethiopic-based languages, a language identification tool for these languages, and pre-trained word vectors for Amharic and Tigrinya. The study makes these computational tools and resources available to the research community.
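To illustrate the language identification task, the following minimal sketch compares character bigram profiles by cosine similarity. The abstract does not describe the classifier that was actually used, so this technique is an assumption chosen for brevity. The names `TOY_SAMPLES`, `build_profiles` and `identify` are hypothetical, and the two training phrases are toy examples standing in for a real corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams, padding the text with one space on each side."""
    padded = f" {text.strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training data (assumption): one short greeting per language.
# A real system would be trained on a large scraped corpus instead.
TOY_SAMPLES = {
    "amharic": ["ሰላም እንዴት ነህ"],
    "tigrinya": ["ሰላም ከመይ ኣለኻ"],
}


def build_profiles(samples):
    """Merge the n-gram counts of all sample texts into one profile per language."""
    profiles = {}
    for lang, texts in samples.items():
        profile = Counter()
        for text in texts:
            profile.update(char_ngrams(text))
        profiles[lang] = profile
    return profiles


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


profiles = build_profiles(TOY_SAMPLES)
print(identify("እንዴት ነህ", profiles))
print(identify("ከመይ ኣለኻ", profiles))
```

Character n-grams suit this setting because Amharic, Tigrinya and Ge'ez share the Ethiopic script, so the script alone cannot separate them. With only one phrase per language the sketch is purely illustrative and says nothing about the accuracy reported in the abstract.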