20th International Conference of Ethiopian Studies (ICES20)
Mekelle University, Ethiopia

"Regional and Global Ethiopia - Interconnections and Identities"
1-5 October, 2018

FIVE ETHIOPIAN SPEECH CORPORA [Abstract ID: 0802-10]

DERIB Ado, Addis Ababa University, Ethiopia
FEDA Negesse, Addis Ababa University, Ethiopia
BAYE Yimam, Addis Ababa University
BINYAM Sisay, UNESCO
FEKEDE Menuta, Hawassa University
MOGES Yigezu, Addis Ababa University
Ronny MEYER, INALCO – Institut national des langues et civilisations orientales
Janne Bondi JOHANNESSEN, University of Oslo

Describing all aspects of a language requires empirical sources. Individual interviews and various types of tests and questionnaires provide valuable data, but sources that contain the language in actual use are indispensable. Corpora represent exactly this kind of data. For written languages, text corpora can be a valuable resource, but for languages and dialects that do not have a written standard, speech corpora (machine-readable and searchable linguistic data) are required. The Norwegian NORHED project Linguistic Capacity Building – Tools for the Inclusive Development of Ethiopia (2013–2019) aims to produce orthographies and school material for some of the under-resourced languages of Ethiopia. So far the project researchers have developed five small speech corpora, which have subsequently been made available in the search system Glossa. They are listed below with their size and number of informants:

• Amharic: 25500 words, 12 informants.
• Gumer: 19000 words, 14 informants.
• Hamar: 16900 words, 2 informants.
• Muher: 40500 words, 8 informants.
• Oromo: 13350 words, 3 informants.
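
To give a sense of the combined size of the material, the figures listed above can be added up. The following is only an illustrative sketch: the names CORPORA and summarize are hypothetical, and the informant total assumes that no informant contributed to more than one corpus, which the abstract does not state.

```python
# Corpus sizes exactly as listed above: (language, words, informants).
CORPORA = [
    ("Amharic", 25500, 12),
    ("Gumer", 19000, 14),
    ("Hamar", 16900, 2),
    ("Muher", 40500, 8),
    ("Oromo", 13350, 3),
]


def summarize(corpora):
    """Return total words, total informants, and mean words per informant."""
    total_words = sum(words for _, words, _ in corpora)
    # Assumption: informants are not shared between corpora.
    total_informants = sum(n for _, _, n in corpora)
    per_informant = {name: words / n for name, words, n in corpora}
    return total_words, total_informants, per_informant


total_words, total_informants, per_informant = summarize(CORPORA)
print(f"Total: {total_words} words, {total_informants} informants")
for name, avg in per_informant.items():
    print(f"{name}: {avg:.0f} words per informant")
```

The per-informant averages make visible how unevenly the material is distributed: the Hamar corpus, for example, rests on two informants only.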

In addition, speech corpora for Gamo, Haddiya and Sidaama will be available soon, and more material will be added to the existing corpora. All the material has been recorded on either audio or video. It has been transcribed in the standard orthography, where one exists, or otherwise in a modified IPA transcription, using the annotation software ELAN. Metadata for each recording include variables such as gender, age, language background and place. The recordings consist of interviews and conversations. The corpus search interface is Glossa (Johannessen et al. 2008, Kosek et al. 2015), which offers three levels of search complexity, including the possibility to search for the beginning and end of words, the beginning and end of segments, etc. Searches can be filtered by metadata and can be carried out stepwise. The results are presented as concordances and as frequency lists, from which the corresponding concordances can be consulted, together with audio and video files that are time-aligned with the transcription. The corpora are freely available to the public and can also be installed on local machines.
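
The kind of search described above (matching the beginning or end of words, filtering by metadata, and returning concordances and frequency lists tied to time-aligned segments) can be illustrated with a small sketch. This is not Glossa's implementation or API: the Segment class, the search and frequency_list functions, and the English placeholder data are all hypothetical and serve only to show the idea.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed, time-aligned segment with speaker metadata (hypothetical)."""
    speaker: str
    gender: str
    age: int
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


# Invented placeholder data, not taken from any of the corpora.
SEGMENTS = [
    Segment("s1", "F", 34, 0.0, 2.1, "we walked to the market today"),
    Segment("s2", "M", 58, 2.1, 4.0, "the market was walking distance"),
    Segment("s1", "F", 34, 4.0, 6.5, "she talked and walked home"),
]


def search(segments, prefix="", suffix="", **metadata):
    """Find words by beginning and/or end, restricted to matching metadata."""
    hits = []
    for seg in segments:
        # Metadata filter: skip segments whose speaker data do not match.
        if any(getattr(seg, key) != value for key, value in metadata.items()):
            continue
        tokens = seg.text.split()
        for i, tok in enumerate(tokens):
            if tok.startswith(prefix) and tok.endswith(suffix):
                left = " ".join(tokens[:i])
                right = " ".join(tokens[i + 1:])
                hits.append((seg, tok, left, right))
    return hits


def frequency_list(hits):
    """Count how often each matched word form occurs among the hits."""
    return Counter(tok for _, tok, _, _ in hits)


# Words ending in "ed", spoken by female informants.
hits = search(SEGMENTS, suffix="ed", gender="F")
for seg, tok, left, right in hits:
    # Keyword-in-context line with the time span of the aligned segment.
    print(f"[{seg.start:.1f}-{seg.end:.1f}s {seg.speaker}] {left} [{tok}] {right}")
print(frequency_list(hits).most_common())
```

In a real corpus system, the time span of each hit would be used to play back the corresponding stretch of audio or video; here it is only printed alongside the concordance line.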