Overcoming Challenges in Corpus Studies of Low-Resource Languages Through Insights from a Sesotho Readability Metrics Project
CLAREP Journal of English and Linguistics (C-JEL)
Author: Johannes Sibeko
Institution: Nelson Mandela University, South Africa
Email: jo*******@ma*****.za
Abstract
This article is a narrative inquiry into the challenges of computational and corpus studies in low-resource languages, with a focus on Sesotho. I reflect on the challenges encountered in a larger project on developing readability measures for Sesotho, a low-resource indigenous language of Southern Africa. The scarcity of annotated texts emerges as a primary challenge, worsened by inconsistencies in the quality of available data and in Sesotho orthography. I then discuss the strategies and solutions that were employed to circumvent these challenges, including the use of data from translated government texts and religious scriptures, as well as collaborative efforts and resource sharing, which offer ways to overcome the lack of data. I also discuss the importance of sustainable funding and open access principles for the enduring viability and broader impact of corpus studies in low-resource language contexts. Looking ahead, I highlight opportunities for refining methodologies and exploring alternatives such as crowdsourcing and expert sourcing in corpus creation and dissemination, which offer promising directions for future research in this domain. Overall, this article contributes to the scholarship on effective strategies for corpus studies, including corpus creation, curation, and analysis, in low-resource languages such as Sesotho, paving the way for improved linguistic research and development in diverse linguistic contexts.
Keywords: Low-resource language, Sesotho, Corpus studies, Readability measures, Computational linguistics
Pages: 77-98
ISSN: 2698-654X
ISBN: 978-3-96203-404-7 (Print)
ISBN: 978-3-96203-405-4 (PDF)
DOI: https://doi.org/10.56907/gvklcltj