Optical Character Recognition and text cleaning in the indigenous South African languages

  • Danie Prinsloo, University of Pretoria
  • Elsabé Taljard, University of Pretoria
  • Michelle Goosen, University of Pretoria
Keywords: text cleaning, Optical Character Recognition (OCR) tools, ‘noise’ in text-based corpora, scanning errors, text-sourced corpora, granularity of cleanness

Abstract

This article follows up on the authors' unpublished presentations on text and corpus cleaning strategies for the African languages. We provide a comparative description of the cleaning of web-sourced and text-sourced material to be used for the compilation of corpora, with specific attention to the cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term “web-sourced material” to refer to digital data sourced from the internet, whereas “text-based material” refers to hard-copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, followed by an evaluation of three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus that the texts make up. Text corpora that are to be used for, e.g., lexicographic purposes can tolerate a higher level of ‘noise’ than those used for compiling, e.g., spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.

Author Biographies

Danie Prinsloo, University of Pretoria
Department of African Languages, professor
Elsabé Taljard, University of Pretoria
Department of African Languages, professor
Michelle Goosen, University of Pretoria
Department of African Languages
Published
2023-01-24
How to Cite
Prinsloo, D., Taljard, E., & Goosen, M. (2023). Optical Character Recognition and text cleaning in the indigenous South African languages. Stellenbosch Papers in Linguistics Plus, 64(1), 165-187. https://doi.org/10.5842/64-1-867
Section
Articles