Fine-Tuning DeepSpeech Speech-To-Text Model for Nigerian English and Yoruba-English Code-Switched Speech

Oluwaseyi Ezekiel Olorunshola; Fortunatus Omotebi Olorunsola; Fatimah Adamu-Fika

doi:10.9734/ajrcos/2025/v18i8737

Published: 2025-08-07

Page: 24-34

Issue: 2025 - Volume 18 [Issue 8]

Oluwaseyi Ezekiel Olorunshola *

Department of Computer Science, Air Force Institute of Technology, Kaduna, Nigeria.

Fortunatus Omotebi Olorunsola

Department of Computer Science, Air Force Institute of Technology, Kaduna, Nigeria.

Fatimah Adamu-Fika

Department of Cyber Security, Air Force Institute of Technology, Kaduna, Nigeria.

*Author to whom correspondence should be addressed.

Abstract

Speech-to-Text (STT) systems, despite strong performance gains in recent years, still struggle to recognise non-Western English accents and speech that features Code-Switching (CS), a linguistic phenomenon common in regions such as Nigeria. This study addresses that challenge for Nigerian English and Yoruba-English code-switched speech by fine-tuning Mozilla’s DeepSpeech 0.9.3 model on a custom dataset of 118 minutes (approximately 1.97 hours). The process involved transfer learning and hyperparameter optimisation over iterative training sessions on a CPU-based setup. The model’s performance was evaluated using Word Error Rate (WER) and Character Error Rate (CER); the best model showed modest improvements over the baseline, achieving a WER of 0.760261 and a CER of 0.381241 after 55 epochs. Although limited computing resources and the small dataset imposed significant constraints on the work, the study demonstrated the potential of fine-tuning and transfer learning for adapting models to low-resource languages and code-switching contexts. Future work will require GPU resources for improved convergence and transcription accuracy, an expanded dataset, and support for Yoruba diacritics to improve transcription quality.
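
For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are conventionally computed: the Levenshtein (edit) distance between reference and hypothesis, divided by the reference length in words or characters. It assumes these standard definitions; the abstract does not describe the exact scoring procedure used in the study, and the function names and the sample sentence pair are illustrative, not taken from the study's dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative pair (invented for this sketch, not from the study's data).
reference = "the model is good"
hypothesis = "the modal is good"
print(f"WER = {wer(reference, hypothesis):.6f}")
print(f"CER = {cer(reference, hypothesis):.6f}")
```

Under these definitions, a WER of 0.760261 corresponds to roughly three word-level edits for every four reference words, while the lower CER of 0.381241 indicates that many of the misrecognised words are still partly correct at the character level.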

Keywords: DeepSpeech, speech-to-text technology, code switching, accents

How to Cite

Olorunshola, Oluwaseyi Ezekiel, Fortunatus Omotebi Olorunsola, and Fatimah Adamu-Fika. 2025. “Fine-Tuning DeepSpeech Speech-To-Text Model for Nigerian English and Yoruba-English Code-Switched Speech”. Asian Journal of Research in Computer Science 18 (8):24-34. https://doi.org/10.9734/ajrcos/2025/v18i8737.
