Modular Co-attention Networks in Nepali Visual Question Answering Systems

Aashish Gyanwali

Department of Electronics and Computer Engineering, Thapathali Campus, Institute of Engineering, Tribhuvan University, Nepal.

Binod Sapkota *

Department of Electronics and Computer Engineering, Thapathali Campus, Institute of Engineering, Tribhuvan University, Nepal.

Abhishek Koirala

Department of Electronics and Computer Engineering, Thapathali Campus, Institute of Engineering, Tribhuvan University, Nepal.

Babu R Dawadi

Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering, Tribhuvan University, Nepal.

*Author to whom correspondence should be addressed.


Abstract

Visual question answering (VQA) is regarded as a challenging task that combines computer vision and natural language processing. Because no dataset was available to train such a model for the Nepali language, a new dataset was developed in this research by translating the VQAv2 dataset. The resulting dataset, consisting of 202,577 images and 886,560 questions, was used to train an attention-based VQA model. It contains yes/no, counting, and other questions, primarily with one-word answers. A Modular Co-attention Network (MCAN) was applied to visual features extracted using the Faster R-CNN framework and to question embeddings extracted using a Nepali GloVe model. After the visual and language features are co-attended through a few cascaded MCAN layers, they are fused to train the whole network. During evaluation, the model obtained an overall accuracy of 69.87%, with 81.09% accuracy on yes/no questions. These results surpass the performance of models developed for the Hindi and Bengali languages. Overall, this work presents novel research in the Nepali-language VQA domain and paves the way for further advancements.
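The co-attention step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain scaled dot-product attention with residual connections and omits the learned projections, multiple heads, feed-forward sublayers, and layer normalization that a full MCAN layer would use. The function names, the toy feature values, and the mean-pool-and-add fusion are all hypothetical choices made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def add(a, b):
    """Element-wise residual sum of two equally shaped feature lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def co_attention_layer(question, image):
    """One simplified layer: self-attention on each modality, then
    question-guided attention on the image regions (assumed ordering)."""
    question = add(question, attention(question, question, question))
    image = add(image, attention(image, image, image))
    # Image regions act as queries; question words supply keys and values.
    image = add(image, attention(image, question, question))
    return question, image

def mean_pool(feats):
    """Average the feature vectors into a single vector."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]

def fuse(question, image):
    """Hypothetical fusion: add the mean-pooled features of both modalities."""
    return [q + v for q, v in zip(mean_pool(question), mean_pool(image))]

# Toy inputs: 3 question-word embeddings and 2 image-region features, 4-dim each.
question = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [0.0, 0.2, 0.1, -0.3]]
image = [[0.2, 0.1, 0.0, 0.4], [-0.3, 0.5, 0.1, 0.2]]

for _ in range(2):  # "a few cascaded layers"; 2 is an arbitrary choice
    question, image = co_attention_layer(question, image)

fused = fuse(question, image)
print(len(fused))
```

In a trained system the fused vector would feed an answer classifier; here it only shows that the two modalities end up in a single representation of the same dimensionality as the inputs.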

Keywords: Nepali visual question answering, Nepali VQA dataset, modular co-attention


How to Cite

Gyanwali, Aashish, Binod Sapkota, Abhishek Koirala, and Babu R Dawadi. 2024. “Modular Co-Attention Networks in Nepali Visual Question Answering Systems”. Asian Journal of Research in Computer Science 17 (10):62-84. https://doi.org/10.9734/ajrcos/2024/v17i10510.