Comparing the Performance of Convolutional Neural Networks and Vision Transformers in Object Detection: A Review

Najib Hassan Adamu; Anas Tukur Balarabe

doi:10.9734/ajrcos/2025/v18i12798

Published: 2025-12-18

Page: 184-201

Issue: 2025 - Volume 18 [Issue 12]

Najib Hassan Adamu *

Department of Computer Science, Sokoto State University, Sokoto, Nigeria.

Anas Tukur Balarabe

Department of Computer Science, Sokoto State University, Sokoto, Nigeria.

*Author to whom correspondence should be addressed.

Abstract

This review studies the evolution of object detection methodologies, from traditional to modern deep learning techniques, including CNNs (Convolutional Neural Networks), YOLO (You Only Look Once) variants (v1–v8), and ViTs (Vision Transformers). A systematic analysis of 49 studies shows that CNNs are robust on small datasets and in real-time applications. In contrast, ViTs excel at handling complex relationships and adversarial conditions due to their self-attention mechanisms. Hybrid models combining CNNs and ViTs show promise for improved accuracy and efficiency but usually require further validation. Key challenges include computational demands, dataset diversity, and generalisation across domains. Despite significant progress, there is limited consolidated analysis comparing CNNs, YOLO, and ViTs across diverse datasets and real-world constraints. The comparison in this study may lay the groundwork for researchers to explore new gaps in the future. Its results not only reinforce the potential of object detection techniques but also provide useful insights for researchers and practitioners aiming to balance performance with computational cost in real-world detection scenarios. Future research should prioritise hybrid architectures, edge deployment, and standardised benchmarking to advance object detection in domains such as surveillance, healthcare, quality control, inventory management, and autonomous systems.

Keywords: Computer vision, CNNs (Convolutional Neural Networks), YOLO (You Only Look Once), YOLOv1–v8, ViTs (Vision Transformers)

How to Cite

Adamu, Najib Hassan, and Anas Tukur Balarabe. 2025. “Comparing the Performance of Convolutional Neural Networks and Vision Transformers in Object Detection: A Review”. Asian Journal of Research in Computer Science 18 (12):184-201. https://doi.org/10.9734/ajrcos/2025/v18i12798.
