Abstract
Gastrointestinal (GI) endoscopy is crucial for the diagnosis of digestive diseases. It provides detailed visual information about the GI tract and helps identify abnormalities. However, endoscopic images are difficult to analyse because of their complexity and the variation introduced by factors such as lighting, texture and patient movement. These challenges call for methods that improve both diagnostic accuracy and efficiency. This study introduces a deep learning framework built on hybrid CNN-transformer models enhanced by a Convolutional Block Attention Module (CBAM). The framework uses a pre-trained Vision Transformer (ViT) to capture global image features and a convolutional neural network (CNN) to extract local features. CBAM directs the model's attention to the relevant image regions, which improves both interpretability and performance. Ensemble learning combines the predictions of multiple models to make the final classification more reliable and accurate. Evaluated on the publicly available Kvasir GI endoscopy dataset, the proposed model achieved an accuracy of 94.13% and a precision of 94.21%, outperforming existing methods. The framework offers a reliable and effective approach to analysing GI endoscopy images and could improve automated diagnosis, supporting earlier disease detection and better patient outcomes.
Keywords: Convolutional Block Attention Module, Gastrointestinal Disease Detection, Hybrid CNN-Transformer, Wireless Capsule Endoscopy.
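The ensemble step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which combination rule was used, so the example below assumes soft voting (averaging the class probabilities of each model and taking the most probable class), which is one common choice. The function name `soft_vote` and the probability values are hypothetical and serve only as illustration.

```python
from typing import List, Sequence


def soft_vote(model_probs: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Combine several models by averaging their class probabilities.

    model_probs[m][i][c] is the probability that model m assigns to
    class c for image i. Returns one predicted class index per image.
    Assumption: soft voting; the paper's actual rule may differ.
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n_models = len(model_probs)
    n_images = len(model_probs[0])
    predictions = []
    for i in range(n_images):
        n_classes = len(model_probs[0][i])
        # Mean probability of each class across all models for image i.
        mean = [
            sum(model_probs[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        # Predicted class is the one with the highest mean probability.
        predictions.append(max(range(n_classes), key=mean.__getitem__))
    return predictions


# Hypothetical outputs of two models for two images over three classes.
probs_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
probs_b = [[0.4, 0.5, 0.1], [0.1, 0.2, 0.7]]
ensemble_labels = soft_vote([probs_a, probs_b])
```

In the second image the two models disagree (class 1 versus class 2), and averaging resolves the disagreement in favour of the more confident model. Hard (majority) voting or weighted averaging would be equally plausible alternatives under the abstract's description.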