Abstract
Deoxyribonucleic acid (DNA) sequence classification is an important task in computational biomedical data processing. A DNA sequence carries genomic information that can be retrieved from the chromosomes of human cells, and the gene information it encodes is used to predict disease, particularly in cancer diagnosis and therapy. Analyzing and categorizing this sequence information is challenging and requires powerful computational techniques. Moreover, the class samples in gene expression data are imbalanced, and some class labels have too few sequence samples. The main objective of this work is therefore to augment the sequence samples so that class prediction becomes more accurate. In the data preprocessing stage, ordinal encoding, one-hot encoding, and k-mer counting are applied to convert the sequence information into numerical values. The proposed Wasserstein Sequence Generative Adversarial Network (WSEQ-GAN) is then used to generate augmented sequence data, and its results are compared with those of traditional balancing methods such as sampling and SMOTE. Machine learning (ML) and deep learning (DL) techniques, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Long Short-Term Memory (LSTM), are used to train on and classify the sequence data. Classification results on non-augmented data and on data augmented with WSEQ-GAN were compared with those of existing methods. The proposed WSEQ-GAN combined with the LSTM network achieved 98% classification accuracy, outperforming the existing classification and augmentation techniques considered.
Keywords: Deep Learning Methods, DNA Sequence, Machine Learning Methods, WSEQ-GAN.
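
The three encodings named in the abstract can be illustrated with a minimal sketch. It is not the preprocessing code used in this work: the function names, the alphabet order `ACGT`, the integer codes (1 to 4, with 0 for an unknown base), and the default `k = 3` are all assumptions made only for illustration.

```python
from collections import Counter
from itertools import product

# Assumed nucleotide alphabet and ordering (hypothetical choice for this sketch).
ALPHABET = "ACGT"


def ordinal_encode(seq):
    """Map each base to an integer code; unknown symbols (e.g. N) map to 0."""
    mapping = {base: i + 1 for i, base in enumerate(ALPHABET)}
    return [mapping.get(base, 0) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector; unknown symbols map to all zeros."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_counts(seq, k=3):
    """Count overlapping k-mers and return a fixed-length vector of size 4**k.

    The vector follows the lexicographic order of all k-mers over ALPHABET;
    k-mers containing unknown symbols are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return [counts.get(kmer, 0) for kmer in vocabulary]
```

For example, `ordinal_encode("ACGTN")` returns `[1, 2, 3, 4, 0]`, and `kmer_counts(seq, k=2)` always returns a vector of length 16 regardless of sequence length. Ordinal and one-hot encodings preserve base order and so suit sequence models such as LSTM, whereas fixed-length k-mer count vectors are a more natural input for SVM or KNN.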