Data Analytics in Bioinformatics. Group of authors
et al. [32] analyzed a DNA microarray cancer dataset for survival analysis of patients by comparing different machine learning algorithms such as ANN, logistic regression, linear discriminant analysis, SVM, and k-nearest neighbor. One of the main findings was that, although the performance of ANN depends on the statistical significance of the features, given the large sample size the ANN outperformed all the other classifiers, achieving the greatest area under the curve (AUC).
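The area-under-the-curve criterion used in that comparison can be computed directly from score ranks. A minimal sketch follows; the labels and scores are invented for illustration and are not from the study:

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity: the probability
    that a randomly chosen positive scores above a randomly chosen negative."""
    labels = np.asarray(labels, dtype=bool)
    ranks = np.argsort(np.argsort(scores)) + 1  # ranks 1..n (ties not handled)
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    # Sum of positive ranks minus its minimum possible value, normalized
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Hypothetical survival labels and classifier scores
y = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.3, 0.8, 0.9, 0.7, 0.2]  # every positive outranks every negative
print(roc_auc(y, scores))  # → 1.0
```

A perfect ranking gives AUC 1.0, a reversed ranking 0.0, and chance-level scoring about 0.5, which is why AUC is a convenient single-number comparison across classifier families.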
Soto et al. in 2020 [33] dealt with the 11_Tumors database, a well-recognized database of cancer-related gene expression microarrays, to predict the likelihood of different types of cancer. The database contained 12,533 gene expression measurements for 174 samples across 11 categories of cancer. Since the dataset had a large number of features, the dimensionality reduction technique PCA was used to reduce the number of features from 12,533 to 113. Classification was done with a multilayer feedforward network (MLP) of three hidden layers, each with 100 neurons, using the softsign activation function; the output layer of 11 neurons used the sigmoid activation function. The classification model was evaluated with 10-fold cross-validation and achieved an accuracy of 97.14%.
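The shape of that pipeline (PCA reduction followed by a softsign MLP with a sigmoid output) can be sketched in NumPy. This is only a forward-pass illustration with random weights and toy dimensions standing in for the real 12,533 genes, 113 components, and 11 classes; it is not the trained model from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_reduce(X, k):
    """Project X onto its top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def softsign(x):
    return x / (1.0 + np.abs(x))  # smooth, bounded in (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights):
    """Softsign hidden layers, sigmoid output layer (per-class scores)."""
    h = x
    for W in weights[:-1]:
        h = softsign(h @ W)
    return sigmoid(h @ weights[-1])

# Toy dimensions: 30 samples, 200 "genes" -> 10 components -> 11 classes
X = rng.normal(size=(30, 200))
Z = pca_reduce(X, 10)
ws = [rng.normal(size=(10, 100)), rng.normal(size=(100, 100)),
      rng.normal(size=(100, 100)), rng.normal(size=(100, 11))]
probs = mlp_forward(Z, ws)
print(probs.shape)  # (30, 11)
```

The PCA step is what makes the MLP tractable here: the first weight matrix has 10 × 100 entries instead of 12,533 × 100, which also reduces the risk of overfitting on only 174 samples.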
Wei et al. [34] collected 56 cDNA microarray tumor samples from 49 neuroblastoma patients to predict the survival rate of high-risk patients. To remove poor-quality data, a principal component analysis algorithm was used, reducing a total of 42,578 features to 37,920 data points. Each sample was analysed using a powerful ANN-based predictor model, and despite this high complexity, an accuracy of 88% was achieved. The authors also implemented an ANN-based gene minimization strategy in a separate analysis of 19 genes, in which high-risk patients were further divided into subgroups based on their survival status; this derived subset of 19 genes correctly classified 98% of the patients. Finally, they concluded that the ANN-based approach has a significant ability to predict survival rate, which would allow personalized therapies for patients according to their gene expression profiles.
Cangelosi et al. [35] developed a robust ANN classification model to predict neuroblastoma patient outcome with a minimized error rate by defining a gene expression signature (NB-hypo) that measures the hypoxic status of 100 neuroblastoma tumor gene expression profiles. An ANN-MLP was applied to build a hypoxia predictor from the 62 probe sets of the signature, with potential clinical application in evaluating the hostile effect of tumor hypoxia on disease progression. The results showed that the MLP achieved similar or better performance compared with SVM, Naïve Bayes, and logistic regression models. ANN proved to be a competitive tool for predicting a 'poor' or 'good' patient outcome from analysis of complex gene expression data.
Nayeem et al. [36] designed a classifier using three datasets collected from the UCI Machine Learning Repository for the diagnosis of heart disease, liver disorder, and lung cancer. The proposed network showed an accuracy above 80% on each dataset. An MLP with the gradient descent optimization algorithm was used to minimize the error, and the Levenberg–Marquardt algorithm was used to avoid the curve-fitting problem.
The feedforward–backpropagation algorithm achieved good performance even when the number of features was large. The authors attributed the higher accuracy of the proposed model over existing models to the use of a larger number of training samples, more than one hidden layer, and a large number of neurons in each hidden layer.
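The feedforward–backpropagation loop described above can be sketched as follows. This is a minimal single-hidden-layer network trained by plain gradient descent on mean squared error, with made-up separable data; the Levenberg–Marquardt refinement used in the study is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary-classification data (a stand-in for the UCI datasets):
# the label depends only on the first two of four features
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer of 8 sigmoid units, one sigmoid output unit
W1 = rng.normal(scale=0.5, size=(4, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(2000):
    # Forward pass
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # Backward pass: gradients of mean squared error via the chain rule
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent weight updates
    W2 -= lr * h.T @ d_out / len(X)
    W1 -= lr * X.T @ d_h / len(X)

acc = ((out > 0.5) == y).mean()
```

Gradient descent updates every weight proportionally to its error gradient each epoch; Levenberg–Marquardt instead blends this with Gauss-Newton steps, which typically converges faster on small networks at a higher per-step cost.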
Bordoloi et al. [37] compared amino acid sequences of particular proteins to predict protein secondary structure, using a fully connected MLP with a single hidden layer trained by backpropagation. Amino acid sequences were the input to the network, and the predicted structures, representing four different classes (hemoglobin, myoglobin, sickle cell anemia, and insulin), were the output. The network was trained with the backpropagation algorithm to update the weights and evaluate the error, and the Levenberg–Marquardt optimization technique was implemented to minimize the error rate produced by the model. The ANN model showed 100% accuracy in identifying the required parameters. Increasing the number of training epochs produced a higher accuracy rate, although the training time of the network also increased slightly. It was also observed that, since the dataset was large, the ANN model encountered some computational constraints during training.
Shanthi et al. [38] used a feature selection method for stroke disease classification with a multilayer perceptron. The research was carried out on a dataset of 20 attributes from 150 patients with symptoms of stroke disease. A genetic algorithm was used in a neuro-genetic approach to reduce the number of features from 20 to 14, and this reduced set of symptoms was given to the model to predict the type of stroke disease. The model was trained using the backpropagation algorithm with 7 hidden neurons and a sigmoid activation function. The results showed that this neuro-genetic approach obtained better accuracy than a traditional ANN while using fewer features.
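Genetic-algorithm feature selection of this kind treats each candidate feature subset as a bit mask and evolves a population of masks toward higher classification accuracy. A minimal sketch follows; the data are synthetic, and the fitness function here is a simple nearest-centroid classifier rather than the MLP used in the study:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 20 features, only the first 3 are informative
# (mimicking the 20 -> 14 attribute selection problem)
X = rng.normal(size=(120, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

def fitness(mask):
    """Wrapper fitness: nearest-class-centroid accuracy on the selected features."""
    if mask.sum() == 0:
        return 0.0
    Xs = X[:, mask.astype(bool)]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return (pred == y).mean()

pop = rng.integers(0, 2, size=(30, 20))  # population of random feature masks
for _ in range(40):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]      # selection: keep the 10 fittest
    children = [parents[-1].copy()]              # elitism: preserve the best mask
    while len(children) < len(pop):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, 20)
        child = np.concatenate([a[:cut], b[cut:]])  # one-point crossover
        flip = rng.random(20) < 0.05                # bit-flip mutation
        child[flip] ^= 1
        children.append(child)
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
```

Because fitness is evaluated by an actual classifier (a wrapper approach), the GA tends to discard noise features whose inclusion dilutes the centroid distances, which mirrors why the neuro-genetic model improved on the full 20-attribute ANN.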
3.4.1 Comparative Analysis of ANN With Broadly Used Traditional ML Algorithms
Table 3.1 shows recent studies conducted in the field of bioinformatics for classification, prediction, and feature selection using different machine learning techniques. Most of the studies used genomic, clinical, or molecular datasets. From the large set of publications we filtered out some of the recent papers most relevant to our research to make a comparative analysis between ANN and other machine learning models.
Table 3.1 Published articles applying ANN to biological data.
3.5 Critical Analysis
In this study we observed that the ANN classifier outperformed all other classifiers with reasonable accuracy. Here we discuss our critical observations on the various factors that improve the performance of an ANN model in achieving high accuracy. ANN learns to solve complex problems because of its massive parallel processing, adaptive learning, fault tolerance, and self-organization capabilities, which ensure high classification performance and have made it one of the most powerful tools for classification and prediction. The performance of an ANN depends on various factors, such as data pre-processing (dataset dimensionality, presence of incomplete or noisy data, feature selection), the choice of activation function, and the number of epochs and neurons. Selecting too many features, or the wrong features, can degrade the model's performance, and performance also depends on choosing the right combination of input variables and other parameters.
In the case of the SVM model, when the number of features exceeds the number of samples the model tends to perform slowly, so more work on feature selection is required. However, when a substantial amount of DNA sequencing data is available for two-class disease classification, SVM is a strong classification model. We also observed that neural networks are still able to produce reasonable results when inputs are noisy or incomplete, so the correct use of data pre-processing techniques can improve the performance of the classification model.
Another factor that influences the performance of the model is the right choice of activation function for classifying linear and non-linear data. ReLU is one of the fastest-learning activation functions and often gives more accurate results because it is simple, easy to optimize with gradient descent, and does not saturate for positive inputs.
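The gradient behaviour behind that observation can be shown in a few lines: the ReLU gradient stays at 1 for all positive inputs, while the sigmoid gradient is at most 0.25 and vanishes for large-magnitude inputs, which slows gradient-descent learning in deep networks:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs, 0 otherwise: no saturation for x > 0
    return (x > 0).astype(float)

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1 - s)  # at most 0.25, and vanishes for large |x|

x = np.array([-5.0, 0.0, 5.0])
print(relu(x))          # [0. 0. 5.]
print(relu_grad(x))     # [0. 0. 1.]
print(sigmoid_grad(x))  # gradients shrink toward 0 at both extremes
```

The flip side, not shown here, is that units stuck in the negative region receive zero gradient ("dying ReLU"), which is one reason the choice of activation function remains data- and architecture-dependent.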
It is observed that as the number of hidden layers increases, the model gives relatively higher accuracy, so the performance of the model also depends on the number of hidden layers and the number of neurons in each hidden layer. Many articles report that performance hardly improves when too few hidden layers are used, which also increases the risk of convergence to a local minimum. The network fits the training data poorly if the number of hidden neurons is too small.