
      (3.2) Y = 1 / (1 + e^(−t))

      Equation (3.2) is known as the sigmoid (logistic) function.

      (3.3) t = θ0 + θ1x1 + θ2x2 + … + θnxn = θ^T x

      As t tends toward positive infinity, the predicted variable Y approaches 1; as t tends toward negative infinity, Y approaches 0.

      Mathematically this can be written as

      (3.4) Y → 1 as t → +∞,  and  Y → 0 as t → −∞

      The model computes the probability that y = 1 for a given x, parameterized by θ:

      (3.5) hθ(x) = P(y = 1 | x; θ)

      (3.6) P(y = 1 | x; θ) + P(y = 0 | x; θ) = 1
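      As a quick numerical check of Equations (3.2) and (3.4), the short Python snippet below (an illustrative sketch, not code from the chapter) evaluates the sigmoid at a few values of t and shows the output approaching 0 and 1 at the extremes.

import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function of Equation (3.2)."""
    return 1.0 / (1.0 + np.exp(-t))

for t in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"t = {t:6.1f}  ->  Y = {sigmoid(t):.4f}")
# t = -10 gives Y close to 0, t = +10 gives Y close to 1, and t = 0 gives Y = 0.5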

      While implementing the proposed method, we select a batch of r genes at random. This r × n dataset (r denotes the number of genes and n the number of samples) is partitioned into two sets, i.e., training and test: a certain percentage, say p%, of the data is chosen as the training set and the rest is used as the test set.
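      A minimal sketch of this sampling and splitting step is given below, assuming the expression data are held in a pandas DataFrame G with samples as rows and genes as columns and the class labels in a separate array named labels; the variable names and the 80/20 split are illustrative choices, not values fixed by the text.

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

r = 50                                           # number of genes drawn in this iteration
candidate_genes = rng.choice(G.columns, size=r, replace=False)

X = G[candidate_genes]                           # expression values of the r chosen genes
y = labels                                       # 0 = normal sample, 1 = cancerous sample

# p% (here 80%) of the samples for training, the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)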

      Features need to be scaled before applying PCA. The standard scaler is used here to standardize the features of the available dataset onto unit scale (mean 0 and variance 1). PCA is then applied with α passed as the components parameter, which means that scikit-learn chooses the minimum number of principal components such that α% of the variance is retained.
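      The scaling and dimensionality-reduction step can be sketched with scikit-learn as follows; passing a float between 0 and 1 as n_components tells PCA to keep the smallest number of principal components that explains that fraction of the variance.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()                        # center to mean 0, scale to variance 1
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)            # reuse the training statistics

alpha = 0.95                                     # retain 95% of the variance (α)
pca = PCA(n_components=alpha)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)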

      After applying PCA, the selected genes are fitted to the LR model. The test data and the predicted values are compared, and an accuracy score is calculated. To obtain genes with good accuracy scores, PCA and the LR model are fitted at each iteration step, with r random genes selected every time. After the iterative process completes, the final list is sorted in descending order of the calculated accuracy score and the top genes are selected.
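      The scoring step can then be sketched as below: fit logistic regression on the reduced training data, score it on the held-out samples, and remember each gene batch together with its accuracy so the batches can later be ranked; results is an illustrative accumulator, not a name used in the chapter.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)

acc = accuracy_score(y_test, y_pred)             # fraction of correctly classified test samples
results.append((acc, list(candidate_genes)))     # keep the gene batch with its score

# after all iterations: best-scoring gene batches first
results.sort(key=lambda item: item[0], reverse=True)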

      3.3.2 Flowchart

Figure: Flowchart of the PC-LR algorithm.

      3.3.3 Algorithm

      Step 1: Read the dataset G = {GN, GC} and form G = (GN)^T ∪ (GC)^T.

      Step 2: Partition the dataset into training and test sets in a chosen ratio.

      Step 3: Standardize the dataset values onto unit scale.

      Step 4: Repeat Steps 5 to 9 until all genes have been selected and processed.

      Step 5: Select a batch of r genes at random and mark them as processed.

      Step 6: Apply PCA on the selected genes so that α% of the variance is retained.

      Step 7: Train the Logistic Regression model on the selected genes.

      Step 8: Compute the accuracy score by comparing the predicted values with the test data.

      Step 9: If accuracy_score > threshold_value, then store these genes in a new list as the resultant set of genes.

      [end if]

      Step 10: Stop.

      3.3.4 Interpretation of the Algorithm

      The gene expression values in the dataset vary widely in magnitude. The numerical columns need to be brought onto a common scale without distorting the differences in their ranges of values; therefore, standardization is used. Standardization is a form of scaling in which the values are centered on the mean with unit standard deviation. To start the iterative process, a set of r genes is selected at random and passed on to train the model. Once selected, these genes are marked so that they will not be chosen again in later iterations. In this dataset, the number of features (genes) is very large compared with the number of samples. To reduce the curse of dimensionality, PCA is applied to the selected r genes so that a certain percentage, say α%, of the variance is retained. After PCA, these genes are used to train the LR model. Once the model is trained, the test data are used to predict the outcome. The predicted outcomes are then checked against the actual outcomes of the test data and the accuracy is calculated. If the accuracy is found to be above 85%, the genes are extracted and stored in a list of candidate genes. The entire process is repeated until all genes in the dataset have been marked as selected and the accuracy is found to be acceptable.
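      Putting Steps 1 to 10 together, a consolidated sketch of the PC-LR loop might look as follows. It assumes that expr is a samples-by-genes pandas DataFrame obtained after transposing and merging GN and GC, that labels holds the 0/1 class labels, and that the function and parameter names are illustrative rather than the authors' own code.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def pc_lr(expr, labels, r=50, test_size=0.2, alpha=0.95, threshold=0.85, seed=0):
    """Iteratively score random batches of r genes and keep those whose
    PCA + Logistic Regression accuracy exceeds the threshold."""
    rng = np.random.default_rng(seed)
    remaining = list(expr.columns)       # genes not yet marked as processed
    selected = []                        # candidate cancer-mediating genes

    while remaining:
        # Step 5: draw a batch of r genes and mark them as processed
        batch = list(rng.choice(remaining, size=min(r, len(remaining)), replace=False))
        remaining = [g for g in remaining if g not in batch]

        # Step 2: train/test split of the samples for this batch
        X_train, X_test, y_train, y_test = train_test_split(
            expr[batch], labels, test_size=test_size, random_state=seed)

        # Step 3: standardize onto unit scale
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Step 6: keep enough components to retain alpha (e.g., 95%) of the variance
        pca = PCA(n_components=alpha)
        X_train = pca.fit_transform(X_train)
        X_test = pca.transform(X_test)

        # Steps 7-8: train LR and score it on the held-out samples
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))

        # Step 9: keep the batch if its accuracy beats the threshold (e.g., 85%)
        if acc > threshold:
            selected.extend(batch)

    return selected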

      3.3.5 Illustration

      The available dataset exists in two states, i.e., GN and GC, where GN denotes the dataset of the non-cancerous state and GC the dataset of the cancerous state. The designed algorithm is examined on both the lung and the colon datasets. GN and GC are combined into one dataset G. The dataset is then transposed, i.e., rows become columns and columns become rows. A target variable Y is chosen and the dataset is divided into dependent (Y) and independent (X) data.

      After dividing into training and test data, the features of the dataset are scaled onto unit scale. PCA is then fitted to the training and test data of X so as to retain 95% of the variance. Next, LR is fitted on the training data of X and Y, and predicted values are calculated using the test data of X. Finally, the accuracy score is calculated by comparing the test data of Y with the predicted values. If the accuracy is found to be more than 85%, those genes are considered cancer-mediating genes and stored in a new list as the result set.
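      Under the same assumptions, the concrete settings of this illustration (95% of the variance retained, 85% accuracy cut-off) map onto the sketch above roughly as follows; the file names and the label construction are hypothetical and only serve the example.

import numpy as np
import pandas as pd

# GN and GC stored with genes as rows and samples as columns, as in the text
GN = pd.read_csv("GN_expression.csv", index_col=0)
GC = pd.read_csv("GC_expression.csv", index_col=0)

expr = pd.concat([GN.T, GC.T])                              # transpose so samples become rows
labels = np.array([0] * GN.shape[1] + [1] * GC.shape[1])    # 0 = normal, 1 = cancer

candidate_genes = pc_lr(expr, labels, alpha=0.95, threshold=0.85)
print(len(candidate_genes), "genes passed the 85% accuracy cut-off")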

      3.4 Result

      The output of the proposed algorithm is a set of genes whose expression levels change significantly and which can therefore be regarded as genes correlated with cancer. The algorithm is experimented with authentic datasets accessible from the NCBI database. Two datasets, viz., lung and colon, have

