Data Analytics in Bioinformatics. Группа авторов
Читать онлайн книгу.Sample 3
These gene expression matrices are summed up into a database as shown in Figure 2.2 which acts as a repository for all the genetic information like gene regulation, genetics functionality of diseases, drug discovery based upon gene structures, response to drug projection and so on. These databases with all the genomic information when searched for particular instance of genes, it retrieves all the relevant information with high similarities using these clustering algorithms [32].
The sequence of actions for Clustering of Gene Expression (GE), include
Figure 2.2 Example matrix of gene expression (10 genes in a row and 2 samples in columns) [29–31].
A matrix which hold the data of GE structure which consists of size, dimension, number, origin, etc.
Features are extracted to identify the similarities (feature extraction). The input data is reduced without loss of information making it more suitable to application is called feature selection or feature reduction. The feature reduction can be achieved using few common algorithms for example Principal Component Analysis [11].
Based upon similarities or features, all the genes which express similar expression are made into one cluster and which are distinct are grouped into another cluster. The degree of similarity or the proximity levels are computed using certain distance metrics applied on gene expression data provided. Euclidean distance, Manhattan distance, Mahalanobis distance, etc. are few metrics applied to quantify the level of similarity in order to form a cluster.
2.3.2 Clustering Algorithms
The genetic data (genes) are closely connected. Based upon the dataset appropriate clustering algorithms need to be chosen. These clustering approaches also reveals about the relationship between the clusters based upon the similarity. This section explains various partition algorithms. Figure 2.3 summarizes various clustering based partitions those can be used in Bioinformatics.
Figure 2.3 Partition clustering algorithms.
2.3.3 Partition Algorithms
2.3.3.1 k-Means Clustering
This algorithm partitions the dataset of n objects into k clusters in the feature space. The value of k (number of clusters) is fixed beforehand. Initially the algorithm randomly assumes k number of cluster centers (cluster mean) in the feature space. The distance (computed using Euclidean distance) of each gene sample is computed from all these cluster means. The sample is then grouped under the cluster that has minimum distance from the sample. The same process is followed for all the samples. Once all the samples are grouped under one of these clusters, each cluster mean is computed again considering the entire samples in that cluster and the cluster mean is updated. The above steps are iterated until the algorithm forms k stable clusters.
The drawback of such algorithm is that determining the clusters (k) prior to implementation reduces the effectiveness of cluster formation and is prone to noisy data in gene expression values [13].
2.3.3.2 Cluster Center Initialization Algorithm (CCIA)
In traditional k-means algorithm the centroids of clusters are randomly selected initially, yielding low quality clusters. As an attempt in producing high quality clustering CCIA is introduced. This algorithm can be referred as an enhancement to k-means algorithm in which centroids (k) are accurately selected which are distant from the outliers. This selection process is done by computing mean and standard deviation on the data elements. CCIA now examines most nearest set of elements from the dataset (D) and fabricates into a new dataset (X) after discarding the outliers. Later those set of elements are eliminated from the dataset (D). This process is iterated until the dataset (X) equals with (k).
Drawback—This algorithm is incapable in dealing with closely connected clusters and cannot deal with high dimensional data [14, 15].
2.3.3.3 Intelligent Kernel k-Mean (IKKM)
Generally gene expression data is nonlinearly separable and the gene data is represented in a matrix format using linear kernel function as shown in Figure 2.2. IKKM is applied on the kernel matrix in which clusters need not be predefined.
The algorithmic approach of IKKM is carried by calculating the centre of mass of the entire dataset and then identifies the element (×1) which is distant from the centre of mass (C) another element (×2) is calculated distant from c1 is calculated. Now the elements near to c1 are moved into cluster (S1) and the elements near to c2 are moved into another cluster (S2). This process is repeated until no change is observed in the clusters. In contrast to traditional k-means approach this algorithm implementation doesn’t require predicting the number of clusters in advance.
Drawback—Cannot deal with high dimensional dataset [16].
2.3.3.4 Clustering Large Applications (CLARA)
Rather than considering and calculating medoids for the entire dataset CLARA considers a subset from the dataset with a constant size upon which PAM(partitioning around medoids) algorithm is applied. PAM algorithm aims at reducing the degree of similarity by calculating the distance between the cluster medoids and the elements in the subset space. As a result an optimal set of medoids are obtained which requires less store space when compared to other algorithms. In contrast to other partitioning algorithms this measures the degree of dissimilarity amongst the elements and the medoid of the cluster. This algorithm is applied on huge datasets producing accurate results with less computation time [17].
Drawback—Cannot deal with higher dimensional data and closely connected clusters which leaves this algorithm a second choice in gene expression data.
2.3.4 Hierarchical Clustering Algorithms
Hierarchical clustering