Prev   Next   Top

Clustering GE Data

 

The goal of clustering is to partition the genes into distinct sets such that genes that are assigned to the same cluster should have similar expression patterns, while genes assigned to different clusters should have non-similar expression patterns.
Usually there is no one solution that is the ‘true’ mathematical solution for this problem, but a good clustering solution should have two merits:

(1)   High homogeneity (average similarity between genes from the same cluster).

(2)   High separation (average distance/dissimilarity between genes from different clusters).

After operating one of the clustering algorithms a clustering results view appears. The view contains information about the solution and its quality including the method and parameters that were used to obtain it, number of clusters, number of singletons (probes that were not assigned to any cluster), overall homogeneity and separation, as well as the size and homogeneity of each cluster. This summary can be used to compare different solutions. 

In order to apply a clustering algorithm to the data, select the required algorithm from the Grouping >> Clustering menu (options are: KMeans, CLICK, SOM). You can also load an existing clustering solution from a file by selecting the Load Solution option from the Clustering menu (For details regarding the clustering solution file format, refer to the File Formats section). For details about Hierarchical clustering, please go to Hierarchical Clustering.

The CLICK algorithm is not designed to find clusters under the size of 15 probes, so it might fail in clustering small datasets.

Fill the required input data in the algorithm input dialog box and press the ‘Ok’ button.
The parameters required for each method are as follows:

 

Algorithm

Required parameters

KMeans

Expected number of clusters.

SOM

Grid width, grid length (width*length >= number of clusters) and number of iterations. 

CLICK

Homogeneity value (0-1): allows the user control over the homogeneity of the resulting clustering, i.e. the average similarity between elements in the same cluster. This parameter serves as a threshold in various steps in the algorithm, including the definition of cluster kernels, singleton adoptions and kernel merging. The default value for this parameter is the estimated homogeneity of the true clustering. The higher the value assigned to this parameter the tighter the resulting clusters.


Details about the algorithms can be obtained through the relevant articles in the References section.

After clustering is performed, a clustering solution visualization tab is added to the main window. It contains the following views:

Information regarding the clustering algorithm, number of clusters, number of un-clustered elements (singletons), and numerical measures of the clustering quality, including:

a)     Overall average homogeneity - calculated as the average value of similarity between each element and the center of the cluster to which it has been assigned, weighted according to the size of the cluster.

b)     Overall average separation – calculated as the average similarity between mean patterns of different clusters, weighted according to their sizes.

c)     Clusters table - contains the number, name (label), size and homogeneity of each cluster. The name of a cluster can be changed by editing the corresponding cell in the table.

Mean Patterns of all clusters with error bars (±1 STD).

A table listing all condition titles and their corresponding number used in the patterns display. Upon selecting a row in this table, the corresponding column in each of the mean pattern plots is marked.

Upon selecting a cluster (from the clusters table or from the mean patterns view), the corresponding cluster pane is displayed on the right. It contains a list of probes, probe patterns, expression matrix (heat map) and the chromosomal locations of the genes. Similarity matrices for probes within the cluster as well as for conditions are also displayed in this tab, if the relevant options in the display settings are selected (see the Settings section). If a network file has been loaded (via Data>>Load Network), the sub-graph, induced by the cluster is also displayed in the cluster pane.

 

 

 

 

After performing group analysis (for details see the Group Analysis Tools), if enrichment has been detected in the selected cluster, the corresponding histogram and analysis information are added to the single cluster view.

A clustering solution can be saved using the Grouping >> Clustering >> Save Solution, and reloaded using the Grouping >> Clustering >> Load Solution.

 

Defining a group according to similarity to a selected probe


This can be done by selecting Grouping >> Group by Pattern Similarity. An input dialog box allows setting the similarity measure (Pearson correlation, Spearman correlation or Euclidean distance) and reference probe ID as well as the expected group size. The visualization for this operation is similar to the clustering results visualization (described above).

 


Prev   Next   Top