
Clustering GE Data:

The goal of clustering is to partition the genes into distinct sets, such that genes assigned to the same cluster have "similar" expression patterns, while genes assigned to different clusters have "dissimilar" expression patterns.
Usually no single solution is the "true" mathematical solution to this problem, but a good clustering solution should have two merits:

(1) High "homogeneity" - average similarity between genes assigned to the same cluster.
(2) High "separation" - average distance (dissimilarity) between genes assigned to different clusters.
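
The two merits above can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and 1 - similarity as the distance; both choices, and the function names, are assumptions made for illustration and are not necessarily what the program uses internally.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def homogeneity_and_separation(patterns, labels):
    """Return (average within-cluster similarity, average between-cluster distance)."""
    within, between = [], []
    for i, j in combinations(range(len(patterns)), 2):
        sim = pearson(patterns[i], patterns[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(1.0 - sim)  # distance = 1 - similarity (assumption)
    return sum(within) / len(within), sum(between) / len(between)

# Toy data: two rising genes in cluster 0, two falling genes in cluster 1.
patterns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.0, 2.0],
]
labels = [0, 0, 1, 1]
homogeneity, separation = homogeneity_and_separation(patterns, labels)
```

In this toy example the genes within each cluster are perfectly correlated and the genes in different clusters are perfectly anti-correlated, so both measures reach their maximum under the assumed definitions.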

After a clustering algorithm has been run, a clustering info window appears. It summarizes the performance of the solution in terms of its overall homogeneity and separation, together with the homogeneity of each cluster. This summary can be used to compare different solutions.

To apply a clustering algorithm to the data, select the required algorithm from the Grouping>>Clustering menu. The options are KMeans, Click, SOM, and Input File (an existing solution file).

The Click algorithm is not designed to find clusters smaller than 15 probes, so it may fail to cluster small datasets.

Fill in the required input data in the algorithm input dialog box and press the “Ok” button.
The parameters required for each method are as follows:

Algorithm: Required parameters

KMeans: Expected number of clusters.

SOM: Grid width, grid length (width*length >= number of clusters) and number of iterations.

Click: Homogeneity value (0-1). This parameter gives the user control over the homogeneity of the resulting clustering, i.e. the average similarity between elements in the same cluster. It serves as a threshold in various steps of the algorithm, including the definition of cluster kernels, singleton adoption and kernel merging. The default value is the estimated homogeneity of the true clustering. The higher the value assigned to this parameter, the tighter the resulting clusters.

Input file: Path of the clustering solution file.
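
As an illustration of the constraints listed above, the following sketch picks a SOM grid that satisfies width*length >= number of clusters and checks that a Click homogeneity value lies in the 0-1 range. The function names `som_grid_for` and `check_click_homogeneity` are hypothetical helpers for this example; they are not part of the program.

```python
from math import isqrt

def som_grid_for(n_clusters):
    """Return a near-square (width, length) with width * length >= n_clusters."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be a positive integer")
    width = isqrt(n_clusters - 1) + 1   # smallest integer >= sqrt(n_clusters)
    length = -(-n_clusters // width)    # ceiling division
    return width, length

def check_click_homogeneity(value):
    """Check that a Click homogeneity threshold lies in the documented 0-1 range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("homogeneity must be between 0 and 1")
    return value

# A grid for at most 10 clusters: 4 x 3 = 12 grid nodes.
grid = som_grid_for(10)
threshold = check_click_homogeneity(0.7)
```

A near-square grid is only one possible choice; any width and length whose product is at least the desired number of clusters satisfies the stated condition.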

 

Details about the algorithms can be found in the relevant articles listed in the References section.

After clustering is performed, a clustering solution visualization tab is added to the main window. It contains the following views:

Information regarding the clustering algorithm, number of clusters, number of unclustered elements (singletons), and numerical measures of the clustering quality, including:

a) Clustering homogeneity = average similarity between each clustered element and the center of its cluster.

b) Clustering separation = average similarity between elements from different clusters. Because this measure is a similarity, lower values indicate better-separated clusters.

c) Clusters table - contains the number, size and homogeneity of each cluster. 
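
The center-based homogeneity and the clusters table can be sketched as follows. This is an illustration only: it assumes Pearson correlation as the similarity measure, the mean pattern as the cluster center, and a label of `None` for unclustered elements (singletons). These conventions and the function names are assumptions, not the program's documented internals.

```python
from collections import defaultdict
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def clusters_table(patterns, labels):
    """Return one row per cluster: its number, size and homogeneity."""
    members = defaultdict(list)
    for pattern, label in zip(patterns, labels):
        if label is not None:  # None marks a singleton (assumed convention)
            members[label].append(pattern)
    table = []
    for label in sorted(members):
        group = members[label]
        # Cluster center = mean expression level per condition.
        center = [sum(col) / len(group) for col in zip(*group)]
        sims = [pearson(p, center) for p in group]
        table.append({"cluster": label, "size": len(group),
                      "homogeneity": sum(sims) / len(sims)})
    return table

def overall_homogeneity(table):
    """Average element-to-center similarity over all clustered elements."""
    total = sum(row["size"] for row in table)
    return sum(row["size"] * row["homogeneity"] for row in table) / total

patterns = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0],
            [3.0, 2.0, 1.0], [3.0, 1.0, 2.0],
            [5.0, 9.0, 5.0]]
labels = [1, 1, 2, 2, None]
table = clusters_table(patterns, labels)
overall = overall_homogeneity(table)
```

Here cluster 1 is perfectly homogeneous, cluster 2 is looser, and the singleton is excluded from both the table and the overall value.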

Mean patterns of all clusters, with error bars (±1 standard deviation).

A table listing all condition titles and the corresponding numbers used in the patterns display. Upon selecting a row in this table, the corresponding column in each of the mean pattern plots is marked.

Upon selecting a cluster (from the clusters table or from the mean patterns view), the corresponding probe list, probe patterns and expression matrix are displayed on the right.
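
The mean pattern with ±1 standard deviation error bars described above can be computed per condition as in the following sketch. It assumes the sample standard deviation; whether the program uses the sample or the population form is not stated here, and the function name is hypothetical.

```python
from statistics import mean, stdev

def mean_pattern_with_error(cluster_patterns):
    """Return (lower, mean, upper) per condition for one cluster's probes."""
    bars = []
    for column in zip(*cluster_patterns):  # one column per condition
        m = mean(column)
        # Sample standard deviation (assumption); undefined for a single probe.
        s = stdev(column) if len(column) > 1 else 0.0
        bars.append((m - s, m, m + s))
    return bars

# Two probes measured under two conditions.
cluster = [[1.0, 4.0],
           [3.0, 8.0]]
bars = mean_pattern_with_error(cluster)
```

Each tuple gives the lower end of the error bar, the mean expression level, and the upper end of the error bar for one condition.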

 

 

After Promoter analysis or Functional analysis has been performed (for details, see the Group analysis tools section), if the selected cluster is found to be enriched with TF binding sites or GO annotations, the corresponding histogram and analysis information are added to the single cluster view.

A clustering solution can be saved using Grouping>>Clustering>>Save Solution, and reloaded using Grouping>>Clustering>>Load Solution.

