Cluster analysis by Highly Connected Subgraphs with applications to cDNA clustering

Erez Hartuv, Sep 14, 1998

We have developed a novel algorithm for cluster analysis that is based on graph theoretic techniques. A similarity graph is defined and clusters in that graph correspond to highly connected subgraphs. A polynomial algorithm to compute them efficiently is presented. Our algorithm produces a clustering with some provably good properties. The algorithm is robust to high noise levels. It is fast and scalable.
The application that motivated this study was gene expression analysis, in which the purpose is to cluster cDNAs, based on their oligonucleotide fingerprints. One wishes to partition the cDNAs into groups where each group contains cDNAs originating from one gene and the group size reflects the gene's abundance. The algorithm has been tested intensively on simulated data and was shown to outperform extant methods. In a blind test on real cDNA fingerprint data the algorithm obtained very good results. Utilizing the results of the algorithm would have saved over 70\% of the cDNA sequencing cost on that data set.