Cluster analysis by Highly Connected Subgraphs with
applications to cDNA clustering
Erez Hartuv,
Sep 14, 1998
We have developed a novel
algorithm for cluster analysis that is based on graph theoretic techniques.
A similarity graph is defined and clusters in that graph correspond to
highly connected subgraphs. A polynomial
algorithm to compute them efficiently is presented. Our algorithm
produces a clustering with some provably good properties.
The algorithm is robust to high noise levels.
It is fast and scalable.
The application that motivated this study was gene expression analysis, in
which the purpose is to cluster cDNAs, based on their oligonucleotide
fingerprints.
One wishes to partition the cDNAs into groups where each group contains cDNAs
originating from one gene and the group size reflects the gene's abundance.
The algorithm has been tested intensively on simulated data and was shown
to outperform extant methods.
In a blind test on real cDNA fingerprint data the algorithm obtained very
good results. Utilizing the results of the algorithm would have saved
over 70\% of the cDNA sequencing cost on that data set.