Gene Expression Clustering Methods

A gene expression pattern derived from one microarray hybridization provides a snapshot of the state of a living cell, which determines its biological behavior. As an example, the human genome contains approximately 3 billion base pairs, which have been estimated to encode about 50,000 to 100,000 genes. To complicate matters further, only a fraction of these genes are expressed in any given tissue. Alternatively, instead of treating the gene expression pattern from a given microarray experiment as a single data entity, we can examine one gene at a time across a biological process or a collection of biological samples, which gives that gene's expression profile. Clustering analysis is a powerful tool that partitions biological samples or genes into well-separated, homogeneous groups based on their statistical behavior. Its main objective is to find the similarities between experiments or between genes, given their expression ratios across all genes or all samples, respectively, and then to group similar samples or genes together for ease of understanding and visualization. Clustering methods have been studied heavily for many years and are widely applied in many areas. In this section, we discuss some of the implementations that we have employed in our gene expression analysis. They are:
1. Hierarchical Clustering Method

Assume we have m expression experiments, each measuring the same n genes. After performing microarray image analysis and data integration, we obtain an n × m matrix of gene expression ratios, where each row corresponds to one gene and each column of ratios represents the result from one expression experiment comparing the test sample to a common reference sample of choice. To simplify the discussion, we consider the algorithm only in terms of sample clustering. To achieve the objective of clustering, we first evaluate all pairwise similarities between samples, and then we employ the average linkage algorithm to group similar samples. Typically, we use the Pearson correlation coefficient or the Euclidean distance to quantify the similarity. When each sample's log ratios are standardized to zero mean and unit variance across genes, these two measures are equivalent, because the squared Euclidean distance between two such profiles is proportional to 1 − r, where r is their Pearson correlation coefficient. After evaluating the similarities for all pairs of samples, we can construct a distance matrix (Table 1a).

The hierarchical algorithm proceeds as follows. First, we look for the pair of experiments with the shortest distance, that is, the pair most similar in gene expression pattern (in the table, Exp1 and Exp2). We then construct a 'composite experiment' by averaging all log-transformed gene expression ratios from the two experiments (hence the term average-linkage algorithm), and name it Exp1-2. We next evaluate the distances from this composite experiment to all other experiments and construct a smaller matrix, as shown in Table 1b. This procedure is repeated until the distance matrix is reduced to a single element.
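
The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation referred to in this section: the function names `pearson` and `cluster_samples`, the use of 1 − r as the distance, the label format for composites, and the toy values are all assumptions made for the example. The composite is the plain gene-by-gene mean of the two merged profiles, as the text describes; weighting by the number of experiments already merged is a common variant that is not shown. The sketch also assumes that no profile is constant across genes, since the correlation is undefined in that case.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cluster_samples(profiles):
    """Merge the closest pair of samples until one cluster remains.

    profiles: dict mapping a sample label to its list of log-transformed
    ratios (one value per gene, same gene order for every sample).
    Returns the merges in order as (label_a, label_b, distance) tuples.
    """
    clusters = dict(profiles)  # label -> profile; input is left untouched
    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        # Scan the current distance matrix (distance = 1 - r) for its minimum.
        dist, a, b = min(
            (1.0 - pearson(clusters[p], clusters[q]), p, q)
            for i, p in enumerate(labels)
            for q in labels[i + 1:]
        )
        # Composite experiment: gene-by-gene mean of the two merged profiles.
        composite = [(x + y) / 2 for x, y in zip(clusters[a], clusters[b])]
        del clusters[a], clusters[b]
        clusters[f"{a}-{b}"] = composite  # hypothetical label format
        merges.append((a, b, dist))
    return merges


# Toy log ratios for three hypothetical samples (illustrative values only).
toy = {
    "Exp1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Exp2": [1.1, 2.0, 2.9, 4.2, 5.0],
    "Exp3": [5.0, 3.0, 4.0, 1.0, 2.0],
}
for a, b, dist in cluster_samples(toy):
    print(f"merge {a} + {b} at distance {dist:.3f}")
```

Each tuple in the returned list corresponds to one reduction of the distance matrix, so m samples produce m − 1 merges. Note that general-purpose libraries may use the name differently: in SciPy, for instance, `linkage(..., method="average")` averages pairwise distances between cluster members rather than averaging the profiles themselves, so results from such a routine need not match this sketch.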
The hierarchical algorithm is visualized as a dendrogram, in which each merger is represented as a branch point of a binary tree and the length of each branch indicates the distance between the two samples, as given in Table 1a-b. The implementation of the average linkage method will be available through a web interface. A downloadable version of the program is also in preparation.

2. K-means Algorithm and Fuzzy C-means Algorithm
Under construction
3. Self-Organizing Map
Under construction
4. Neural Network
Under construction