Pages

Hierarchical Clustering in R

Hierarchical clustering is a technique for grouping samples/data points into categories and subcategories based on a similarity measure. Being the powerful statistical package it is, R has several routines for doing hierarchical clustering. The basic command for doing HC is

hclust(d, method = "complete", members=NULL)


Nearly all clustering approaches use a concept of distance. Data points can be conceptualized in a dimensional space where we can compute a distance measure between each pair of points. Points that are closer together get placed in groups or "cluster" together. The input for the hcluster method in R is a distance matrix, so we'll have to compute one of those somehow.

For this example, I'm using an Identity By State matrix, computed with SNPDUO software. This matrix essentially gives us the percentage of alleles two people share. It doesn't output a matrix per se, so I had to transform the output with a perl script. I get something like this out of SNPDUO:

Family1,Indiv1,Family2,Indiv2,MeanIBS
Family1,Indiv1,Family2,Indiv3,MeanIBS
Family1,Indiv1,Family2,Indiv4,MeanIBS
Family1,Indiv2,Family2,Indiv3,MeanIBS
...

For each pair of individuals in my ped file. I need it like this:

Indiv1 Indiv2 Indiv3 Indiv4 ....
Indiv1
Indiv2
Indiv3
Indiv4
....

Its a fairly simple scripting job... Now that I have a MATRIX, I load that into R as a dataset.

ibs<-read.delim("C:/Users/bushws1/Desktop/matrix.txt",Header=FALSE)


Next, we'll run hclust using this distance matrix
HC <- hclust(as.dist(ibs))


The as.dist(ibs) transforms the dataframe called "ibs" into a distance matrix in the R environment. Using this command, we are asking R to generate hierarchical groups of individuals from our data based on genetic similarity. We now have a hierarchical clustering object called "HC". We can plot the results of our cluster analysis using this command:
plot(HC)

If your dataset is small, this might work well for you, but for most genomics applications, you'll get a tree-shaped fuzzball like this:


The solution to this is to load a library from the Bioconductor package, called "ctc". This will let us export the cluster object in the Newick file format. It can then be imported into other more powerful graphing programs. This is done like so:

library(ctc)
write.table(hc2Newick(HC), file="C:/Users/bushws1/Desktop/ibs_new.txt",row.names=FALSE,col.names=FALSE)

You now have a file in Newick format, but R puts quotes around the output for some annoying reason. Open the file in notepad and remove the quotes and it should be ready to use.

To get a better, more readable plot, download "Dendroscope" from the University of Tubingen (nice place, by the way). Dendroscope will let you import the Newick file you created and gives you extensive plotting options. Check out this wicked Circular Cladogram...



In this example, there isn't an easy, intuitive explanation for the groupings, but for other examples, the groups might make more sense. There are lots of options for computing the clustering, and they may give very different results, so proceed with caution, but in general hierarchical clustering can be a useful tool for lots of data analysis situations.