R: cluster analysis

Hierarchical Clustering

Calculate distance matrix (euclidean distance as default):

dist(tab)

Use first 3 columns:

dist(tab[,1:3])

Hierarchical clustering with euclidean distance and average linkage:

hc <- hclust(distance_matrix)

Print the dendrogram:

plot(as.dendrogram(hc))

Dendrogram function with different graphic options:

library(ggdendro)
ggdendrogram(hc, theme_dendro = FALSE)

Another dendrogram option:

library(devtools)
myplclust(hc, lab.col = unclass(tab$col))  #lab.col to color based on column value
abline(h=1.5,col="red")   #dendrogram cut

K-Means Clustering

k-means non-hierarchical clustering with two groups:

kmeans(tab,centers=2)

Exclude columns 11 and 12 and divide into 5 groups:

Clust <- kmeans(tab[,-c(11:12)], centers=5)

Built a table having every different version of col for columns and the different groups as rows, to see how the col value are distributed inside the groups:

table(Clust$cluster, tab$col)

Other useful parameters:

  • iter.max: max iteration number before stop.

  • nstart: different number of centroids to stard. nstart = 100 means 100 different test with different centroids, then it can choose the best.