### Hierarchical Clustering

Calculate the distance matrix (Euclidean distance by default):

```r
dist(tab)
```

Use only the first 3 columns:

```r
dist(tab[, 1:3])
```
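As a quick sanity check of what `dist()` returns, here is a minimal sketch on a made-up data frame (the name `pts` and its values are illustrative assumptions, not part of the original data). The three points form a 3-4-5 right triangle, so the distances are easy to verify by hand, and the `method` argument switches to another metric such as Manhattan.

```r
# Toy data: three points forming a 3-4-5 right triangle (illustrative values)
pts <- data.frame(x = c(0, 3, 0), y = c(0, 0, 4))

d_euc <- dist(pts)                        # Euclidean distance (the default)
d_man <- dist(pts, method = "manhattan")  # sum of absolute coordinate differences

# dist() returns a compact lower-triangle object; as.matrix() expands it
m_euc <- as.matrix(d_euc)
m_man <- as.matrix(d_man)
print(m_euc)
```

The expanded matrix is symmetric with zeros on the diagonal, which makes it easier to read than the compact `dist` object when checking individual pairs.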

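Hierarchical clustering with Euclidean distance and average linkage (`hclust()` defaults to complete linkage, so the method has to be set explicitly):

```r
distance_matrix <- dist(tab)                        # Euclidean distance matrix
hc <- hclust(distance_matrix, method = "average")   # average linkage
```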
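To make the linkage choice concrete, the following sketch compares the default with average linkage on simulated data and then extracts group labels with `cutree()`. The object names (`toy`, `hc_complete`, `hc_average`, `groups`) and the simulated values are assumptions made for illustration only.

```r
set.seed(1)  # reproducible toy data

# Two well-separated groups of 5 points each (illustrative values)
toy <- rbind(
  matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2)
)

d <- dist(toy)
hc_complete <- hclust(d)                      # default: method = "complete"
hc_average  <- hclust(d, method = "average")  # average linkage

# Cut the tree into k = 2 groups and count the members of each group
groups <- cutree(hc_average, k = 2)
print(table(groups))
```

On data this clearly separated both linkages recover the same two groups; the choice tends to matter more when clusters overlap or have very different sizes.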

Plot the dendrogram:

```r
plot(as.dendrogram(hc))
```

Plot the dendrogram with `ggdendro`, which offers different graphic options:

```r
library(ggdendro)
ggdendrogram(hc, theme_dendro = FALSE)  # FALSE keeps the default ggplot2 theme
```

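Another dendrogram option is `myplclust()`, which colors the leaf labels. It is not part of `devtools`; a packaged version is available in `rafalib`:

```r
library(rafalib)  # provides myplclust()
# lab.col colors the labels by column value (integer codes, one per level)
myplclust(hc, lab.col = as.integer(factor(tab$col)))
abline(h = 1.5, col = "red")  # draw the dendrogram cut at height 1.5
```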
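The red line from `abline()` only marks the cut visually. To get the actual cluster membership at that height, `cutree()` with the `h` argument can be used. The sketch below assumes a tiny one-dimensional data set (`x`, `hc_toy` and `members` are illustrative names) chosen so that the merge heights are easy to follow.

```r
# Five points on a line (illustrative values)
x <- matrix(c(0, 0.5, 5, 5.4, 10), ncol = 1)

hc_toy <- hclust(dist(x))  # complete linkage by default
print(hc_toy$height)       # merge heights, in increasing order

# Cutting at height 1.5 keeps only the merges below that height
members <- cutree(hc_toy, h = 1.5)
print(members)             # 1 1 2 2 3
```

Only the two merges at heights 0.4 and 0.5 happen below the cut, so the five points fall into three clusters.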

### K-Means Clustering

K-means (non-hierarchical) clustering with two groups:

```r
kmeans(tab, centers = 2)
```

Exclude columns 11 and 12 and divide into 5 groups:

```r
Clust <- kmeans(tab[, -c(11:12)], centers = 5)
```

Build a contingency table with the clusters as rows and the distinct values of `col` as columns, to see how the values of `col` are distributed across the clusters:

```r
table(Clust$cluster, tab$col)
```
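The object returned by `kmeans()` holds more than the cluster vector. The sketch below is a self-contained illustration on simulated data; `toy`, `fit`, `xt` and the `label` column are assumed names standing in for `tab`, `Clust` and `tab$col`.

```r
set.seed(42)  # k-means starts from random centroids, so fix the seed

# Two well-separated groups with a known label (illustrative values)
toy <- data.frame(
  x = c(rnorm(20, 0, 0.5), rnorm(20, 6, 0.5)),
  y = c(rnorm(20, 0, 0.5), rnorm(20, 6, 0.5)),
  label = rep(c("a", "b"), each = 20)
)

# Cluster on the numeric columns only
fit <- kmeans(toy[, c("x", "y")], centers = 2, nstart = 10)

print(fit$size)          # number of points in each cluster
print(fit$centers)       # centroid coordinates
print(fit$tot.withinss)  # total within-cluster sum of squares

# Cross-tabulate clusters (rows) against the known labels (columns)
xt <- table(cluster = fit$cluster, label = toy$label)
print(xt)
```

Cluster numbers are arbitrary, so what matters in the table is whether each label concentrates in a single row, not which row it is.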

Other useful parameters:

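* `iter.max`: maximum number of iterations allowed before the algorithm stops (default 10).
* `nstart`: number of random sets of initial centroids to try. `nstart = 100` runs the algorithm from 100 different starting sets and keeps the solution with the lowest total within-cluster sum of squares.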
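A rough sketch of how these two parameters can be used together, on simulated data (`toy`, `fit_single` and `fit_multi` are illustrative names, and the values 100 and 50 are arbitrary choices):

```r
set.seed(123)  # reproducible toy data and starting centroids

# Three groups of 30 points each in two dimensions (illustrative values)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
)

# One random start with the default iteration cap
fit_single <- kmeans(toy, centers = 3, nstart = 1, iter.max = 10)

# 100 random starts with a higher iteration cap; the best run is kept
fit_multi <- kmeans(toy, centers = 3, nstart = 100, iter.max = 50)

# Lower total within-cluster sum of squares means tighter clusters
print(c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss))
```

A single start can get stuck in a poor local optimum, while many starts make that much less likely at the cost of extra run time.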

R: cluster analysis
