Performing a cluster analysis in R

Cluster analysis

A cluster analysis allows you to summarise a dataset by grouping similar observations together into clusters. Observations are judged to be similar if they have similar values for a number of variables (i.e. a short Euclidean distance between them).

You can perform a cluster analysis with the dist and hclust functions. The dist function calculates a distance matrix for your dataset, giving the Euclidean distance between any two observations. The hclust function performs hierarchical clustering on a distance matrix. So to perform a cluster analysis from your raw data, use both functions together as shown below.

> modelname<-hclust(dist(dataset))

The command saves the results of the analysis to an object named modelname.
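To see what the two functions return, here is a minimal sketch on a small made-up data frame. The name toy and its values are hypothetical and are used only for illustration. Note that hclust uses complete linkage unless you give it a different method argument.

```r
# Hypothetical toy data: five observations measured on two variables
toy <- data.frame(x = c(1, 2, 9, 11, 20),
                  y = c(1, 2, 9, 11, 20))

# dist() returns the pairwise Euclidean distances between the rows
d <- dist(toy)
print(d)

# hclust() repeatedly joins the two closest observations or clusters;
# the default linkage method is "complete"
toyclust <- hclust(d)
print(toyclust$merge)   # which observations or clusters were joined at each step
print(toyclust$height)  # the distance at which each join happened
```

Each row of the merge component describes one join, with negative numbers denoting single observations and positive numbers denoting clusters formed at an earlier step.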

The results of a cluster analysis are best represented by a dendrogram, which you can create with the plot function as shown.

> plot(modelname)

By default, the row numbers or row names are used to label the observations. However, you can use the labels argument to select a variable to use for the labels.

> plot(modelname, labels=dataset$variable)

To ‘cut’ the dendrogram to identify a given number of clusters, use the rect.hclust function immediately after the plot function as shown below:

> plot(modelname)
> rect.hclust(modelname, n)

where n is the number of clusters that you want to identify.

Alternatively, you can cut the dendrogram at a specific height by using the h argument instead of giving the number of clusters.

> plot(modelname)
> rect.hclust(modelname, h=height)
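The two ways of cutting can be compared on a small example. The sketch below uses a hypothetical data frame called toy with made-up values and labels. It also stores the value returned by rect.hclust, which is a list giving the members of each outlined cluster.

```r
# Hypothetical toy data: five observations on two variables
toy <- data.frame(x = c(1, 2, 9, 11, 20),
                  y = c(1, 2, 9, 11, 20))
toyclust <- hclust(dist(toy))

# Draw the dendrogram with custom labels, then outline three clusters
plot(toyclust, labels = c("a", "b", "c", "d", "e"))
groups_k <- rect.hclust(toyclust, k = 3)

# Redraw the dendrogram and cut at a height of 10 instead
plot(toyclust, labels = c("a", "b", "c", "d", "e"))
groups_h <- rect.hclust(toyclust, h = 10)

# Each list element holds the observations inside one rectangle
print(groups_k)
```

For this toy data, cutting at a height of 10 happens to give the same three clusters as asking for three clusters directly. In general, the number of clusters you get from a height depends on where the joins in your dendrogram occur.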

To save the cluster numbers to a new variable in the dataset, use the cutree function.

> dataset$clusternumber<-cutree(modelname, n)
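Once the cluster numbers are saved as a variable, you can use them like any other grouping variable. The following is a minimal sketch on a hypothetical data frame named toy. The summaries shown (table and aggregate) are just one possible way of inspecting the clusters.

```r
# Hypothetical toy data: five observations on two variables
toy <- data.frame(x = c(1, 2, 9, 11, 20),
                  y = c(1, 2, 9, 11, 20))

# Fit the model before adding the new column, so that the cluster
# numbers are not themselves included in the distance calculation
toyclust <- hclust(dist(toy))

# Save the cluster membership for a three-cluster solution
toy$clusternumber <- cutree(toyclust, k = 3)
print(toy)

# Number of observations in each cluster
print(table(toy$clusternumber))

# Mean of each variable within each cluster
print(aggregate(cbind(x, y) ~ clusternumber, data = toy, FUN = mean))
```

If you need to refit the model after adding the cluster numbers, remember to leave that column out of the data you pass to dist.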

Example: Cluster analysis of europe dataset

Consider the europe dataset, which is available in CSV format here. The data is taken from the CIA World Factbook and gives some information about 28 European countries.

> europe
          Country   Area   GDP Inflation Life.expect Military Pop.growth Unemployment
1         Austria  83871 41600       3.5       79.91     0.80       0.03          4.2
2         Belgium  30528 37800       3.5       79.65     1.30       0.06          7.2
3        Bulgaria 110879 13800       4.2       73.84     2.60      -0.80          9.6
4         Croatia  56594 18000       2.3       75.99     2.39      -0.09         17.7
5  Czech Republic  78867 27100       1.9       77.38     1.15      -0.13          8.5
6         Denmark  43094 37000       2.8       78.78     1.30       0.24          6.1
...
28 United Kingdom 243610 36500       4.5       80.17     2.70       0.55          8.1

To perform the cluster analysis and save the results to an object, use the command:

> euroclust<-hclust(dist(europe[-1]))
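The europe[-1] part of the command drops the first column (Country), because dist needs numeric variables. The sketch below illustrates this using only the six rows printed above, stored under the hypothetical name europe6, so its results will not match the full 28-country analysis. It also shows scale as one optional way of standardising the variables. That step is an assumption added here for comparison and is not part of the analysis in this example.

```r
# First six rows of the europe dataset, copied from the listing above
europe6 <- data.frame(
  Country      = c("Austria", "Belgium", "Bulgaria", "Croatia",
                   "Czech Republic", "Denmark"),
  Area         = c(83871, 30528, 110879, 56594, 78867, 43094),
  GDP          = c(41600, 37800, 13800, 18000, 27100, 37000),
  Inflation    = c(3.5, 3.5, 4.2, 2.3, 1.9, 2.8),
  Life.expect  = c(79.91, 79.65, 73.84, 75.99, 77.38, 78.78),
  Military     = c(0.80, 1.30, 2.60, 2.39, 1.15, 1.30),
  Pop.growth   = c(0.03, 0.06, -0.80, -0.09, -0.13, 0.24),
  Unemployment = c(4.2, 7.2, 9.6, 17.7, 8.5, 6.1)
)

# Dropping the first column leaves only the seven numeric variables
print(names(europe6[-1]))

# Distances on the raw values: variables measured in large units
# (Area, GDP) contribute far more than the percentage variables
raw_d <- dist(europe6[-1])
clust6 <- hclust(raw_d)

# Optional alternative (not used in this example): standardise each
# variable to mean 0 and standard deviation 1 before computing distances
scaled_d <- dist(scale(europe6[-1]))
print(round(as.matrix(scaled_d), 2))
```

Whether to standardise depends on whether you want every variable to carry similar weight. A model built on scaled data would generally produce a different dendrogram from the one shown in this example.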

To plot the dendrogram, use the command:

> plot(euroclust, labels=europe$Country)

The result is shown below.

[Figure: dendrogram of the europe dataset]

To add rectangles identifying five clusters, use the command:

> rect.hclust(euroclust, 5)

The result is shown below.

[Figure: dendrogram of the europe dataset with five clusters outlined]

From the dendrogram, we can see that the cluster analysis has placed Ukraine in its own group; Spain and Sweden in the second group; the UK, Finland, Germany and others in the third group; Bulgaria, Greece, Austria and others in the fourth group; and Luxembourg, Estonia, Slovakia and others in the fifth group.

