Category Archives: Multivariate analysis

Performing a cluster analysis in R

Cluster analysis

A cluster analysis allows you summarise a dataset by grouping similar observations together into clusters. Observations are judged to be similar if they have similar values for a number of variables (i.e. a short Euclidean distance between them).

You can perform a cluster analysis with the dist and hclust functions. The dist function calculates a distance matrix for your dataset, giving the Euclidean distance between any two observations. The hclust function performs hierarchical clustering on a distance matrix. So to perform a cluster analysis from your raw data, use both functions together as shown below.

> modelname<-hclust(dist(dataset))

The command saves the results of the analysis to an object named modelname.

The results of a cluster analysis are best represented by a dendrogram, which you can create with the plot function as shown.

> plot(modelname)

Be default, the row numbers or row names are used to label the observations. However you can use the labels argument to select a variable to use for the labels.

> plot(modelname, labels=dataset$variable)

To ‘cut’ the dendrogram to identify a given number of clusters, use the rect.hclust function immediately after the plot function as shown below:

> plot(modelname)
> rect.hclust(modelname, n)

where n is the number of clusters that you want to identify.

Alternatively you can cut the dendrogram at a specific height by adding the h argument.

> plot(modelname)
> rect.hclust(modelname, h=height)

To save the cluster numbers to a new variable in the dataset, use the cutree function.

> dataset$clusternumber<-cutree(modelname, n)

Example: Cluster analysis of europe dataset

Consider the europe dataset, which is available in CSV format here. The data is taken from the CIA World Factbook and gives some information about 28 european countries.

> europe
          Country   Area   GDP Inflation Life.expect Military Pop.growth Unemployment
1         Austria  83871 41600       3.5       79.91     0.80       0.03          4.2
2         Belgium  30528 37800       3.5       79.65     1.30       0.06          7.2
3        Bulgaria 110879 13800       4.2       73.84     2.60      -0.80          9.6
4         Croatia  56594 18000       2.3       75.99     2.39      -0.09         17.7
5  Czech Republic  78867 27100       1.9       77.38     1.15      -0.13          8.5
6         Denmark  43094 37000       2.8       78.78     1.30       0.24          6.1
28 United Kingdom 243610 36500       4.5       80.17     2.70       0.55          8.1

To perform the cluster analysis and save the results to an object, use the command:

> euroclust<-hclust(dist(europe[-1]))

To plot the dendrogram, use the command:

> plot(euroclust, labels=europe$Country)

The result is shown below.


To add rectangles identifying five clusters, use the command:

> rect.hclust(euroclust, 5)

The result is shown below.


From the dendrogram, we can see that the cluster analysis has placed Ukraine in it’s own group; Spain and Sweden in the second group; the UK, Finland, Germany and others in the third group; Bulgaria, Greece, Austria and others in the fourth group; and Luxembourg, Estonia, Slovakia and others in the fifth group.

Creating a scree plot in R

Scree plot

A scree plot displays the proportion of the total variation in a dataset that is explained by each of the components in a principle component analysis. It helps you to identify how many of the components are needed to summarise the data.

To create a scree plot of the components, use the screeplot function.

> screeplot(modelname)

where modelname is the name of a previously saved principle component analysis, created with the princomp function as explained in the article Performing a principle component analysis in R.

Example: Scree plot for the iris dataset

In a the article Performing a principal component analysis with R we performed a principle component analysis for the iris dataset, and saved it to an object named irispca.

To create a scree plot of the components, use the command:

> screeplot(irispca)

The result is shown below.

Scree plot

From the scree plot we can see that the amount of variation explained drops dramatically after the first component. This suggests that just one component may be sufficient to summarise the data.

Performing a principal component analysis in R

Principal component analysis

A principal component analysis (or PCA) is a way of simplifying a complex multivariate dataset. It helps to expose the underlying sources of variation in the data.

You can perform a principal component analysis with the princomp function as shown below.

> princomp(dataset)

The dataset should contain numeric variables only. If there are any non-numeric variables in your dataset, you must exclude them with bracket notation or with the subset function.

The princomp output displays the standard deviations of the components. However there are more elements of the output that are not automatically displayed, including the loadings and scores. You can save the all of this output to an object, as shown below.

> modelname<-princomp(dataset)

Once you have saved the output to an object, you can use further functions to view the various elements of the output. For example, you can use the summary function to view the proportion of the total variance explained by each component:

> summary(modelname)

To view the loadings for each component, use the command:

> modelname$loadings

Similarly you can view the scores for each of the observations as shown:

> modelname$scores

To create a scree plot, please see the article Creating a scree plot with R.

Example: Principal component analysis using the iris data

Consider the iris dataset (included with R) which gives the petal width, petal length, sepal width, sepal length and species for 150 irises. To view more information about the dataset, enter help(iris).

You can view the dataset by entering the dataset name:

> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
150          5.9         3.0          5.1         1.8  virginica

The dataset contains a factor variable (Species) which must be excluded when performing the PCA. So to perform the analysis and save the results to an object, use the command:

> irispca<-princomp(iris[-5])

To view the proportion of the total variance explained by each component, use the command:

> summary(irispca)
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000

From the output we can see that 92.4% of the variation in the dataset is explained by the first component alone, and 97.8% is explained by the first two components.

To view the loadings for the components, use the command:

> irispca$loadings
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.361 -0.657  0.582  0.315
Sepal.Width         -0.730 -0.598 -0.320
Petal.Length  0.857  0.173        -0.480
Petal.Width   0.358        -0.546  0.754

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

To view the scores for each observation, use the command:

> irispca$scores
             Comp.1       Comp.2       Comp.3        Comp.4
  [1,] -2.684125626 -0.319397247  0.027914828  0.0022624371
  [2,] -2.714141687  0.177001225  0.210464272  0.0990265503
  [3,] -2.888990569  0.144949426 -0.017900256  0.0199683897
  [4,] -2.745342856  0.318298979 -0.031559374 -0.0755758166
  [5,] -2.728716537 -0.326754513 -0.090079241 -0.0612585926
  [6,] -2.280859633 -0.741330449 -0.168677658 -0.0242008576
  [7,] -2.820537751  0.089461385 -0.257892158 -0.0481431065
  [8,] -2.626144973 -0.163384960  0.021879318 -0.0452978706
  [9,] -2.886382732  0.578311754 -0.020759570 -0.0267447358
 [10,] -2.672755798  0.113774246  0.197632725 -0.0562954013
[150,]  1.390188862  0.282660938 -0.362909648 -0.1550386282

This example is continued in the article Creating a scree plot with R.

Social Widgets powered by