Performing a cluster analysis in R

Cluster analysis

A cluster analysis allows you to summarise a dataset by grouping similar observations together into clusters. Observations are judged to be similar if they have similar values for a number of variables (i.e. a short Euclidean distance between them).

You can perform a cluster analysis with the dist and hclust functions. The dist function calculates a distance matrix for your dataset, giving the Euclidean distance between any two observations. The hclust function performs hierarchical clustering on a distance matrix. So to perform a cluster analysis from your raw data, use both functions together as shown below.

> modelname<-hclust(dist(dataset))

The command saves the results of the analysis to an object named modelname.

The results of a cluster analysis are best represented by a dendrogram, which you can create with the plot function as shown.

> plot(modelname)

By default, the row numbers or row names are used to label the observations. However, you can use the labels argument to select a variable to use for the labels.

> plot(modelname, labels=dataset$variable)

To ‘cut’ the dendrogram to identify a given number of clusters, use the rect.hclust function immediately after the plot function as shown below:

> plot(modelname)
> rect.hclust(modelname, n)

where n is the number of clusters that you want to identify.

Alternatively you can cut the dendrogram at a specific height by adding the h argument.

> plot(modelname)
> rect.hclust(modelname, h=height)

To save the cluster numbers to a new variable in the dataset, use the cutree function.

> dataset$clusternumber<-cutree(modelname, n)
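
Putting these steps together, the whole workflow can be sketched with the built-in USArrests dataset (an illustrative sketch only; the choice of four clusters here is arbitrary):

```r
# Illustrative sketch using the built-in USArrests dataset (all columns numeric)
d <- dist(USArrests)                  # Euclidean distance matrix
fit <- hclust(d)                      # hierarchical clustering (complete linkage by default)
plot(fit)                             # dendrogram, labelled with the row names (state names)
rect.hclust(fit, 4)                   # rectangles identifying four clusters
USArrests$cluster <- cutree(fit, 4)   # save the cluster numbers as a new variable
table(USArrests$cluster)              # number of observations in each cluster
```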

Example: Cluster analysis of europe dataset

Consider the europe dataset, which is available in CSV format here. The data is taken from the CIA World Factbook and gives some information about 28 European countries.

> europe
          Country   Area   GDP Inflation Life.expect Military Pop.growth Unemployment
1         Austria  83871 41600       3.5       79.91     0.80       0.03          4.2
2         Belgium  30528 37800       3.5       79.65     1.30       0.06          7.2
3        Bulgaria 110879 13800       4.2       73.84     2.60      -0.80          9.6
4         Croatia  56594 18000       2.3       75.99     2.39      -0.09         17.7
5  Czech Republic  78867 27100       1.9       77.38     1.15      -0.13          8.5
6         Denmark  43094 37000       2.8       78.78     1.30       0.24          6.1
...
28 United Kingdom 243610 36500       4.5       80.17     2.70       0.55          8.1

To perform the cluster analysis and save the results to an object, use the command:

> euroclust<-hclust(dist(europe[-1]))

To plot the dendrogram, use the command:

> plot(euroclust, labels=europe$Country)

The result is shown below.


To add rectangles identifying five clusters, use the command:

> rect.hclust(euroclust, 5)

The result is shown below.


From the dendrogram, we can see that the cluster analysis has placed Ukraine in its own group; Spain and Sweden in the second group; the UK, Finland, Germany and others in the third group; Bulgaria, Greece, Austria and others in the fourth group; and Luxembourg, Estonia, Slovakia and others in the fifth group.

Performing a one-sample t-test in R

One-sample t-test

A t-test is used to test hypotheses about the mean value of a population from which a sample is drawn. A t-test is suitable if the data is believed to be drawn from a normal distribution, or if the sample size is large.

A one-sample t-test is used to compare the mean value of a sample with a constant value denoted μ0. The test has the null hypothesis that the population mean is equal to μ0 and the alternative hypothesis that it is not equal to μ0.

The test can also be performed with a one-sided alternative hypothesis, which is known as a one-tailed test. The one-sided alternative hypothesis is either that the population mean is less than μ0 or that the population mean is greater than μ0.

You can perform a one-sample t-test with the t.test function. To compare a sample mean with a constant value mu0, use the command:

> t.test(dataset$sample1, mu=mu0)

The mu argument gives the value with which you want to compare the sample mean. It is optional and has a default value of zero.

By default, R performs a two-tailed test. To perform a one-tailed test, set the alternative argument to "greater" or "less", as shown below.

> t.test(dataset$sample1, mu=mu0, alternative="greater")

A 95% confidence interval for the population mean is included with the output. To adjust the size of the interval, use the conf.level argument.

> t.test(dataset$sample1, mu=mu0, conf.level=0.99)
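
As a quick illustrative sketch (using simulated data rather than a real dataset), the following performs a one-tailed, one-sample t-test and extracts individual parts of the result:

```r
# Illustrative sketch with simulated data (not a real dataset)
set.seed(42)
x <- rnorm(30, mean = 98, sd = 5)   # a sample of 30 values
result <- t.test(x, mu = 100, alternative = "less", conf.level = 0.99)
result$p.value      # the p-value of the test
result$conf.int     # the one-sided 99% confidence interval
result$estimate     # the sample mean
```

The t.test function returns an object of class "htest", so individual components of the output can be accessed with the $ notation as shown.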

Example 10.1. One-tailed, one-sample t-test using the bottles data

A bottle filling machine is set to fill bottles with soft drink to a volume of 500 ml. The actual volume is known to follow a normal distribution. The manufacturer believes the machine is under-filling bottles. A sample of 20 bottles is taken and the volume of liquid inside is measured. The results are given in the bottles dataset, which is available here.

> bottles
   Volume
1  484.11
2  459.49
3  471.38
4  512.01
5  494.48
6  528.63
7  493.64
8  485.03
9  473.88
10 501.59
11 502.85
12 538.08
13 465.68
14 495.03
15 475.32
16 529.41
17 518.13
18 464.32
19 449.08
20 489.27

To calculate the sample mean, use the command:

> mean(bottles$Volume)
[1] 491.5705

Suppose you want to use a one-sample t-test to determine whether the bottles are being consistently under-filled, or whether the low mean volume for the sample is purely the result of random variation. A one-sided test is suitable because the manufacturer is specifically interested in knowing whether the volume is less than 500 ml. The test has the null hypothesis that the mean filling volume is equal to 500 ml, and the alternative hypothesis that the mean filling volume is less than 500 ml. A significance level of 0.01 is to be used.

To perform the test, use the command:

> t.test(bottles$Volume, mu=500, alternative="less", conf.level=0.99)

This gives the following output:

One Sample t-test
data: bottles$Volume
t = -1.5205, df = 19, p-value = 0.07243
alternative hypothesis: true mean is less than 500
99 percent confidence interval:
     -Inf 505.6495
sample estimates:
mean of x
 491.5705

From the output, we can see that the mean bottle volume for the sample is 491.6 ml. The one-sided 99% confidence interval tells us that the mean filling volume is likely to be less than 505.6 ml. The p-value of 0.07243 tells us that if the mean filling volume of the machine were 500 ml, the probability of selecting a sample with a mean volume less than or equal to this one would be approximately 7%.

Since the p-value is not less than the significance level of 0.01, we cannot reject the null hypothesis that the mean filling volume is equal to 500 ml. This means there is insufficient evidence to conclude that the bottles are being under-filled.

Creating a scree plot in R

Scree plot

A scree plot displays the proportion of the total variation in a dataset that is explained by each of the components in a principal component analysis. It helps you to identify how many of the components are needed to summarise the data.

To create a scree plot of the components, use the screeplot function.

> screeplot(modelname)

where modelname is the name of a previously saved principal component analysis, created with the princomp function as explained in the article Performing a principal component analysis in R.
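
For instance, a quick sketch with the built-in USArrests dataset (any saved princomp object will do):

```r
# Illustrative sketch using the built-in USArrests dataset
pca <- princomp(USArrests)
screeplot(pca)                   # barplot of the component variances (the default)
screeplot(pca, type = "lines")   # alternative line style, useful for spotting the 'elbow'
```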

Example: Scree plot for the iris dataset

In the article Performing a principal component analysis in R we performed a principal component analysis for the iris dataset, and saved it to an object named irispca.

To create a scree plot of the components, use the command:

> screeplot(irispca)

The result is shown below.

Scree plot

From the scree plot we can see that the amount of variation explained drops dramatically after the first component. This suggests that just one component may be sufficient to summarise the data.

Performing a principal component analysis in R

Principal component analysis

A principal component analysis (or PCA) is a way of simplifying a complex multivariate dataset. It helps to expose the underlying sources of variation in the data.

You can perform a principal component analysis with the princomp function as shown below.

> princomp(dataset)

The dataset should contain numeric variables only. If there are any non-numeric variables in your dataset, you must exclude them with bracket notation or with the subset function.

The princomp output displays the standard deviations of the components. However, there are further elements of the output that are not automatically displayed, including the loadings and scores. You can save all of this output to an object, as shown below.

> modelname<-princomp(dataset)

Once you have saved the output to an object, you can use further functions to view the various elements of the output. For example, you can use the summary function to view the proportion of the total variance explained by each component:

> summary(modelname)

To view the loadings for each component, use the command:

> modelname$loadings

Similarly you can view the scores for each of the observations as shown:

> modelname$scores
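
As a sketch of how these elements fit together, the scores can be recomputed by multiplying the centred data by the loadings (shown here with the built-in USArrests dataset):

```r
# Illustrative sketch: the scores are the centred data multiplied by the loadings
pca <- princomp(USArrests)                                 # built-in dataset, all numeric
centred <- scale(USArrests, center = TRUE, scale = FALSE)  # subtract the column means
recomputed <- centred %*% pca$loadings
max(abs(recomputed - pca$scores))                          # effectively zero (rounding error only)
```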

To create a scree plot, please see the article Creating a scree plot in R.

Example: Principal component analysis using the iris data

Consider the iris dataset (included with R) which gives the petal width, petal length, sepal width, sepal length and species for 150 irises. To view more information about the dataset, enter help(iris).

You can view the dataset by entering the dataset name:

> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
...
150          5.9         3.0          5.1         1.8  virginica

The dataset contains a factor variable (Species) which must be excluded when performing the PCA. So to perform the analysis and save the results to an object, use the command:

> irispca<-princomp(iris[-5])

To view the proportion of the total variance explained by each component, use the command:

> summary(irispca)
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000

From the output we can see that 92.4% of the variation in the dataset is explained by the first component alone, and 97.8% is explained by the first two components.
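
The proportions in this table can be recomputed from the component standard deviations, which may help clarify where the figures come from:

```r
# Recomputing the summary table from the component standard deviations
irispca <- princomp(iris[-5])   # as created earlier in this example
v <- irispca$sdev^2             # variance of each component
v / sum(v)                      # proportion of variance
cumsum(v / sum(v))              # cumulative proportion
```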

To view the loadings for the components, use the command:

> irispca$loadings
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.361 -0.657  0.582  0.315
Sepal.Width         -0.730 -0.598 -0.320
Petal.Length  0.857  0.173        -0.480
Petal.Width   0.358        -0.546  0.754

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

To view the scores for each observation, use the command:

> irispca$scores
             Comp.1       Comp.2       Comp.3        Comp.4
  [1,] -2.684125626 -0.319397247  0.027914828  0.0022624371
  [2,] -2.714141687  0.177001225  0.210464272  0.0990265503
  [3,] -2.888990569  0.144949426 -0.017900256  0.0199683897
  [4,] -2.745342856  0.318298979 -0.031559374 -0.0755758166
  [5,] -2.728716537 -0.326754513 -0.090079241 -0.0612585926
  [6,] -2.280859633 -0.741330449 -0.168677658 -0.0242008576
  [7,] -2.820537751  0.089461385 -0.257892158 -0.0481431065
  [8,] -2.626144973 -0.163384960  0.021879318 -0.0452978706
  [9,] -2.886382732  0.578311754 -0.020759570 -0.0267447358
 [10,] -2.672755798  0.113774246  0.197632725 -0.0562954013
   ...
[150,]  1.390188862  0.282660938 -0.362909648 -0.1550386282

This example is continued in the article Creating a scree plot in R.

Creating an interaction plot in R

Interaction plot

An interaction plot is a visual representation of the interaction between the effects of two factors, or between a factor and a numeric variable. It is suitable for experimental data.

You can create an interaction plot with the interaction.plot function. The command takes the general form:

> interaction.plot(dataset$var1, dataset$var2, dataset$response)

where var1 and var2 are the names of the explanatory variables and response is the name of the response variable.

If one of the explanatory variables is numeric and the other is a factor, list the numeric variable first and the factor second. This way the numeric variable is displayed along the x-axis and the factor is represented by separate lines on the plot.
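
A minimal sketch with made-up data (three doses, two groups) shows the general form:

```r
# Illustrative sketch with made-up experimental data (not the ToothGrowth data)
dose  <- rep(c(0.5, 1, 2), times = 2)          # numeric explanatory variable
group <- factor(rep(c("A", "B"), each = 3))    # factor explanatory variable
resp  <- c(10, 14, 18, 12, 12, 13)             # response variable
interaction.plot(dose, group, resp)            # dose on the x-axis, one line per group
```

Crossing or non-parallel lines on the resulting plot suggest an interaction between the two explanatory variables.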

Example: Interaction plot with ToothGrowth data

Consider the ToothGrowth dataset, which is included with R. The dataset gives the results of an experiment to determine the effect of two supplements (Vitamin C and Orange Juice), each at three different doses (0.5, 1 or 2 mg) on tooth length in guinea pigs. The len variable gives the tooth growth, the supp variable gives the supplement type and the dose variable gives the supplement dose. You can view more information about the ToothGrowth dataset by entering help(ToothGrowth).

To create an interaction plot illustrating the interaction between supplement type and supplement dose, use the command:

> interaction.plot(ToothGrowth$dose, ToothGrowth$supp, ToothGrowth$len)

The results are shown below.

Interaction plot
Interaction plot for the ToothGrowth data
