Building a Poisson regression model in R

Poisson regression

A Poisson regression model allows you to model the relationship between a Poisson distributed response variable and one or more explanatory variables. It is suitable for modelling the number of events that occur in a given time period or area.

The explanatory variables can be either numeric or categorical. You can also include interaction terms in the model.

You can build a Poisson regression model with the glm function. For example, to build a model with a response variable named counts and three explanatory variables named var1, var2 and var3, use the command:

> glm(counts~var1+var2+var3, dataset, family=poisson)

Setting the family argument to poisson tells R to treat the response variable as Poisson distributed and build a Poisson regression model using the log link function.

The glm output displays some basic information about the model including the coefficient estimates. However there are more components to the output that are not automatically displayed, such as summary statistics, residuals and fitted values. You can save the all of this output to an object, as shown below.

> modelname<-glm(counts~var1+var2+var3, dataset, family=poisson)

Once you have created a glm object, you can access the various components of the results in the same way that you would for any other R model output object, using functions such as summary, anova, coef and residuals. For example, to view a summary of the model, use the command:

> summary(modelname)

Other useful functions include

Example: Poisson regression using warpbreaks data

Consider the warpbreaks dataset, which is included with R. The dataset gives the results of an experiment to determine the effect of wool type (A or B) and tension (low, medium or high) on the number of warp breaks per loom. Data was collected for nine looms for each combination of settings. To view more information about the dataset, enter help(warpbreaks). To view the dataset, enter the dataset name:

> warpbreaks
   breaks wool tension
1      26    A       L
2      30    A       L
3      54    A       L
4      25    A       L
5      70    A       L
6      52    A       L
7      51    A       L
8      26    A       L
9      67    A       L
10     18    A       M
11     21    A       M
12     29    A       M
13     17    A       M
14     12    A       M
15     ...

54     28    B       H

Suppose that you want to build a model to see how the frequency of warp breaks is effected by wool type, wool tension and the interaction between the two.

As the response variable breaks is a count, it is best modelled as a Poisson distributed variable. The model should have terms for the wool type, wool tension and the interaction between the two.

To build the model, use the command:

> breaksmodel<-glm(breaks~wool*tension, warpbreaks, family=poisson)

Once you have created the model object, use the summary function to view the results:

> summary(breaksmodel)

The output is shown below.

Call:
glm(formula = breaks ~ wool * tension, family = poisson, data = warpbreaks)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3383  -1.4844  -0.1291   1.1725   3.5153  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     3.79674    0.04994  76.030  < 2e-16 ***
woolB          -0.45663    0.08019  -5.694 1.24e-08 ***
tensionM       -0.61868    0.08440  -7.330 2.30e-13 ***
tensionH       -0.59580    0.08378  -7.112 1.15e-12 ***
woolB:tensionM  0.63818    0.12215   5.224 1.75e-07 ***
woolB:tensionH  0.18836    0.12990   1.450    0.147    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 297.37  on 53  degrees of freedom
Residual deviance: 182.31  on 48  degrees of freedom
AIC: 468.97

Number of Fisher Scoring iterations: 4

From the results we can see that wool, tension and their interaction all have a significant effect on the number of warp breaks.

Performing Bartlett’s test in R

Bartlett’s test

Bartlett’s test allows you to compare the variance of two or more samples to determine whether they are drawn from populations with equal variance. It is suitable for normally distributed data. The test has the null hypothesis that the variances are equal and the alterntive hypothesis that they are not equal.1

This test is useful for checking the assumptions of an analysis of variance.

You can perform Bartlett’s test with the bartlett.test function. If your data is in stacked form (with the values for both samples stored in one variable), use the command:

> bartlett.test(values~groups, dataset)

where values is the name of the variable containing the data values and groups is the name of the variable that specifies which sample each value belongs too.

If your data is in unstacked form (with the samples stored in separate variables) nest the variable names inside the list function as shown below.

> bartlett.test(list(dataset$sample1, dataset$sample2, dataset$sample3))

If you are unsure whether your data is in stacked or unstacked form, see the article Stacking a dataset in R for examples of data in both forms.

Example 10.10. Bartlett’s test using the PlantGrowth data

Consider the PlantGrowth dataset (included with R), which gives the dried weight of three groups of ten batches of plants, where each group of ten batches received a different treatment. The weight variable gives the weight of the batch and the groups variable gives the treatment received (either ctrl, trt1 or trt2). To view more information about the dataset, enter help(PlantGrowth). To view the data, enter the dataset name:

> PlantGrowth
   weight group
1    4.17  ctrl
2    5.58  ctrl
3    5.18  ctrl
4    6.11  ctrl
5    4.50  ctrl
6    4.61  ctrl
7    5.17  ctrl
8    4.53  ctrl
9    5.33  ctrl
10   5.14  ctrl
11   4.81  trt1
12   4.17  trt1
13   ...

30   5.26  trt2

Suppose you want to use Bartlett’s test to determine whether the the variance in weight is the same for all treatment groups. A significance level of 0.05 will be used.

To perform the test, use the command:

> bartlett.test(weight~group, PlantGrowth)

This gives the output:

        Bartlett test of homogeneity of variances

data:  weight by group 
Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371

From the output we can see that the p-value of 0.2371 is not less than the significance level of 0.05. This means we cannot reject the null hypothesis that the variance is the same for all treatment groups. This means that there is no evidence to suggest that the variance in plant growth is different for the three treatment groups.

[1] Montgomery, D.C. and Runger, G.C., 2007. Applied Statistics and Probability for Engineers, 4th ed. John Wiley & Sons.

Creating a 3D scatter plot in R

3D scatter plot

A 3D scatter plot is similar to an ordinary scatter plot, but for three continuous variables. It allows you to examine the relationship between them.

R has a special function for creating three dimensional scatter plots, called scatterplot3d. This function is not part of the base R installation, but part of an add-on package written by Uwe Ligges which is also called scatterplot3d. You must install and load the package before you can use the scatterplot3d function. You can learn how to do this in the article Installing and loading add-on packages in R

Once you have loaded the package, you can create a 3D scatter plot with the command:

> scatterplot3d(dataset$xvar, dataset$yvar, dataset$zvar)

where xvar, yvar and zvar are the variables that you want to display on the x, y and z axes.

Example: 3D scatter plot using the trees data

Consider the trees dataset (included with R), which gives the girth, height and volume of 31 trees. You can view more information about the dataset by entering help(trees). To view the dataset, enter the dataset name:

> trees
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
...
31  20.6     87   77.0

To create a 3D scatter plot of Volume against Girth and Height, use the command:

> scatterplot3d(trees$Girth, trees$Height, trees$Volume)

The result is shown below.

3D scatter plot

 

Installing and loading add-on packages in R

Over three thousand add-on packages are available for R, which serve a wide variety of purposes. An add-on package contains additional functions and sometimes objects such as example datasets. This article explains how you can find a package that serves your purpose and install it.

Viewing a list of available add-on packages

To view a list of available add-on packages, follow the instructions below:

  1. Go to the R project website at www.r-project.org.
  2. Follow the link to ‘CRAN’ (on the left-hand side).
  3. You will be taken to a list of sites that host the R installation files (mirror sites). Select a site close to your location.
  4. Select ‘Packages’ from the menu on the left-hand side.
  5. Select ‘Table of available packages, sorted by name’.

A list of packages with descriptions of their purpose is displayed. You can use the browser tools to search the list, usually by entering Ctrl+F or Cmd+F.

On selecting a suitable package, a manual is available in pdf format. You will notice that the package is available to download, but you do not need to do this as it is simpler to install the package from within the R environment.

Installing and loading add-on packages

To use an add-on package you must first install it, which only needs to be done once. There are a number of packages that are included with the R base installation, which do not need to be installed.

Once a package is installed, it must be loaded before you can use the functions within. The functions will be available for the remainder of the session, so you will need to load the package during each session that you intend to use it.

You can install and load packages from within the R environment, which is explained separately for Windows, Mac and Linux users.

Windows users

To install a package:

  1. Select ‘Install package(s)’ from the ‘Package’ menu.
  2. The first time you install a package, you will be prompted to select a mirror site. Select a site close to your location.
  3. When prompted, selected the required package from the list.

To load a package:

  1. Select ‘Load package’ from the ‘Packages’ menu.
  2. When prompted, select the required package from the list.

Mac users

To install a package:

  1. Select ‘R Package Installer’ from the ‘Packages & Data’ menu.
  2. Press ‘Get List’
  3. A list of packages is displayed. Select the required package and press ‘Install Selected’.
  4. Close the window.

To load a package:

  1. Select ‘R Package Manager’ from the ‘Packages & Data’ menu.
  2. Tick the status box next to the required package so that the status changes to ‘loaded’.

Linux users

To install a package:

  1. Enter the command:
    download.packages(packagename, "/home/Username/folder")

    The file path gives the location in which to save the package.

  2. When prompted, select a mirror site close to your location.

To load a package:

  1. Enter the command:
    install.packages("packagename")

Exporting a dataset from R

R allows you to export datasets from the R workspace to the CSV and tab-delimited file formats.

To export a dataset named dataset to a CSV file, use the write.csv function.

> write.csv(dataset, "filename.csv")

For example, to export the Puromycin dataset (included with R) to a file names puromycin_data.csv, use the command:

> write.csv(Puromycin, "puromycin_data.csv")

This command creates the file and saves it to your working directory, which by default is your ‘My Documents’ folder (for Windows users) or your home folder (for Mac and Linux users). To save the file somewhere other than in the working directory, enter the full path for the file as shown.

> write.csv(dataset, "C:/folder/filename.csv")

If a file with your chosen name already exists in the specified location, R overwrites the original file without giving a warning. You should check the files in the destination folder beforehand to make sure you are not overwriting anything important.

The write.table function allows you to export data to a wider range of file formats, including tab-delimited files. Use the sep argument to specify which character should be used to separate the values. To export a dataset to a tab-delimited file, set the sep argument to "\t" (which denotes the tab symbol), as shown below.

> write.table(dataset, "filename.txt, sep="\t")

By default, the write.csv and write.table functions create an extra column in the file containing the observation numbers. To prevent this, set the row.names argument to F.

> write.csv(dataset, "filename.csv, row.names=F)

With the read.table function, you can also prevent variable names from being placed in the first line of the file with the col.names argument.

> write.table(dataset, "filename.txt, sep="\t", col.names=F)

Social Widgets powered by AB-WebLog.com.