STATISTICS with R
A comprehensive guide to statistical analysis in R
One-way ANOVA in R
The one-way analysis of variance, or one-way ANOVA, is a statistical method that compares the mean values of several groups on a continuous dependent variable to determine whether any difference in means across the groups is statistically significant. One-way ANOVA is usually used when three or more groups are compared. However, it can also be used instead of the independent samples t-test to compare the means of two groups. A one-way ANOVA is usually followed by pairwise group comparisons, technically called post hoc tests.
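The equivalence with the independent samples t-test for two groups can be checked numerically. The following is a minimal sketch on simulated data (the data frame `twoGroups` and its values are made up for illustration and are not part of this guide's data set). It assumes equal variances in the t-test, which is the assumption the classical one-way ANOVA also makes.

```r
# Simulated data for illustration only (not the data set used in this guide)
set.seed(1)
twoGroups <- data.frame(
  score = c(rnorm(20, mean = 50, sd = 5), rnorm(20, mean = 55, sd = 5)),
  group = factor(rep(c("A", "B"), each = 20))
)

# Independent samples t-test assuming equal variances
tResult <- t.test(score ~ group, data = twoGroups, var.equal = TRUE)

# One-way ANOVA on the same data; [[1]] extracts the ANOVA table
aovTable <- summary(aov(score ~ group, data = twoGroups))[[1]]

tStat  <- unname(tResult$statistic)
fStat  <- aovTable[["F value"]][1]
pAnova <- aovTable[["Pr(>F)"]][1]

# With two groups, F equals t squared and the two p values agree
c(t_squared = tStat^2, F = fStat)
c(p_ttest = tResult$p.value, p_anova = pAnova)
```

The two printed pairs should match up to floating-point precision, which is why either test can be used when there are only two groups.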
Introduction to One-way ANOVA
There are situations where researchers are interested in comparing the mean values across three or more groups. For example, a researcher may be interested in investigating the effect of different education levels (high school and lower, bachelor’s degree, graduate degree) on income. In this example, education level has three levels / groups and is the independent variable or the factor and income is the dependent variable, which is continuous.
When using ANOVA, the independent variable is also called a factor. A factor is a categorical variable with two or more categories or levels. For example, educational attainment is a factor with three levels: high school and lower, bachelor’s degree, and graduate degree. Because the independent variable is called a factor in ANOVA, experimental designs analyzed with ANOVA may be referred to as factorial designs. If the research design has only one factor, the analysis is called a one-way ANOVA, which is the traditional name for one-factor ANOVA. If the research design includes two factors, it is called a two-way ANOVA, and so on. The term factorial ANOVA is sometimes used for designs with two or more factors.
In an ANOVA test, the primary purpose is to see whether the factor, taken as the aggregate of its categories, has a significant effect on the dependent variable. If the effect of the factor on the dependent variable is statistically significant, we then want to know which categories differ significantly from each other. If the factor overall is not statistically significant, we stop there and do not compare its constituent categories.
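In R, a categorical independent variable is typically stored as a factor. As a small sketch, the education example could be represented as follows (the five observations are invented for illustration, and the variable names are hypothetical):

```r
# Education levels from the example above; the five values are made up
education <- c("Bachelor's degree", "High school and lower", "Graduate degree",
               "Bachelor's degree", "High school and lower")

# Convert to a factor and state the order of the levels explicitly
educationFactor <- factor(
  education,
  levels = c("High school and lower", "Bachelor's degree", "Graduate degree")
)

levels(educationFactor)   # the three categories, in the order given
nlevels(educationFactor)  # number of levels
table(educationFactor)    # number of observations per level
```

If `levels` is not supplied, `factor()` sorts the levels alphabetically, which affects the order in which groups appear in later output.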
In the following sections, we present an example research scenario where a one-way ANOVA will be used to compare mean values among several groups. We will demonstrate how to perform a one-way ANOVA in R step-by-step and how to interpret the ANOVA results.
One-way ANOVA Example
Does the type of physical therapy affect the recovery time of patients after knee surgery?

A health researcher conducts a study to investigate whether different types of physical therapy affect the recovery time of patients after undergoing knee surgery. The study includes three groups of patients:
- Group 1: Patients receiving no physical therapy (Control)
- Group 2: Patients receiving Standard physical therapy
- Group 3: Patients receiving Advanced physical therapy
After the surgery, the researcher records the number of days it takes for each patient to fully recover. Patients have been recruited at random from different sites and randomly assigned to one of the physical therapy groups. Recovery is recorded when the physical therapist certifies full recovery and the patient feels comfortable walking.
In this study design, there is one factor (physical therapy method) with three categories or groups (Control, Standard, and Advanced), and the measure (recovery time) is continuous. Therefore, a one-way ANOVA is appropriate to address the research question. If the ANOVA results are significant, post hoc tests can be conducted to identify which specific groups differ from each other. Table 1 shows an excerpt of the recovery times, in days, of knee surgery patients in the three physical therapy treatments.
| Patient | Group | Recovery Time |
|---|---|---|
| Patient 1 | Control | 34 |
| Patient 2 | Control | 36 |
| Patient 3 | Standard | 24 |
| Patient 4 | Advanced | 28 |
| Patient 5 | Standard | 23 |
| … | … | … |
The researcher is interested in knowing whether the physical therapy method has an effect on the recovery time of the knee surgery patients and, if so, which therapy method is more effective than the others. The researcher enters the data in a spreadsheet in the hospital computer lab. The data for this example can be downloaded in CSV format.
Analysis: One-way ANOVA in R
In the first step, we prepare the data in a spreadsheet and save it as a CSV file. The data must record each participant's group membership in a single column (long format data structure). So, we create three columns (variables) in the spreadsheet: the Patient, the physical therapy group the patient is assigned to (Control, Standard, or Advanced; named Method in the R code), and the recovery time (named RecovTime in the R code).
Once the variables are created, we can read the data file (saved as CSV) into the RStudio environment. First, we produce some descriptive statistics, such as the mean recovery time and the standard deviation for each group. Table 2 shows the mean and the standard deviation of recovery time for the different treatment methods.
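As an illustration of the long format, the five rows visible in Table 1 can be typed into a data frame directly. This is only a sketch: the column names `Method` and `RecovTime` are taken from the R code used later in this guide, the name `dfExcerpt` is hypothetical, and the remaining patients are not included.

```r
# Rows copied from Table 1; the remaining patients are omitted here
dfExcerpt <- data.frame(
  Patient   = paste("Patient", 1:5),
  Method    = c("Control", "Control", "Standard", "Advanced", "Standard"),
  RecovTime = c(34, 36, 24, 28, 23)
)

# One row per patient, with group membership in a single column
str(dfExcerpt)

# Write to a temporary CSV file and read it back, as one would with the full file
csvPath <- tempfile(fileext = ".csv")
write.csv(dfExcerpt, csvPath, row.names = FALSE)
dfReadBack <- read.csv(csvPath)
dfReadBack
```

Each patient occupies exactly one row, so adding more patients only adds rows and never new columns.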
| Group | Mean | SD |
|---|---|---|
| Control | 34.60 | 2.160 |
| Standard | 24.72 | 2.112 |
| Advanced | 16.48 | 2.023 |
In Table 2 we can see that the mean recovery times for the Control, Standard, and Advanced groups are 34.60 days, 24.72 days, and 16.48 days, respectively. The differences between the group means appear substantial relative to the standard deviations of about 2 days. In particular, the shortest recovery time occurs in the Advanced physical therapy group. The next step is to perform a one-way ANOVA to determine whether the differences in means are statistically significant.
To perform a one-way ANOVA on the data, we use the base R aov() function, which stands for analysis of variance. Assuming the data is in the long format, we can use the formula notation in our code, using the format y ~ x (read as “y as a function of x”, or y predicted by x). This is in line with the mathematical notation where y is the dependent variable (in our case the recovery time), and x is the independent variable (the physical therapy method). The following code in Listing 1 shows the formula approach to perform a one-way ANOVA in R, assuming that the assumptions of the test are met: independence of observations, normality, and equal (constant) variance across the groups.
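Group-wise descriptive statistics of the kind shown in Table 2 can be produced with base R. The sketch below uses simulated data as a stand-in for the CSV file, so its numbers will not reproduce Table 2 exactly. The data frame `dfSim` is hypothetical; the group size of 25 is inferred from the 72 residual degrees of freedom in the ANOVA output and is an assumption.

```r
# Simulated stand-in for dfPhysicalTherapy.csv: 25 patients per group, with
# means and SDs loosely based on Table 2. These are NOT the study data.
set.seed(42)
dfSim <- data.frame(
  Method    = factor(rep(c("Control", "Standard", "Advanced"), each = 25)),
  RecovTime = c(rnorm(25, 34.6, 2.2), rnorm(25, 24.7, 2.1), rnorm(25, 16.5, 2.0))
)

# Mean, standard deviation, and group size per therapy method
groupMeans <- tapply(dfSim$RecovTime, dfSim$Method, mean)
groupSDs   <- tapply(dfSim$RecovTime, dfSim$Method, sd)
groupN     <- tapply(dfSim$RecovTime, dfSim$Method, length)

data.frame(
  Method = names(groupMeans),
  Mean   = round(as.vector(groupMeans), 2),
  SD     = round(as.vector(groupSDs), 3),
  N      = as.vector(groupN)
)
```

With the real data, replacing `dfSim` by the data frame read from the CSV file should give the values in Table 2.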
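```r
# Read the data; since R 4.0, read.csv() returns text columns as character,
# so Method is converted to a factor (TukeyHSD() later requires a factor)
dfPhysTherapy <- read.csv("dfPhysicalTherapy.csv")
dfPhysTherapy$Method <- factor(dfPhysTherapy$Method)

# Fit the one-way ANOVA and print the ANOVA table
modelResults <- aov(RecovTime ~ Method, data = dfPhysTherapy)
summary(modelResults)
```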
```
            Df Sum Sq Mean Sq F value Pr(>F)    
Method       2   4115  2057.7   466.9 <2e-16 ***
Residuals   72    317     4.4                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

In the output of the one-way ANOVA, the factor Method is listed in the first row of the table. The test statistic is compared against the F distribution with 2 and 72 degrees of freedom. The F value given in the table is 466.9, which is statistically significant (R reports the p value as < 2e-16, so p < 0.001). This implies that, overall, the Method factor (physical therapy method) has a significant effect on the dependent variable (recovery time).
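The comparison against the F distribution can be reproduced by hand from the numbers printed in the ANOVA table. This is a small sketch using base R's `pf()` and `qf()`; the variable names are illustrative.

```r
# Values copied from the ANOVA table above
fValue      <- 466.9
dfMethod    <- 2
dfResiduals <- 72

# Upper-tail probability of the F distribution: the p value of the test
pValue <- pf(fValue, df1 = dfMethod, df2 = dfResiduals, lower.tail = FALSE)
pValue

# Critical F value at the 0.05 significance level for these degrees of freedom
fCritical <- qf(0.95, df1 = dfMethod, df2 = dfResiduals)
fCritical

# The F statistic is the ratio of the two mean squares. Using the rounded
# table values gives a number close to, but not exactly, 466.9.
2057.7 / 4.4
```

The observed F value is far above the critical value, which is why the p value is so small that R prints it only as `<2e-16`.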
The next step is to find out which physical therapy groups differ from each other and thereby make the overall model statistically significant. We can use a post hoc test to answer this question. A post hoc test is used after an overall (or omnibus) model is found to be statistically significant.
One common post hoc test for ANOVA is the Tukey HSD test. The Tukey HSD test compares groups in pairs and tells us whether each pairwise comparison is significant. For example, Tukey HSD compares the Control group with the Standard group, the Control group with the Advanced group, and the Standard group with the Advanced group. We can use the built-in TukeyHSD() function from base R to run this test. This function requires the ANOVA model object as input, which we created previously and named modelResults. Listing 2 includes the R code to run the Tukey HSD post hoc test.
```r
TukeyHSD(modelResults)
```
```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = RecovTime ~ Method, data = dfPhysTherapy)

$Method
                   diff        lwr       upr p adj
Control-Advanced  18.12  16.699094 19.540906     0
Standard-Advanced  8.24   6.819094  9.660906     0
Standard-Control  -9.88 -11.300906 -8.459094     0
```

The Tukey HSD test has made three pairwise comparisons: Control-Advanced, Standard-Advanced, and Standard-Control. The column diff shows the difference between the mean recovery times of the two groups being compared (the first group minus the second). We want to know whether these differences are statistically significant. We can either look at the 95% confidence intervals (the lwr and upr columns), which indicate a significant difference when they do not include zero, or at the adjusted p value in the last column (p adj), which indicates a significant difference when it is below 0.05. An adjusted p value displayed as 0 has been rounded and means the value is very small.
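The two ways of reading the Tukey table (confidence interval versus adjusted p value) can also be checked programmatically. The sketch below again uses simulated stand-in data, so the numbers differ from Listing 2; `dfSim`, `fitSim`, and `tukeySim` are hypothetical names, and the group size of 25 is an assumption inferred from the residual degrees of freedom.

```r
# Simulated stand-in data (NOT the study data): 25 patients per group
set.seed(42)
dfSim <- data.frame(
  Method    = factor(rep(c("Control", "Standard", "Advanced"), each = 25)),
  RecovTime = c(rnorm(25, 34.6, 2.2), rnorm(25, 24.7, 2.1), rnorm(25, 16.5, 2.0))
)

fitSim   <- aov(RecovTime ~ Method, data = dfSim)
tukeySim <- TukeyHSD(fitSim, conf.level = 0.95)

# The comparisons for a factor are stored as a matrix named after that factor
tukeyMatrix <- tukeySim$Method

# An interval that excludes zero corresponds to an adjusted p value below 0.05
excludesZero <- tukeyMatrix[, "lwr"] > 0 | tukeyMatrix[, "upr"] < 0
significant  <- tukeyMatrix[, "p adj"] < 0.05

data.frame(diff = tukeyMatrix[, "diff"], excludesZero, significant)
```

Both criteria flag the same comparisons, so it does not matter which one is used to decide significance at the 95% family-wise confidence level.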
According to the multiple comparison table above, all three pairwise differences are statistically significant: Control versus Standard (difference in means = 9.88 days), Control versus Advanced (difference in means = 18.12 days), and Standard versus Advanced (difference in means = 8.24 days). Every adjusted p value is displayed as 0, and none of the 95% confidence intervals includes zero. This implies that patients in the Advanced physical therapy group had a statistically significantly shorter recovery time than patients in both the Standard group and the Control group. In addition, patients in the Standard group had a statistically significantly shorter recovery time than the Control group, but a longer recovery time than the Advanced group.
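These conclusions rest on the assumptions mentioned earlier (normality and equal variances across groups). One possible way to examine them with base R is sketched below, again on simulated stand-in data rather than the study data. Bartlett's test and the Shapiro-Wilk test are chosen here as examples and are not necessarily the checks the original analysis used; `dfSim` and `fitSim` are hypothetical names.

```r
# Simulated stand-in data (NOT the study data): 25 patients per group
set.seed(42)
dfSim <- data.frame(
  Method    = factor(rep(c("Control", "Standard", "Advanced"), each = 25)),
  RecovTime = c(rnorm(25, 34.6, 2.2), rnorm(25, 24.7, 2.1), rnorm(25, 16.5, 2.0))
)
fitSim <- aov(RecovTime ~ Method, data = dfSim)

# Homogeneity of variance across groups (null hypothesis: equal variances)
bartlettResult <- bartlett.test(RecovTime ~ Method, data = dfSim)

# Normality of the model residuals (null hypothesis: normal distribution)
shapiroResult <- shapiro.test(residuals(fitSim))

# Welch's one-way test, an alternative that does not assume equal variances
welchResult <- oneway.test(RecovTime ~ Method, data = dfSim, var.equal = FALSE)

c(bartlett_p = bartlettResult$p.value,
  shapiro_p  = shapiroResult$p.value,
  welch_p    = welchResult$p.value)
```

A small p value from Bartlett's or the Shapiro-Wilk test would suggest that the corresponding assumption is questionable, in which case the Welch test is one commonly used fallback for the omnibus comparison.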