ANCOVA (analysis of covariance) is a widely used statistical method for comparing the mean values of multiple groups on a continuous outcome while adjusting for the influence of a related continuous variable. ANCOVA extends ANOVA (analysis of variance) by integrating a continuous independent variable into the model to minimize bias and control for potential confounding effects from that variable.
ANCOVA can be applied within both the ANOVA framework and the linear modeling framework (e.g., multiple regression). In a separate module, I presented ANCOVA as an extension of ANOVA, using the ANOVA framework. In this post, I present ANCOVA using the linear modeling approach, applying it to the same data so that the results can be compared with those from the ANOVA approach.
ANCOVA as a Linear Model
At its core, ANCOVA is simply a linear model with both categorical and continuous predictors. Although it is often introduced as an extension of ANOVA, the method is more naturally understood as a special case of multiple regression. In this formulation, the group variable enters the model as a factor (represented through dummy variables), while the covariate enters as a continuous predictor.
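This formulation can be made concrete by inspecting the design matrix that `lm()` builds from a factor and a covariate. The sketch below uses simulated data; the variable names simply mirror the high-jump example used later.

```r
# How a factor enters a linear model: inspect the design matrix
# (simulated data for illustration)
set.seed(1)
df <- data.frame(
  sex       = factor(rep(c("female", "male"), each = 3)),
  height_cm = round(rnorm(6, mean = 172, sd = 6), 1)
)
# lm() internally expands the factor into a 0/1 dummy column ("sexmale")
# alongside the intercept and the continuous covariate
model.matrix(~ sex + height_cm, data = df)
```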
When we fit a model of the form

$$
y_i = \beta_0 + \beta_1\,\text{group}_i + \beta_2\,x_i + \varepsilon_i,
$$

the coefficient $\beta_1$ represents the adjusted group difference—that is, the difference in expected outcomes between groups after controlling for the covariate $x$. This is precisely what ANCOVA aims to estimate. Instead of computing adjusted means through sums of squares (as in classical ANOVA), the linear model computes the same adjustment through regression coefficients.
Viewing ANCOVA through the linear modeling lens also clarifies its assumptions. The requirement of homogeneity of regression slopes—that the covariate–outcome relationship is the same across groups—corresponds to the absence of a significant interaction term between the factor and the covariate. If such an interaction exists, the model is no longer an ANCOVA but a more general regression model with group‑specific slopes. Likewise, the assumption of linearity is simply the standard linear model requirement that the covariate relates linearly to the outcome. These assumptions are not unique to ANCOVA; they are inherited directly from the linear model framework.
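The homogeneity-of-slopes assumption can be checked directly by comparing the common-slope model with one that allows group-specific slopes. The following is a minimal sketch on simulated data (the variable names are illustrative):

```r
# Testing homogeneity of regression slopes via a model comparison
# (simulated data for illustration)
set.seed(42)
sex       <- factor(rep(c("female", "male"), each = 20))
height_cm <- rnorm(40, mean = ifelse(sex == "male", 178, 166), sd = 6)
jump_cm   <- 50 + 0.5 * height_cm + 10 * (sex == "male") + rnorm(40, sd = 5)

fitCommon      <- lm(jump_cm ~ sex + height_cm)  # ANCOVA: one shared slope
fitInteraction <- lm(jump_cm ~ sex * height_cm)  # group-specific slopes
# A non-significant F-test for the interaction term supports the
# homogeneity-of-slopes assumption
anova(fitCommon, fitInteraction)
```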
Finally, the linear model perspective highlights the interpretability and flexibility of ANCOVA. Because the method is embedded in the general regression framework, it naturally extends to multiple covariates, interactions, and diagnostic tools such as residual plots and Q–Q plots.
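For instance, the usual `lm()` diagnostics carry over unchanged, and an additional covariate is just another term in the formula. A sketch on simulated data (all names illustrative):

```r
# Standard regression diagnostics apply to an ANCOVA fit as well
# (simulated data for illustration)
set.seed(3)
group <- factor(rep(c("A", "B"), each = 25))
x1 <- rnorm(50)
x2 <- rnorm(50)  # a second covariate: just another term in the formula
y  <- 1 + 2 * (group == "B") + 0.8 * x1 + 0.3 * x2 + rnorm(50)

fit <- lm(y ~ group + x1 + x2)
par(mfrow = c(1, 2))
plot(fit, which = 1)  # residuals vs fitted: linearity, equal variance
plot(fit, which = 2)  # normal Q-Q plot of residuals
```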
Adjusting the Means for Covariates
Adjusted means follow directly from the fitted linear model. In the ANCOVA model

$$
y_{ij} = a_j + b\,x_{ij} + \varepsilon_{ij},
$$

each group $j$ has its own regression line with intercept $a_j$ but a shared or common slope $b$. To obtain the adjusted mean for group $j$, we evaluate that group's regression line at a common reference value of the covariate—typically the overall mean $\bar{x}$. This yields

$$
\bar{y}_j^{\,\text{adj}} = \hat{a}_j + \hat{b}\,\bar{x},
$$

which is algebraically equivalent to the classical adjustment formula

$$
\bar{y}_j^{\,\text{adj}} = \bar{y}_j - \hat{b}\,(\bar{x}_j - \bar{x}).
$$
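Both routes give the same numbers in R: evaluating each group's regression line at the overall covariate mean via `predict()` matches the classical adjustment formula. A sketch on simulated data (names are illustrative):

```r
# Adjusted means two ways: predict() at the overall covariate mean,
# and the classical adjustment formula (simulated data for illustration)
set.seed(7)
sex       <- factor(rep(c("female", "male"), each = 20))
height_cm <- rnorm(40, mean = ifelse(sex == "male", 178, 166), sd = 6)
jump_cm   <- 50 + 0.5 * height_cm + rnorm(40, sd = 5)
df  <- data.frame(sex, height_cm, jump_cm)
fit <- lm(jump_cm ~ sex + height_cm, data = df)

# Route 1: evaluate each group's regression line at mean(height_cm)
newdat <- data.frame(sex       = factor(levels(df$sex), levels = levels(df$sex)),
                     height_cm = mean(df$height_cm))
adjPredict <- predict(fit, newdata = newdat)

# Route 2: classical formula  ybar_j - b * (xbar_j - xbar)
b <- coef(fit)["height_cm"]
adjClassic <- tapply(df$jump_cm, df$sex, mean) -
  b * (tapply(df$height_cm, df$sex, mean) - mean(df$height_cm))

all.equal(unname(adjPredict), as.vector(adjClassic))  # TRUE
```

The agreement is exact (up to floating point) because, with group dummies in the model, the residuals sum to zero within each group, so $\bar{y}_j = \hat{a}_j + \hat{b}\,\bar{x}_j$.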
ANCOVA in R
In Listing 1, we fit an ANCOVA model using the linear modeling framework. The setup is the same as in multiple regression, except that the covariate serves as an adjustment variable rather than a predictor of primary interest. The research question is whether female and male athletes differ in high‑jump performance after adjusting for height as a covariate. The data can be downloaded from here or read directly using the R code in Listing 1.
# ANCOVA through linear modeling
# Read in data
dfHighJump <- read.csv("https://statisticswithr.com/datasets/df_high_jump_hurdle.csv")
fitAncova <- lm(highjump_cm ~ sex + height_cm, data = dfHighJump)
summary(fitAncova)
# Output 1: lm results for ANCOVA
Call:
lm(formula = highjump_cm ~ sex + height_cm, data = dfHighJump)
Residuals:
Min 1Q Median 3Q Max
-16.7643 -3.9787 -0.3816 3.8552 14.6310
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.4176 30.5088 -1.226 0.227
sexmale 1.3734 3.0474 0.451 0.655
height_cm 0.9604 0.1841 5.216 6.32e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.67 on 39 degrees of freedom
Multiple R-squared: 0.6344, Adjusted R-squared: 0.6157
F-statistic: 33.84 on 2 and 39 DF, p-value: 3.004e-09

The results in Output 1 show that after adjusting for the heights of the athletes (the covariate in the model), no statistically significant difference is found between female and male athletes (sexmale: b = 1.37, p = 0.655).
We can now use the shared slope (the effect of the covariate) to calculate the adjusted means for female and male athletes. Listing 2 shows the R code for calculating the adjusted means using the unadjusted means and the shared slope.
# ANCOVA through linear modeling
library(dplyr)
# Read in data
dfHighJump <- read.csv("https://statisticswithr.com/datasets/df_high_jump_hurdle.csv")
# Descriptive statistics
# Jump per group
dfHighJump %>%
group_by(sex) %>%
summarise(
mean_jump = mean(highjump_cm, na.rm = TRUE),
sd_jump = sd(highjump_cm, na.rm = TRUE),
n = n()
)
# Height per group
dfHighJump %>%
group_by(sex) %>%
summarise(
mean_height = mean(height_cm, na.rm = TRUE),
sd_height = sd(height_cm, na.rm = TRUE),
n = n()
)
# Total height mean
mean_height_total <- mean(dfHighJump$height_cm)
# Output 2: Mean high jump per group
  sex    mean_jump sd_jump     n
  <chr>      <dbl>   <dbl> <int>
1 female      122.    9.10    21
2 male        135.    8.03    21
# Output 3: Mean height per group
  sex    mean_height sd_height     n
  <chr>        <dbl>     <dbl> <int>
1 female        166.      5.91    21
2 male          178.      5.54    21
# Output 4: Mean height total (both groups)
mean_height_total
[1] 171.6119
We can now calculate adjusted means for female athletes:

$$
\bar{y}_{\text{female}}^{\,\text{adj}} = 122 - 0.9604 \times (166 - 171.61) \approx 127.4 \text{ cm}
$$

For male athletes:

$$
\bar{y}_{\text{male}}^{\,\text{adj}} = 135 - 0.9604 \times (178 - 171.61) \approx 128.9 \text{ cm}
$$

The adjusted difference is now much smaller than the raw difference (13 cm → ~1.5 cm), which is exactly what we expect when height strongly predicts high‑jump performance.
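The same arithmetic can be reproduced in R from the values reported in the outputs. Because the group means are rounded in the printed output, the resulting difference differs slightly from the model's sexmale coefficient (1.37):

```r
# Adjusted means from the reported summary numbers
b    <- 0.9604     # shared slope for height_cm from the lm output
xbar <- 171.6119   # overall mean height
adjFemale <- 122 - b * (166 - xbar)   # ~127.4 cm
adjMale   <- 135 - b * (178 - xbar)   # ~128.9 cm
adjMale - adjFemale                   # adjusted difference, ~1.5 cm
```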