How to Perform ANCOVA Using Linear Models (with R Examples)

ANCOVA (analysis of covariance) is a widely used statistical method for comparing the mean values of multiple groups on a continuous outcome while adjusting for the influence of a related continuous variable. ANCOVA extends ANOVA (analysis of variance) by integrating a continuous independent variable into the model to minimize bias and control for potential confounding effects from that variable.

ANCOVA can be carried out both in the ANOVA framework and in the linear modeling (multiple regression) framework. In a separate module, I presented ANCOVA as an extension of ANOVA, using the classical ANOVA framework. In this post, I present ANCOVA as a linear model, applied to the same data so that the results can be compared across the two approaches.

ANCOVA as a Linear Model

At its core, ANCOVA is simply a linear model with both categorical and continuous predictors. Although it is often introduced as an extension of ANOVA, the method is more naturally understood as a special case of multiple regression. In this formulation, the group variable enters the model as a factor (represented through dummy variables), while the covariate enters as a continuous predictor.

When we fit a model of the form

Y_i = \beta_0 + \beta_1 \,\text{Group}_i + \beta_2 \,X_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),

the coefficient β1 represents the adjusted group difference—that is, the difference in expected outcomes between groups after controlling for the covariate X. This is precisely what ANCOVA aims to estimate. Instead of computing adjusted means through sums of squares (as in classical ANOVA), the linear model computes the same adjustment through regression coefficients.
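To see how the group variable enters the model, we can inspect the design matrix R builds from the formula. Below is a minimal sketch using hypothetical toy data (the names group and x are illustrative only):

```r
# How a factor enters the linear model as a 0/1 dummy variable
df <- data.frame(group = factor(c("female", "female", "male", "male")),
                 x     = c(160, 165, 175, 180))

mm <- model.matrix(~ group + x, data = df)
mm
# Columns: (Intercept), groupmale (a 0/1 dummy), and x.
# The coefficient on 'groupmale' is the adjusted group
# difference at any fixed value of x.
```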

Viewing ANCOVA through the linear modeling lens also clarifies its assumptions. The requirement of homogeneity of regression slopes—that the covariate–outcome relationship is the same across groups—corresponds to the absence of a significant interaction term between the factor and the covariate. If such an interaction exists, the model is no longer an ANCOVA but a more general regression model with group‑specific slopes. Likewise, the assumption of linearity is simply the standard linear model requirement that the covariate relates linearly to the outcome. These assumptions are not unique to ANCOVA; they are inherited directly from the linear model framework.
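This check is straightforward in practice: fit the additive ANCOVA model and the model with the factor-by-covariate interaction, then compare them with an F test. A minimal sketch with simulated data (all variable names are hypothetical; with the high-jump data used later, the interaction model would be highjump_cm ~ sex * height_cm):

```r
# Testing homogeneity of regression slopes via model comparison
set.seed(1)
n     <- 40
group <- factor(rep(c("A", "B"), each = n / 2))
x     <- rnorm(n, mean = 170, sd = 6)
y     <- 10 + 0.9 * x + 5 * (group == "B") + rnorm(n, sd = 4)
df    <- data.frame(y, x, group)

fitAdd <- lm(y ~ group + x, data = df)  # common slope (ANCOVA)
fitInt <- lm(y ~ group * x, data = df)  # group-specific slopes

# A non-significant F test for the interaction supports a common slope
anova(fitAdd, fitInt)
```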

Finally, the linear model perspective highlights the interpretability and flexibility of ANCOVA. Because the method is embedded in the general regression framework, it naturally extends to multiple covariates, interactions, and diagnostic tools such as residual plots and Q–Q plots.
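For example, the standard lm diagnostics carry over unchanged. A sketch with simulated data (the variable names are illustrative, not from the high-jump data):

```r
# The usual lm diagnostics apply to an ANCOVA fit as well
set.seed(2)
df   <- data.frame(group = factor(rep(c("A", "B"), each = 25)),
                   x     = rnorm(50, mean = 170, sd = 6))
df$y <- 20 + 0.9 * df$x + 4 * (df$group == "B") + rnorm(50, sd = 5)
fit  <- lm(y ~ group + x, data = df)

par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
```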

Adjusting the Means for Covariates

Adjusted means follow directly from the fitted linear model. In the ANCOVA model

Y_i = \beta_0 + \beta_1 \,\text{Group}_i + \beta_2 \,X_i + \varepsilon_i

each group j has its own regression line with intercept β0 + β1 I(Groupi = j) but a shared (common) slope β2. To obtain the adjusted mean for group j, we evaluate that group’s regression line at a common reference value of the covariate—typically the overall mean of the covariate across all groups, X̄t. This yields

\mu_j^{\text{adj}} = \bigl(\bar{Y}_j - \beta_2\,\bar{X}_j\bigr) + \beta_2\,\bar{X}_t

which is algebraically equivalent to the classical adjustment formula

\mu_j^{\text{adj}} = \bar{Y}_j - \beta_2\bigl(\bar{X}_j - \bar{X}_t\bigr).
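In code, the same adjustment amounts to predicting the outcome for each group at the overall covariate mean. A minimal sketch with simulated data (the names group, x, and y are illustrative):

```r
# Adjusted means as model predictions at the overall covariate mean
set.seed(3)
df   <- data.frame(group = factor(rep(c("A", "B"), each = 20)))
df$x <- c(rnorm(20, 165, 5), rnorm(20, 178, 5))
df$y <- 100 + 1.0 * df$x + 5 * (df$group == "B") + rnorm(40, sd = 4)

fit  <- lm(y ~ group + x, data = df)
newd <- data.frame(group = levels(df$group), x = mean(df$x))
adj  <- predict(fit, newdata = newd)   # one adjusted mean per group
adj
```

Because both predictions use the same covariate value, their difference equals the group coefficient, which is exactly the adjusted difference discussed above.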

ANCOVA in R

In Listing 1, we fit an ANCOVA model using linear modeling. The specification is the same as a multiple regression, except that the covariate is included to adjust the group comparison rather than as a predictor of primary interest. The research question is whether female and male athletes differ in high‑jump performance after adjusting for height as a covariate. The data are read directly from a URL in the R code in Listing 1.

Listing 1: R code to run ANCOVA as a linear model.
# ANCOVA through linear modeling

# Read in data
dfHighJump <- read.csv("https://statisticswithr.com/datasets/df_high_jump_hurdle.csv")

fitAncova <- lm(highjump_cm ~ sex + height_cm, data = dfHighJump)
summary(fitAncova)

# Output 1: lm results for ANCOVA
Call:
lm(formula = highjump_cm ~ sex + height_cm, data = dfHighJump)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.7643  -3.9787  -0.3816   3.8552  14.6310 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -37.4176    30.5088  -1.226    0.227    
sexmale       1.3734     3.0474   0.451    0.655    
height_cm     0.9604     0.1841   5.216 6.32e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.67 on 39 degrees of freedom
Multiple R-squared:  0.6344,	Adjusted R-squared:  0.6157 
F-statistic: 33.84 on 2 and 39 DF,  p-value: 3.004e-09

The output of Listing 1 shows that, after adjusting for the athletes’ heights (the covariate in the model), there is no statistically significant difference between female and male athletes (sexmale: b = 1.37, p = .655).

We can now use the shared slope β2 (the effect of the covariate) to calculate the adjusted means for female and male athletes. Listing 2 shows the R code for calculating the adjusted means using the unadjusted means and the shared slope.

Listing 2: R code to calculate adjusted means.
# ANCOVA through linear modeling

library(dplyr)

# Read in data
dfHighJump <- read.csv("https://statisticswithr.com/datasets/df_high_jump_hurdle.csv")

# Descriptive statistics
# Jump per group
dfHighJump %>%
  group_by(sex) %>%
  summarise(
    mean_jump = mean(highjump_cm, na.rm = TRUE),
    sd_jump   = sd(highjump_cm, na.rm = TRUE),
    n         = n()
  )
  
# Height per group
dfHighJump %>%
  group_by(sex) %>%
  summarise(
    mean_height = mean(height_cm, na.rm = TRUE),
    sd_height   = sd(height_cm, na.rm = TRUE),
    n         = n()
  )

# Total height mean
mean_height_total <- mean(dfHighJump$height_cm)

# Output 1: Mean jump
  sex    mean_jump sd_jump     n
  <chr>      <dbl>   <dbl> <int>
1 female      122.    9.10    21
2 male        135.    8.03    21

# Output 2: Mean height
  sex    mean_height sd_height     n
  <chr>        <dbl>     <dbl> <int>
1 female        166.      5.91    21
2 male          178.      5.54    21

# Output 3: Mean height total (both groups)
mean_height_total 
[1] 171.6119

We can now calculate adjusted means for female athletes:

\begin{aligned} \mu_F^{\text{adj}} &= \bar{Y}_F - b_w(\bar{X}_F - \bar{X}_t) \\ &= 122 - 0.9604\,(166 - 171.6119) \\ &= 122 - 0.9604\,(-5.6119) \\ &= 122 + 5.390 \\ &\approx 127.39 \text{ cm}. \end{aligned}

For male athletes:

\begin{aligned} \mu_M^{\text{adj}} &= \bar{Y}_M - b_w(\bar{X}_M - \bar{X}_t) \\ &= 135 - 0.9604\,(178 - 171.6119) \\ &= 135 - 0.9604\,(6.3881) \\ &= 135 - 6.135 \\ &\approx 128.87 \text{ cm}. \end{aligned}

The adjusted difference is much smaller than the raw difference (about 13 cm → about 1.5 cm), which is exactly what we expect when height strongly predicts high‑jump performance. It also matches the sexmale coefficient of 1.37 from Listing 1; the small discrepancy comes from rounding the group means.
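As a quick numerical check, the same arithmetic can be reproduced in R from the shared slope and the means reported in the outputs above:

```r
# Reproducing the adjusted-mean calculation with the reported values
b_w   <- 0.9604     # shared slope (height_cm) from Listing 1
x_bar <- 171.6119   # overall mean height
adjF  <- 122 - b_w * (166 - x_bar)   # adjusted female mean
adjM  <- 135 - b_w * (178 - x_bar)   # adjusted male mean
round(c(female = adjF, male = adjM, difference = adjM - adjF), 2)
```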

