STATISTICS with R

A comprehensive guide to statistical analysis in R

Logistic Regression in R

Logistic regression is a statistical method that models the relationship between a set of independent variables and a binary dependent variable. A binary variable has two possible outcomes, such as heads or tails, success or failure, or having hypertension versus not having hypertension. Logistic regression models the nonlinear relationship between the independent variables and this binary outcome variable.

Introduction to Logistic Regression

When the relationship between a dependent variable and a set of independent variables is linear and the dependent variable is normally distributed, we can model the relationship using linear regression. However, in some research studies the dependent variable is not continuous; instead, it takes two discrete values, such as pass / fail, recovered / not recovered, has diabetes / does not have diabetes, or spam / ham. Variables with two values are called dichotomous or binary variables.

The implication of having a dichotomous variable as the dependent variable is that we cannot use linear regression to model its relationship with the independent variables, because the relationship is now nonlinear.

In linear regression, the mathematical function that links the dependent variable to the independent variables is appropriate for a continuous family of distributions (e.g., the normal distribution). However, when the response variable takes only two values (dichotomous or binary), we need to replace the normal family with a function that can model binary data. Dichotomous or binary data follow the binomial distribution, and the logistic function is an appropriate choice for modeling binomial data when the dependent variable is dichotomous.

Logistic regression is a regression model that uses the logistic function to model the relationship between a dichotomous (binary) dependent variable and a set of independent variables. For example, a researcher may be interested in examining the relationship between variables such as blood pressure or blood glucose level and whether or not a participant has developed diabetes. In this example, the dependent variable has two values: no diabetes (0) and diabetes (1). Logistic regression can model such a nonlinear relationship.

In the following sections, we introduce an example data set and demonstrate, step by step, how to model the relationship between an independent variable and a dichotomous dependent variable using a simple logistic regression model in R.

Logistic Regression Example

Does blood pressure predict diabetes in female patients?


A public health researcher wants to know which predictor variables are related to diabetes and whether any of them have a relationship strong enough to serve as a warning indicator for the condition.

For this purpose, the researcher collects data from a sample of participants on blood pressure, glucose level, body mass index (BMI), and whether or not each participant has diabetes (a dichotomous dependent, or outcome, variable). Table 1 shows the blood pressure, glucose level, BMI, and diabetes status (0 = without diabetes; 1 = with diabetes) of five participants. (These data are from the Pima Indians Diabetes Dataset, originally published by the National Institute of Diabetes and Digestive and Kidney Diseases and collected from 768 female participants in Arizona, USA.)

Table 1: Blood Pressure, Glucose Level, BMI, and Diabetes Status
Patient     Blood Pressure   Glucose Level   BMI    Diabetes Status
Patient 1        66               89         28.1          0
Patient 2        50               78         31.0          1
Patient 3        70              197         30.5          1
Patient 4        60               89         30.1          1
Patient 5        72              166         25.8          1

The researcher downloads the data from the official portal and saves them in CSV format. In this example, we select only blood pressure as the independent (predictor) variable, although the complete dataset contains more variables than those listed in Table 1. The complete data set for this example can be downloaded from here.

Analysis: Logistic Regression in R

In the first step, the data are read into RStudio using the read.csv() function. In this example, the dependent variable is the diabetes status of the patients, and the independent (predictor) variable is their blood pressure. Figure 1 shows the mean blood pressure in the sample for patients with and without diabetes (74.5 and 68.2, respectively).

Figure 1: Mean blood pressure in patients with and without diabetes.
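
As a quick check before fitting the model, we can reproduce the group means shown in Figure 1 with a few lines of base R. This is a minimal sketch using the file and column names from Listing 1 (dsPimaDiabetes.csv, Outcome, BloodPressure); because the zero readings are treated as missing, the computed means should closely match those in Figure 1.

# Read the data and compare mean blood pressure by diabetes status
dfDiabetes <- read.csv("dsPimaDiabetes.csv")

# Treat blood pressure readings of 0 as missing before averaging
dfDiabetes$BloodPressure[dfDiabetes$BloodPressure == 0] <- NA

# Mean blood pressure for patients without (0) and with (1) diabetes
tapply(dfDiabetes$BloodPressure, dfDiabetes$Outcome, mean, na.rm = TRUE)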

To conduct a logistic regression with Diabetes as the outcome and Blood pressure as the predictor, we use the glm() function in R. The glm() function is a built-in R function, so we do not need to install or load a package to use it. We use the formula notation (~) to express the relationship between the dependent variable (Outcome, on the left-hand side) and the independent variable (BloodPressure, on the right-hand side).

The R code in Listing 1 shows how to perform a simple logistic regression modeling the relationship between one dichotomous outcome variable (the diabetes condition) and one continuous independent variable (blood pressure). Note that the original data encode missing values as 0. We need to convert these 0's (in the independent variables) to NA, the code R recognizes as missing values (besides omitting missing values, there are statistical methods to impute them if the reason for the missingness can be determined). We do not change the 0 values in the outcome variable, as those 0's may be valid values.

Listing 1: R code to run logistic regression.
library(dplyr)
dfDiabetes <- read.csv("dsPimaDiabetes.csv")
 
# Convert missing values of 0 to NA
dfDiabetes <- dfDiabetes %>% mutate(across(-Outcome, ~replace(., . == 0, NA)))
dfDiabetes <- na.omit(dfDiabetes) # Remove missing values
 
# Logistic regression model with one predictor / independent variable
lrModelDiabetes <- glm(Outcome ~ BloodPressure, data = dfDiabetes, family = binomial)
 
summary(lrModelDiabetes)

# Output
Call:
glm(formula = Outcome ~ BloodPressure, family = binomial, data = dfDiabetes)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.93722    0.77384  -5.088 3.62e-07 ***
BloodPressure  0.04528    0.01057   4.282 1.85e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 426.34  on 335  degrees of freedom
Residual deviance: 405.86  on 334  degrees of freedom
AIC: 409.86

Number of Fisher Scoring iterations: 4

# Convert to odds 
exp(coef(lrModelDiabetes))
  (Intercept) BloodPressure 
   0.01950234    1.04632373

The code in Listing 1 begins by importing the data into RStudio. Next, we convert the 0 values (in the independent variables only) to NA. The dependent variable also contains 0's, but those represent the absence of the diabetes condition and are valid values, so we must keep them unchanged.

To conduct logistic regression in R, we use the glm() function. The parameters of the glm() function include the name of the dependent variable (Outcome, which is the diabetes condition, a binary variable), the name of the independent variable (Blood pressure, a continuous variable), the name of the data set the two variables are in, and the family of distribution (binomial, which is used for dichotomous or binary variables). The model and its parameters are given a name, lrModelDiabetes (any arbitrary name).

The logistic regression output from the glm() function modeling the relationship between Diabetes (named Outcome in the code in Listing 1) and Blood pressure contains a fair amount of information. We explain the most important lines of the output below.

The Call line restates our model formula. The Coefficients table contains the most important information we are interested in.

The intercept is the point on the y-axis (the dependent variable) where the curve of best fit crosses it, that is, where the x value (the independent variable) is zero. In our example, the intercept is -3.937, meaning that when Blood pressure is zero, we theoretically expect the log odds of having diabetes (compared to not having diabetes) to be -3.937.

Notice that we interpreted both the intercept and the slope (the effect of blood pressure) using the term log odds. Log odds is also known as the logit, hence the name logistic regression. This means that the scale of the original dichotomous data has been changed to log odds, or logits, in the R output. However, a more intuitive way to interpret the intercept and the coefficients is to use odds. Therefore, we transform log odds to odds using exponentiation (called the antilog in old-school days). Exponentiation removes the log in log odds and leaves us with odds only.

Exponentiation is written in mathematics and in R with the notation exp(). In math we also use the notation e^x, where e is the base of the natural log, with approximate value 2.718. The last lines of the R code in Listing 1 show how we use exp() to transform the coefficients from the log odds scale to odds. Interpreted as odds, the intercept now reads as, “when Blood pressure is zero, the odds of having diabetes are 0.019.” Of course, blood pressure cannot be zero in living humans, so we may ignore the interpretation of the intercept in this example.

The slope (the BloodPressure coefficient) on the odds scale is 1.046, interpreted as, “for a one-unit increase in Blood pressure, the odds of having diabetes are multiplied by 1.046.” We can convert this odds ratio to a percentage change by subtracting 1: 1.046 - 1 = 0.046, meaning the odds of having diabetes increase by around 5% when blood pressure increases by 1 unit. Generally, an odds ratio larger than 1 indicates an increase, while an odds ratio smaller than 1 indicates a decrease.
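
This percentage can be computed directly from the fitted model in Listing 1; a quick sketch using the lrModelDiabetes object:

# Odds ratio for BloodPressure and the implied percent change in the odds
oddsRatio <- exp(coef(lrModelDiabetes))["BloodPressure"]
(oddsRatio - 1) * 100  # about 4.6, i.e., roughly a 5% increase per unit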

Similar to linear regression, we can write the result of the logistic regression analysis in R as the following equation (model) relating Blood pressure to Diabetes:

Expected log odds of Diabetes = -3.937 + 0.045*(Blood pressure).

Remember that the equation above returns an expected value in log odds (or logit). To obtain the odds, we need to exponentiate the result, as demonstrated in the following prediction example.

Suppose a female patient’s blood pressure is 90. We predict the log odds of her having diabetes to be:

Expected log odds of Diabetes = -3.937 + 0.045*(90) = 0.113

So, the log odds of having diabetes for a female patient with a blood pressure of 90 is estimated to be 0.113. But we are more interested in the odds of developing diabetes, so we exponentiate the log odds to remove the log and obtain the odds: exp(0.113) = 1.119, which is larger than 1 and therefore indicates an increase. So, the odds of developing diabetes for a female patient with a blood pressure of 90 are 1.119.
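
The same prediction can be obtained from the fitted model in Listing 1 with R's predict() function. Note that R uses the unrounded coefficients, so the results (about 0.14 on the log odds scale and 1.15 on the odds scale) differ slightly from the rounded hand calculation above.

# Log odds and odds of diabetes for a patient with blood pressure of 90
newPatient <- data.frame(BloodPressure = 90)
logOdds <- predict(lrModelDiabetes, newdata = newPatient, type = "link")
logOdds       # about 0.14 (log odds scale)
exp(logOdds)  # about 1.15 (odds scale)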

In addition to odds, we can also calculate the probability of having diabetes with a blood pressure of 90. To calculate our prediction in terms of probability, we use the following equation:

Probability = odds / (1 + odds)

In our example, we calculated odds to be 1.119, so the probability of having diabetes for a female patient with blood pressure of 90 is,

Probability = 1.119 / (1 + 1.119) = 0.528

The probability of having diabetes for a patient with a blood pressure of 90 is about 0.528.
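
R can also return this probability directly: with type = "response", predict() applies the inverse logit for us. Because R uses the unrounded coefficients, the value (about 0.53) is close to, but not exactly, the rounded 0.528 computed by hand above.

# Predicted probability of diabetes for a patient with blood pressure of 90
newPatient <- data.frame(BloodPressure = 90)
predict(lrModelDiabetes, newdata = newPatient, type = "response")  # about 0.53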

But how about the overall fit of the model? Does the model fit the data well? How far off is the model from a perfect model? We can find some of this information in the Null deviance and Residual deviance lines of the R output. Null here means a regression model with no predictors, just the intercept. When we add a predictor variable, this value changes, and the new value is called the Residual deviance.

The Null deviance in our logistic regression model is 426.34. If we add a good predictor, we expect the deviance from the perfect model to decrease. The Residual deviance shows that by adding a predictor (Blood pressure), the deviance has decreased by about 20 units, to 405.86. So the fit of the model has improved compared to a model with no predictors, implying that it is worth retaining the predictor. But is a 20-unit reduction statistically significant?

We can test whether the reduction in deviance is statistically significant using a model comparison based on the likelihood ratio test. In R, we can pass our logistic regression model to the built-in anova() function to test whether the deviance reduction is statistically significant. Listing 2 shows how to perform this model comparison in R to see whether two logistic regression models (one without any predictors and one with a predictor) differ significantly in how well they fit the data, using the deviance metric.

Listing 2: R code to test model deviance in logistic regression.
> anova(lrModelDiabetes, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Outcome

Terms added sequentially (first to last)


              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                            335     426.34              
BloodPressure  1   20.476       334     405.86 6.038e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In Listing 2, the function anova() takes as input the logistic regression model we fit in Listing 1 and calculates the difference between the Null deviance and the Residual deviance. Because this is a likelihood ratio-based test, the distribution of the differences is chi-squared, hence the test = "Chisq" argument.

In the results of the deviance test of model goodness of fit, we can see the same values from Listing 1 for the Null and Residual deviances (the model with the predictor appears as BloodPressure in the output), their difference in fit (20.476), and, in the last column, whether this difference is statistically significant (it is). Therefore, we conclude that the model with our predictor (Blood pressure) is a significant improvement over the model without it (the Null model). In other words, our logistic regression model fits the data significantly better than an intercept-only model.
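
As a cross-check on the anova() output in Listing 2, the same likelihood ratio test can be computed by hand from the quantities stored in the fitted model object:

# Deviance reduction and its chi-squared p-value, computed manually
devianceDrop <- lrModelDiabetes$null.deviance - lrModelDiabetes$deviance  # about 20.48
dfDrop <- lrModelDiabetes$df.null - lrModelDiabetes$df.residual           # 1
pchisq(devianceDrop, df = dfDrop, lower.tail = FALSE)                     # about 6.0e-06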

Appendix

Mathematics of the Logistic Function and Logistic Regression

The canonical form of the logistic function is:

f(x) = \frac{L}{1 + e^{-k(x - x_0)}}

where,

  • L is the curve’s maximum value,
  • k is the logistic growth rate, or the steepness of the curve,
  • x_0 is the x-value of the sigmoid curve’s midpoint.

If we set L = 1, k = 1, and x_0 = 0, the standard form of the logistic function becomes:

f(x) = \frac{1}{1 + e^{-x}}
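
As a small illustration, the standard logistic function can be evaluated and plotted in base R; note that plogis() is R's built-in implementation of the same function.

# Standard logistic function: f(x) = 1 / (1 + exp(-x))
logistic <- function(x) 1 / (1 + exp(-x))

x <- seq(-6, 6, by = 0.1)
plot(x, logistic(x), type = "l",
     main = "Standard logistic function", xlab = "x", ylab = "f(x)")

# plogis() gives the same values
all.equal(logistic(x), plogis(x))  # TRUE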
