Q-Q Plots from Scratch

In this Statistics Note, we learn how to create Q-Q plots from scratch in R. A quantile-quantile plot (Q-Q plot for short) is a scatter plot that shows how closely an empirical distribution approximates a theoretical distribution. Before we dive into Q-Q plots, let’s clarify an important concept: quantiles.

What is a quantile?

A quantile is the value x in a distribution such that the cumulative distribution function (CDF) at x equals a given probability p, i.e.,

$$F(x_i) = p_i$$

For example:

  • The median is the 0.5 quantile (it splits the data into two equal halves).
  • The first quartile (Q1) is the 0.25 quantile.
  • The third quartile (Q3) is the 0.75 quantile.
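
As a small illustration in R, assuming the standard normal distribution purely as an example, the quantile function `qnorm` inverts the CDF `pnorm`, so applying one after the other recovers the original probabilities:

```r
# Probabilities for the first quartile, the median, and the third quartile
p <- c(0.25, 0.50, 0.75)

# Quantile function (inverse CDF) of the standard normal distribution
x <- qnorm(p)

# Applying the CDF to the quantiles recovers the probabilities: F(x_i) = p_i
p_back <- pnorm(x)

print(x)
print(p_back)
```

The median of the standard normal distribution is 0, and the two quartiles are symmetric around it.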

Quantiles help us understand how data are distributed and are the foundation of Q-Q plots. In an empirical distribution, the sorted observed values serve as the sample quantiles. To assign each sample quantile a probability, we create n evenly spaced points in the interval (0, 1). Each quantile needs its own probability, so we construct as many probabilities as there are observations in the sample.
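
The pairing of sample quantiles and probabilities can be sketched as follows. The five observations are made up for illustration, and the spacing i/(n + 1) is one common convention among several (it is the one used later in Listing 1):

```r
# Hypothetical sample of five observations (made up for illustration)
y <- c(4.1, 2.7, 5.3, 3.8, 3.0)

# The sorted values act as the sample quantiles
sample_quantiles <- sort(y)

# One probability per observation, evenly spaced strictly inside (0, 1)
n <- length(y)
p <- seq_len(n) / (n + 1)

# Each sample quantile paired with its probability
print(data.frame(probability = p, sample_quantile = sample_quantiles))
```

Dividing by n + 1 rather than n keeps every probability away from 0 and 1, where the quantile function of an unbounded distribution such as the normal would return infinite values.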

What is a Q-Q Plot?

Q-Q plots are often used to check whether observed data follow a normal distribution by comparing the quantiles of the data with the quantiles of the theoretical normal distribution. In addition, Q-Q plots can be used to compare any empirical distribution with any other theoretical continuous distribution, such as the exponential, gamma, and chi distributions. The empirical distribution is considered to agree with the theoretical distribution if the points formed by their quantiles fall approximately on a straight line.
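
To illustrate a non-normal comparison, the following minimal sketch plots a simulated sample against exponential quantiles. The data are generated with `rexp` and a fixed seed, so they are an assumption of this example rather than real measurements:

```r
set.seed(1)                       # fixed seed so the sketch is reproducible
y <- sort(rexp(100, rate = 2))    # simulated sample, not real data

# One probability per observation, evenly spaced inside (0, 1)
n <- length(y)
p <- seq_len(n) / (n + 1)

# Theoretical quantiles of the unit-rate exponential distribution
q <- qexp(p, rate = 1)

plot(q, y,
     xlab = "Exponential Quantiles",
     ylab = "Sample Quantiles",
     main = "Q-Q Plot of Simulated Data vs Exponential Distribution")
```

The sample rate (2) differs from the rate of the reference distribution (1), yet the points should still follow a straight line, because a change of scale only changes the slope of that line.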

Q-Q plot algorithms are implemented in common statistical analysis software, such as R, SPSS, and SAS. However, it is good practice to implement the Q-Q plotting algorithm in explicit steps, either for learning purposes or for cases where a theoretical distribution (such as the smallest extreme value distribution) is not available in the software.
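
When a distribution has no built-in quantile function, one can be written by inverting the CDF by hand. The sketch below does this for the smallest extreme value distribution, assuming the parameterization F(x) = 1 - exp(-exp((x - mu) / sigma)). The function names `qsev` and `psev` are hypothetical helpers defined here, not base R functions:

```r
# Hypothetical helper (not part of base R): quantile function of the
# smallest extreme value distribution with location mu and scale sigma,
# obtained by solving F(x) = 1 - exp(-exp((x - mu) / sigma)) = p for x.
qsev <- function(p, mu = 0, sigma = 1) {
  mu + sigma * log(-log(1 - p))
}

# Matching CDF, used here only to check the inversion
psev <- function(x, mu = 0, sigma = 1) {
  1 - exp(-exp((x - mu) / sigma))
}

p <- (1:9) / 10
q <- qsev(p)

# Should print TRUE: the CDF applied to the quantiles returns p
print(isTRUE(all.equal(psev(q), p)))
```

A function like `qsev` can then take the place of `qnorm` when computing the theoretical quantiles of a Q-Q plot.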

Creating Q-Q Plots in R Step by Step

We can compare the empirical distribution of our collected data with a theoretical distribution, such as the normal distribution, in the following steps (Wicklin, 2013):

  1. Sort the collected data.
  2. Create a probability vector that has as many probabilities as our sample size (e.g., 50 probabilities if our sample size is 50).
  3. Compute the theoretical quantiles of those probabilities from Step 2 using a quantile function in software.
  4. Use a scatter plot to plot the sorted data (i.e., empirical quantiles) against the theoretical quantiles (from Step 3).
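
The four steps can also be wrapped in a small reusable function. This is only a sketch: `qq_points` is a hypothetical helper name, the i/(n + 1) spacing is one of several possible choices for Step 2, and the sample is simulated:

```r
# Hypothetical helper: returns the coordinates of a Q-Q plot.
#   y    : numeric vector of observations
#   qfun : quantile function of the theoretical distribution
#   ...  : extra arguments passed on to qfun (e.g., rate, shape)
qq_points <- function(y, qfun = qnorm, ...) {
  y <- sort(y)                     # Step 1: sort (sort() drops NA by default)
  n <- length(y)
  p <- seq_len(n) / (n + 1)        # Step 2: one probability per observation
  q <- qfun(p, ...)                # Step 3: theoretical quantiles
  data.frame(theoretical = q, empirical = y)   # Step 4: points to plot
}

set.seed(42)
pts <- qq_points(rnorm(50, mean = 10, sd = 2))  # simulated data

plot(pts$theoretical, pts$empirical,
     xlab = "Theoretical Quantiles",
     ylab = "Sample Quantiles")
```

Passing a different quantile function, for example `qq_points(y, qfun = qexp, rate = 1)`, switches the comparison to another theoretical distribution without changing the steps.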

Tied observations share the same empirical quantile, so in the plot they appear as a run of points at the same data value. In Listing 1, we implement these steps in R to check how closely the arsenic level data approximate a normal distribution, using the quantiles of the standard normal distribution as the reference. The data can be downloaded from the URL used in Listing 1, or read directly by the R code there.

Listing 1: R code to draw a Q-Q plot in steps.
# Q-Q plot step-by-step
# We want to see if our data approximates normal distribution
dfArsenic <- read.csv("https://www.statisticswithr.com/datasets/dsArsenicLevel.csv")

# Step 1: sort the observed data
arsenicLevel <- sort(dfArsenic$Arsenic_Level)

# Step 2: compute n evenly spaced points in the interval (0,1)
n <- length(arsenicLevel)
p <- (1:n) / (n + 1)

# Step 3: compute the inverse CDF (quantiles) of the evenly spaced points
q <- qnorm(p)

# Step 4: Plot sorted observed data versus the quantiles
plot(q, arsenicLevel,
     xlab="Theoretical Quantiles",
     ylab="Arsenic Quantiles",
     main="QQ Plot of Arsenic Levels vs Normal Distribution")
     

Figure 1 shows the Q-Q plot produced from R code in Listing 1.

Figure 1: Q-Q plot for normal distribution.
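
As a cross-check, base R's `qqnorm` can return its plotting coordinates without drawing anything. Note that `qqnorm` takes its probabilities from `ppoints(n)` rather than from i/(n + 1) as in Listing 1, so the theoretical quantiles differ slightly, mostly in the tails. The sketch below uses simulated data as a stand-in for a real sample:

```r
set.seed(7)
y <- rnorm(30)                    # simulated data, for illustration only
n <- length(y)

# Theoretical quantiles computed as in Listing 1
manual_q <- qnorm(seq_len(n) / (n + 1))

# Coordinates used by the built-in function, without drawing the plot
builtin <- qqnorm(y, plot.it = FALSE)
builtin_q <- sort(builtin$x)

# Largest absolute difference between the two sets of theoretical quantiles
print(max(abs(manual_q - builtin_q)))
```

Both choices of probabilities lead to the same visual judgement in practice; they only shift the most extreme points a little along the horizontal axis.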

Because the points in Figure 1 fall approximately on a straight line, we can conclude that our sample data approximate a normal distribution.
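
A reference line can make the "close to a straight line" judgement easier. The sketch below draws a line through the first and third quartiles, which is the same construction base R's `qqline` uses by default. The data are simulated here and stand in for a real sample:

```r
set.seed(123)
y <- sort(rnorm(50, mean = 5, sd = 1.5))   # simulated data, not real measurements
n <- length(y)
q <- qnorm(seq_len(n) / (n + 1))           # theoretical quantiles as in Listing 1

# Reference line through (theoretical quartile, sample quartile) at Q1 and Q3
y_q <- quantile(y, c(0.25, 0.75), names = FALSE)
x_q <- qnorm(c(0.25, 0.75))
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

plot(q, y,
     xlab = "Theoretical Quantiles",
     ylab = "Sample Quantiles")
abline(a = intercept, b = slope, col = "red")
```

For data that are roughly normal, the intercept and slope of this line give rough estimates of the mean and standard deviation, and systematic curvature away from the line points to skewness or heavy tails.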

References

Almeida, A., Loy, A., & Hofmann, H. (2018). ggplot2 Compatible Quantile-Quantile Plots in R.

Wicklin, R. (2013). Simulating data with SAS. SAS Institute.

Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55(1), 1-17.
