Data Interview Question

Explaining Pearson's Correlation Range

Solution & Explanation

Pearson's correlation coefficient, denoted $r$, measures the strength and direction of the linear relationship between two variables $X$ and $Y$. Its value always lies between -1 and 1. To see why, we can look at the formula that defines $r$ and at the inequality that confines it to this range.

Mathematical Definition

The correlation coefficient $r$ is defined as:

$$\text{corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Where:

  • $\text{Cov}(X, Y)$ is the covariance between $X$ and $Y$.
  • $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$ respectively. Both must be nonzero; the correlation is undefined when either variable is constant.
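The definition translates almost directly into code. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the sample data (`hours`, `scores`) are illustrative assumptions, not part of the original solution.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed as Cov(X, Y) / (sigma_X * sigma_Y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population (divide-by-n) covariance and standard deviations. The 1/n
    # factors cancel in the ratio, so dividing by n - 1 would give the same r.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov_xy / (sigma_x * sigma_y)

# Hypothetical sample: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 70.0, 72.0]
r = pearson_r(hours, scores)
print(round(r, 4))
```

For this sample the covariance is 11 and the product of the standard deviations is about 11.2, so the ratio is close to, but below, 1.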

Covariance and Variance

Covariance measures how much two random variables change together, while variance measures how much a single variable deviates from its mean. The covariance can be expressed as:

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

Where $E$ denotes the expected value, and $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$ respectively.

Cauchy-Schwarz Inequality

Applied to the centered variables $X - \mu_X$ and $Y - \mu_Y$, the Cauchy-Schwarz inequality states:

$$|E[(X - \mu_X)(Y - \mu_Y)]|^2 \leq E[(X - \mu_X)^2] \, E[(Y - \mu_Y)^2]$$

Since $E[(X - \mu_X)^2] = \text{Var}(X)$ and $E[(Y - \mu_Y)^2] = \text{Var}(Y)$, this can be rewritten in terms of covariance and variance:

$$|\text{Cov}(X, Y)|^2 \leq \text{Var}(X) \cdot \text{Var}(Y)$$

Taking the square root of both sides, with $\sigma_X = \sqrt{\text{Var}(X)}$ and $\sigma_Y = \sqrt{\text{Var}(Y)}$, gives:

$$|\text{Cov}(X, Y)| \leq \sigma_X \sigma_Y$$
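The inequality can also be checked numerically. The sketch below draws random samples and confirms that the absolute covariance never exceeds the product of the standard deviations. The helper name `cov_and_sigmas`, the seed, the distributions, and the sample sizes are arbitrary choices made for illustration, and a passing check is evidence rather than a proof.

```python
import math
import random

def cov_and_sigmas(xs, ys):
    """Return (Cov(X, Y), sigma_X, sigma_Y) using divide-by-n estimates."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy, sigma_x, sigma_y

rng = random.Random(0)  # fixed seed so the run is repeatable
largest_ratio = 0.0
for _ in range(1000):
    xs = [rng.gauss(0.0, 1.0) for _ in range(20)]
    ys = [rng.uniform(-5.0, 5.0) for _ in range(20)]
    cov_xy, sigma_x, sigma_y = cov_and_sigmas(xs, ys)
    # Small tolerance to absorb floating-point rounding.
    assert abs(cov_xy) <= sigma_x * sigma_y + 1e-12
    largest_ratio = max(largest_ratio, abs(cov_xy) / (sigma_x * sigma_y))

print(largest_ratio)  # largest |Cov| / (sigma_X * sigma_Y) seen across trials
```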

Deriving the Range

Dividing both sides by $\sigma_X \sigma_Y$, which is positive whenever neither variable is constant, gives $|\text{corr}(X, Y)| \leq 1$, that is:

$$-1 \leq \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \leq 1$$

Thus, the correlation coefficient $r$ is bounded between -1 and 1.
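A short experiment shows where the endpoints of this range are reached. It relies on a standard fact that is not derived above: equality in the Cauchy-Schwarz inequality holds exactly when one centered variable is a scalar multiple of the other, which means $Y$ is an exact linear function of $X$. The function `pearson_r` and all data values below are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from centered values (divide-by-n estimates)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov_xy = sum(a * b for a, b in zip(dx, dy)) / n
    sigma_x = math.sqrt(sum(a * a for a in dx) / n)
    sigma_y = math.sqrt(sum(b * b for b in dy) / n)
    return cov_xy / (sigma_x * sigma_y)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
increasing = [2.0 * x + 3.0 for x in xs]   # exact line, positive slope
decreasing = [-0.5 * x + 1.0 for x in xs]  # exact line, negative slope
noisy = [3.1, 4.7, 7.4, 8.8, 11.2]         # roughly linear, with scatter

r_up = pearson_r(xs, increasing)
r_down = pearson_r(xs, decreasing)
r_noisy = pearson_r(xs, noisy)
print(r_up, r_down, r_noisy)
```

Up to floating-point rounding, the two exact lines give the endpoints 1 and -1, while the scattered data lands strictly inside the interval.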

Geometric Interpretation

Pearson's correlation can also be interpreted geometrically as the cosine of the angle $\theta$ between the centered data vectors $X - \mu_X$ and $Y - \mu_Y$:

$$\text{corr}(X, Y) = \cos(\theta)$$

Since the range of cosine is $[-1, 1]$, this provides an intuitive geometric reason why $r$ must lie between -1 and 1.
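The geometric view can be sketched in a few lines: center each sample, treat the results as vectors, and compute the cosine of the angle between them as the dot product divided by the product of the lengths. The helper names `centered` and `cosine` and the data are hypothetical.

```python
import math

def centered(values):
    """Subtract the mean so the vector is measured from its own average."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cosine(u, v):
    """Cosine of the angle between vectors u and v: (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 55.0, 61.0, 70.0, 72.0]

cos_theta = cosine(centered(x), centered(y))
# Clamp before acos in case rounding pushes the value just outside [-1, 1].
theta_degrees = math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))
print(cos_theta, theta_degrees)
```

Here `cos_theta` matches the value obtained from the covariance formula for the same data, because the sample-size factors in the covariance and the standard deviations cancel and leave exactly the dot product over the product of the norms.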

Conclusion

The Pearson correlation coefficient is a standardized measure of the linear relationship between two variables, confined to the range [-1, 1] due to the properties of covariance, variance, and the Cauchy-Schwarz inequality. This range is further reinforced by the geometric interpretation of correlation as the cosine of the angle between two vectors.