Pearson Correlation Coefficient
📂 Mathematical Statistics

Definition

For two random variables $X, Y$, the Pearson Correlation Coefficient $\rho = \rho(X, Y)$ is defined as:
$$
\rho = { \operatorname{Cov} (X,Y) \over \sigma_X \sigma_Y }
$$
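To make the definition concrete, here is a minimal sketch in Python (NumPy, the small height/weight sample, and the helper `pearson_rho` are assumptions for illustration) that computes $\rho$ straight from the formula and compares it with NumPy's built-in `np.corrcoef`:

```python
import numpy as np

def pearson_rho(x: np.ndarray, y: np.ndarray) -> float:
    # Pearson's rho from the definition: Cov(X, Y) / (sigma_X * sigma_Y),
    # using population (1/n) moments throughout.
    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return cov / (x.std() * y.std())  # np.std defaults to the 1/n convention

heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weights = np.array([55.0, 58.0, 66.0, 70.0, 75.0])

print(pearson_rho(heights, weights))        # strong positive correlation, near 1
print(np.corrcoef(heights, weights)[0, 1])  # NumPy's built-in agrees
```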
Explanation

The Pearson Correlation Coefficient measures whether two variables have a (linear) correlation. If $\rho$ is close to $1$ or $-1$, the variables are considered correlated; if it is close to $0$, they are considered uncorrelated.
It is important to note that correlation and independence are not the same concept. Correlation only detects whether the two variables have a linear relationship. Lack of correlation does not necessarily imply independence, but independence does imply zero correlation. The converse holds only when the two variables are jointly normally distributed. The sketch below illustrates the caveat.
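A minimal sketch of this caveat, assuming NumPy and a synthetic sample: $Y = X^2$ is fully determined by $X$, yet the Pearson coefficient is (near) zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# X is symmetric about zero and Y = X^2 is fully determined by X,
# yet Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0, so rho is (near) zero.
x = rng.normal(size=100_000)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])  # close to 0 despite total dependence
```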
Properties

The Pearson correlation coefficient lies within the interval $[-1, 1]$. That is,
$$
-1 \le \rho \le 1
$$
Proof

Two methods of proof are presented.
Proof using the Cauchy-Schwarz inequality

$$
\rho = { \operatorname{Cov} (X,Y) \over \sigma_X \sigma_Y } = {1 \over n} \sum_{k=1}^{n} \left( { x_k - \mu_X \over \sigma_X } \right) \left( { y_k - \mu_Y \over \sigma_Y } \right)
$$
Squaring both sides gives
$$
\rho^2 = {1 \over n^2} \left\{ \sum_{k=1}^{n} \left( { x_k - \mu_X \over \sigma_X } \right) \left( { y_k - \mu_Y \over \sigma_Y } \right) \right\}^2
$$
The Cauchy-Schwarz inequality states:

$$
(a^2 + b^2)(x^2 + y^2) \ge (ax + by)^2
$$

or, in the general form used here, $\left( \sum_{k=1}^{n} a_k b_k \right)^2 \le \sum_{k=1}^{n} {a_k}^2 \sum_{k=1}^{n} {b_k}^2$.
By the Cauchy-Schwarz inequality,

$$
{1 \over n^2} \left\{ \sum_{k=1}^{n} \left( { x_k - \mu_X \over \sigma_X } \right) \left( { y_k - \mu_Y \over \sigma_Y } \right) \right\}^2 \le {1 \over n^2} \sum_{k=1}^{n} \left( { x_k - \mu_X \over \sigma_X } \right)^2 \sum_{k=1}^{n} \left( { y_k - \mu_Y \over \sigma_Y } \right)^2
$$
Rearranging the right side gives
$$
\begin{align*}
& {1 \over n^2} \sum_{k=1}^{n} \left( { x_k - \mu_X \over \sigma_X } \right)^2 \sum_{k=1}^{n} \left( { y_k - \mu_Y \over \sigma_Y } \right)^2
\\ =& {1 \over {\sigma_X}^2 {\sigma_Y}^2} \sum_{k=1}^{n} \left( { x_k - \mu_X \over \sqrt{n} } \right)^2 \sum_{k=1}^{n} \left( { y_k - \mu_Y \over \sqrt{n} } \right)^2
\\ =& {1 \over {\sigma_X}^2 {\sigma_Y}^2} {\sigma_X}^2 {\sigma_Y}^2
\\ =& 1
\end{align*}
$$
Since $\rho^2 \le 1$,
$$
-1 \le \rho \le 1
$$
■
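As a sanity check, here is a minimal sketch, assuming NumPy and a synthetic linearly related sample, that evaluates both sides of the Cauchy-Schwarz step above numerically:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1_000)
y = 2 * x + rng.normal(size=1_000)  # linearly related plus noise

n = len(x)
u = (x - x.mean()) / x.std()  # standardized terms (x_k - mu_X) / sigma_X
v = (y - y.mean()) / y.std()  # standardized terms (y_k - mu_Y) / sigma_Y

lhs = (u @ v) ** 2 / n**2       # rho^2, the squared sum from the proof
rhs = (u @ u) * (v @ v) / n**2  # product of sums; equals 1 after rearranging

print(f"rho^2 = {lhs:.4f} <= {rhs:.4f} = 1")  # the Cauchy-Schwarz bound in action
```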
Proof using the definition of covariance

Let $\operatorname{Var}(Y) = {\sigma_Y}^2$ and $\operatorname{Var}(X) = {\sigma_X}^2$, and define $\displaystyle Z = \frac{Y}{\sigma_Y} - \rho \frac{X}{\sigma_X}$. Expanding $\operatorname{Var}(Z)$ using the definition of covariance gives
$$
\begin{align*}
\operatorname{Var}(Z) &= \frac{1}{ {\sigma_Y}^2 } \operatorname{Var}(Y) + \frac{\rho^2}{ {\sigma_X}^2 } \operatorname{Var}(X) - 2 \frac{\rho}{\sigma_X \sigma_Y} \operatorname{Cov}(X, Y)
\\ &= \frac{1}{ {\sigma_Y}^2 } {\sigma_Y}^2 + \frac{\rho^2}{ {\sigma_X}^2 } {\sigma_X}^2 - 2 \rho \cdot \rho
\\ &= 1 + \rho^2 - 2\rho^2
\\ &= 1 - \rho^2
\end{align*}
$$
Since $\operatorname{Var}(Z) \ge 0$,
$$
\begin{align*}
1 - \rho^2 \ge 0 \implies& \; \rho^2 - 1 \le 0
\\ \implies& \; (\rho + 1)(\rho - 1) \le 0
\\ \implies& \; -1 \le \rho \le 1
\end{align*}
$$
■
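The identity $\operatorname{Var}(Z) = 1 - \rho^2$ can also be checked numerically. A minimal sketch, again assuming NumPy and a synthetic negatively correlated sample:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1_000_000)
y = -0.5 * x + rng.normal(size=1_000_000)  # negatively correlated pair

rho = np.corrcoef(x, y)[0, 1]
z = y / y.std() - rho * (x / x.std())      # Z = Y/sigma_Y - rho * X/sigma_X

# Var(Z) matches 1 - rho^2 and is nonnegative, exactly as derived above.
print(f"Var(Z)    = {z.var():.6f}")
print(f"1 - rho^2 = {1 - rho**2:.6f}")
```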