

Pearson Correlation Coefficient

Definition 1

For two random variables $X, Y$, the term $\rho = \rho (X, Y)$ defined as follows is called the Pearson correlation coefficient.
$$ \rho = \frac{\operatorname{Cov} (X,Y)}{\sigma_X \sigma_Y} $$


  • $\sigma_{X}$ and $\sigma_{Y}$ are the standard deviations of $X$ and $Y$, respectively.
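
As a sanity check on the definition, $\rho$ can be computed directly from sample data. Below is a minimal sketch in Python (assuming NumPy; the data and variable names are purely illustrative) that evaluates $\operatorname{Cov}(X,Y) / (\sigma_X \sigma_Y)$ by hand and compares it with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data: y is a noisy linear function of x
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(scale=0.5, size=1_000)

# Pearson correlation straight from the definition: Cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

print(rho)                      # close to 1, since y is almost a linear function of x
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in value agrees
```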

Explanation

The (Pearson) correlation coefficient is a measure used to determine whether two variables have a (linear) correlation. If it is close to $1$ or $-1$, the two variables are considered correlated, and if it is close to $0$, they are considered uncorrelated.
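
To make this interpretation concrete, here is an illustrative sketch (assuming NumPy; the noise levels are arbitrary choices) that produces sample correlations near $1$, $-1$, and $0$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

y_pos = x + rng.normal(scale=0.1, size=x.size)   # strong positive linear relation
y_neg = -x + rng.normal(scale=0.1, size=x.size)  # strong negative linear relation
y_none = rng.normal(size=x.size)                 # no relation to x at all

for y in (y_pos, y_neg, y_none):
    print(np.corrcoef(x, y)[0, 1])  # roughly 1, -1, and 0, in that order
```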

It is important to note that correlation and independence are not the same concept. Correlation only measures how closely two variables follow a linear relationship. Having no correlation does not necessarily mean independence, as the sketch below shows; however, if two variables are independent, they are uncorrelated. The converse, that uncorrelated variables are independent, holds only when the two variables are (jointly) normally distributed.
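
A standard counterexample: if $X$ is symmetric about $0$ and $Y = X^2$, then $\operatorname{Cov}(X, Y) = E[X^3] = 0$, so $X$ and $Y$ are uncorrelated even though $Y$ is a deterministic function of $X$. A minimal sketch (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)   # symmetric about 0
y = x ** 2                     # completely determined by x, hence not independent

# Cov(X, X^2) = E[X^3] = 0 for a distribution symmetric about 0,
# so the sample correlation is near 0 despite total dependence.
print(np.corrcoef(x, y)[0, 1])
```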

Properties

The Pearson correlation coefficient always lies in the interval $[-1, 1]$. That is,
$$ -1 \le \rho \le 1 $$

Proof

We introduce two methods of proof.

Proof using the Cauchy-Schwarz inequality

Regard $X$ and $Y$ as samples $x_1, \dots, x_n$ and $y_1, \dots, y_n$ with means $\mu_X, \mu_Y$, so that $\sigma_X^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu_X)^2$ and likewise for $\sigma_Y^2$. Then
$$ \rho = \frac{\operatorname{Cov} (X,Y)}{\sigma_X \sigma_Y} = \frac{1}{n} \sum_{k=1}^{n} \left( \frac{x_k - \mu_X}{\sigma_X} \right) \left( \frac{y_k - \mu_Y}{\sigma_Y} \right) $$
Squaring both sides gives
$$ \rho^2 = \frac{1}{n^2} \left\{ \sum_{k=1}^{n} \left( \frac{x_k - \mu_X}{\sigma_X} \right) \left( \frac{y_k - \mu_Y}{\sigma_Y} \right) \right\}^2 $$

Cauchy-Schwarz inequality: $(a^2 + b^2)(x^2 + y^2) \ge (ax + by)^2$, or in its general form,
$$ \left( \sum_{k=1}^{n} a_k b_k \right)^2 \le \left( \sum_{k=1}^{n} a_k^2 \right) \left( \sum_{k=1}^{n} b_k^2 \right) $$

By the Cauchy-Schwarz inequality,
$$ \frac{1}{n^2} \left\{ \sum_{k=1}^{n} \left( \frac{x_k - \mu_X}{\sigma_X} \right) \left( \frac{y_k - \mu_Y}{\sigma_Y} \right) \right\}^2 \le \frac{1}{n^2} \sum_{k=1}^{n} \left( \frac{x_k - \mu_X}{\sigma_X} \right)^2 \sum_{k=1}^{n} \left( \frac{y_k - \mu_Y}{\sigma_Y} \right)^2 $$
Rearranging the terms on the right gives
$$ \begin{align*} & \frac{1}{n^2} \sum_{k=1}^{n} \left( \frac{x_k - \mu_X}{\sigma_X} \right)^2 \sum_{k=1}^{n} \left( \frac{y_k - \mu_Y}{\sigma_Y} \right)^2 \\ =& \frac{1}{\sigma_X^2 \sigma_Y^2} \sum_{k=1}^{n} \left( \frac{x_k - \mu_X}{\sqrt{n}} \right)^2 \sum_{k=1}^{n} \left( \frac{y_k - \mu_Y}{\sqrt{n}} \right)^2 \\ =& \frac{1}{\sigma_X^2 \sigma_Y^2} \, \sigma_X^2 \sigma_Y^2 \\ =& 1 \end{align*} $$
Since $\rho^2 \le 1$,
$$ -1 \le \rho \le 1 $$
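
The inequality in this proof is easy to check numerically: for any sample, the squared sum of products of standardized deviations never exceeds the product of the squared sums. An illustrative sketch (assuming NumPy, with population moments, i.e. `ddof=0`, as in the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.normal(size=500)

# Standardized deviations, exactly as in the proof
a = (x - x.mean()) / x.std()
b = (y - y.mean()) / y.std()

lhs = np.sum(a * b) ** 2               # (sum a_k b_k)^2
rhs = np.sum(a ** 2) * np.sum(b ** 2)  # (sum a_k^2)(sum b_k^2) = n * n here

print(lhs <= rhs)                        # True, by Cauchy-Schwarz
print(abs(np.sum(a * b) / x.size) <= 1)  # hence |rho| <= 1
```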

Proof using the definition of covariance

Let $\operatorname{Var}(Y) = \sigma_Y^2$, $\operatorname{Var}(X) = \sigma_X^2$, and $Z = \dfrac{Y}{\sigma_Y} - \rho \dfrac{X}{\sigma_X}$. Then, by the properties of variance and covariance,
$$ \begin{align*} \operatorname{Var}(Z) &= \frac{1}{\sigma_Y^2} \operatorname{Var}(Y) + \frac{\rho^2}{\sigma_X^2} \operatorname{Var}(X) - 2 \frac{\rho}{\sigma_X \sigma_Y} \operatorname{Cov}(X,Y) \\ &= \frac{1}{\sigma_Y^2} \sigma_Y^2 + \frac{\rho^2}{\sigma_X^2} \sigma_X^2 - 2 \rho \cdot \rho \\ &= 1 + \rho^2 - 2 \rho^2 \\ &= 1 - \rho^2 \end{align*} $$
Since $\operatorname{Var}(Z) \ge 0$,
$$ \begin{align*} 1 - \rho^2 \ge 0 \implies& \rho^2 - 1 \le 0 \\ \implies& (\rho + 1)(\rho - 1) \le 0 \\ \implies& -1 \le \rho \le 1 \end{align*} $$
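
The identity $\operatorname{Var}(Z) = 1 - \rho^2$ can likewise be verified on data. A minimal sketch (assuming NumPy with population moments, i.e. `ddof=0`, so the sample quantities match the algebra above up to floating point):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
y = 0.7 * x + rng.normal(scale=0.5, size=x.size)

rho = np.corrcoef(x, y)[0, 1]
z = y / y.std() - rho * (x / x.std())

# Var(Z) equals 1 - rho^2 (up to floating point), and is nonnegative,
# which forces -1 <= rho <= 1.
print(z.var(), 1 - rho ** 2)
```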



  1. Hogg et al. (2013). Introduction to Mathematical Statistics (7th Edition): p104. ↩︎