
Principal Component Analysis (PCA) in Mathematical Statistics

Overview

Principal Component Analysis (PCA) has many uses in statistics, such as avoiding multicollinearity in regression analysis and summarizing data, and it is also important in machine learning as a method of dimensionality reduction. This post focuses on the theoretical derivation of the principal components rather than on practical usage.

Definition 1

Principal Component Analysis

Let a random vector $\mathbf{X} = \left( X_{1} , \cdots , X_{p} \right)$ be given. Consider linear combinations of the random variables $X_{1} , \cdots , X_{p}$ $$ \mathbf{a}_{k}^{T} \mathbf{X} = a_{k1} X_{1} + \cdots + a_{kp} X_{p} = \sum_{l = 1}^{p} a_{kl} X_{l} $$ where $\mathbf{a}_{k} = \left( a_{k1} , \cdots , a_{kp} \right) \in \mathbb{R}^{p}$ is a unit vector of length $1$. The goal is first to maximize the variance of $\mathbf{a}_{1}^{T} \mathbf{X}$ $$ \text{Var} \left( \mathbf{a}_{1}^{T} \mathbf{X} \right) $$ then, subject to $\text{Cov} \left( \mathbf{a}_{1}^{T} \mathbf{X} , \mathbf{a}_{2}^{T} \mathbf{X} \right) = 0$, to maximize the variance of the second combination $\mathbf{a}_{2}^{T} \mathbf{X}$ $$ \text{Var} \left( \mathbf{a}_{2}^{T} \mathbf{X} \right) $$ and, in general, subject to $\text{Cov} \left( \mathbf{a}_{l}^{T} \mathbf{X} , \mathbf{a}_{k}^{T} \mathbf{X} \right) = 0$ for all $l < k$, to maximize the variance of the $k$th combination $\mathbf{a}_{k}^{T} \mathbf{X}$ $$ \text{Var} \left( \mathbf{a}_{k}^{T} \mathbf{X} \right) $$ Analyzing the data through the vectors $\mathbf{a}_{1} , \cdots , \mathbf{a}_{p}$ that accomplish this is called Principal Component Analysis (PCA).

Principal Components

Assume the covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ of the random vector $\mathbf{X}$ has eigenpairs $\left\{ \left( \lambda_{k} , e_{k} \right) \right\}_{k=1}^{p}$ ordered so that $\lambda_{1} \ge \cdots \ge \lambda_{p} \ge 0$, with each eigenvector normalized so that $\left\| e_{k} \right\| = 1$. The random variable $Y_{k}$ defined by the inner product of the random vector $\mathbf{X}$ with the $k$th eigenvector $e_{k}$ is called the $k$th Principal Component. $$ Y_{k} := e_{k}^{T} \mathbf{X} $$ The realization $y_{k}$ of $Y_{k}$ is called the $k$th PC Score.
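In practice $\Sigma$ is unknown, so the eigenpairs and PC scores are computed from a sample covariance matrix. The following is a minimal NumPy sketch of that recipe; the simulated data, the sample size, and the matrix `Sigma_true` are illustrative assumptions, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations of a p-dimensional random vector.
# The distribution, sample size, and covariance below are illustrative assumptions.
n, p = 500, 3
Sigma_true = np.array([[4.0, 1.5, 0.5],
                       [1.5, 2.0, 0.3],
                       [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma_true, size=n)

# Sample covariance matrix playing the role of Sigma
Sigma_hat = np.cov(X, rowvar=False)

# Eigenpairs (lambda_k, e_k), sorted so that lambda_1 >= ... >= lambda_p
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# k-th PC score of each observation: y_k = e_k^T x
Y = X @ eigvecs   # column k holds the scores of the k-th principal component

print("eigenvalues:       ", eigvals)
print("variance of scores:", Y.var(axis=0, ddof=1))
```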


Theorem

The unit vectors that maximize the variances in the definition of PCA are exactly the eigenvectors, $\mathbf{a}_{k} = e_{k}$; in other words, the maximizing linear combinations are precisely the principal components $Y_{k}$. For $k = 1 , \cdots , p$ and $i \ne j$, the variance and covariance of the principal components are as follows. $$ \begin{align*} \text{Var} \left( Y_{k} \right) =& \lambda_{k} \\ \text{Cov} \left( Y_{i}, Y_{j} \right) =& 0 \end{align*} $$

Proof

The covariance matrix $\Sigma$ can be expanded as follows. $$ \begin{align*} & \Sigma \\ =& \text{Cov} \left( \mathbf{X} \right) \\ =& \begin{pmatrix} \text{Var} \left( X_{1} \right) & \text{Cov} \left( X_{1} , X_{2} \right) & \cdots & \text{Cov} \left( X_{1} , X_{p} \right) \\ \text{Cov} \left( X_{2} , X_{1} \right) & \text{Var} \left( X_{2} \right) & \cdots & \text{Cov} \left( X_{2} , X_{p} \right) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov} \left( X_{p} , X_{1} \right) & \text{Cov} \left( X_{p} , X_{2} \right) & \cdots & \text{Var} \left( X_{p} \right) \end{pmatrix} \end{align*} $$

Properties of the Covariance Matrix: Given a constant matrix $A \in \mathbb{R}^{k \times p}$ with entries $(A)_{ij} := a_{ij}$, then $$ \text{Cov} ( A \mathbf{X}) = A \text{Cov} \left( \mathbf{X} \right) A^{T} $$
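As a quick sanity check of this property (not part of the cited statement), one can compare the sample covariance of $A \mathbf{X}$ with $A \text{Cov} \left( \mathbf{X} \right) A^{T}$ on simulated data; the matrix `A` and the distribution below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data and an arbitrary constant matrix A (both assumptions)
X = rng.multivariate_normal(np.zeros(3), np.diag([3.0, 2.0, 1.0]), size=2000)
A = np.array([[1.0, -1.0, 0.5],
              [0.0,  2.0, 1.0]])          # A in R^{k x p} with k = 2, p = 3

lhs = np.cov(X @ A.T, rowvar=False)       # Cov(AX), the covariance of the transformed sample
rhs = A @ np.cov(X, rowvar=False) @ A.T   # A Cov(X) A^T

print(np.allclose(lhs, rhs))              # True: the identity also holds for sample covariances
```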

Defining the orthogonal matrix $P := \begin{bmatrix} e_{1} & \cdots & e_{p} \end{bmatrix} \in \mathbb{R}^{p \times p}$ whose columns are the eigenvectors of $\Sigma$, the covariance of the random vector $\mathbf{Y} := P^{T} \mathbf{X}$ can be written as follows by the property above. $$ \begin{align*} & \text{Cov} \left( \mathbf{Y} \right) \\ =& \text{Cov} \left( P^{T} \mathbf{X} \right) \\ =& P^{T} \text{Cov} \left( \mathbf{X} \right) P \end{align*} $$

Expanding the first diagonal entry of this $\text{Cov} \left( \mathbf{Y} \right)$ shows that it equals the first eigenvalue, since $\Sigma e_{1} = \lambda_{1} e_{1}$ and $e_{1}^{T} e_{1} = 1$. $$ \begin{align*} \text{Var} \left( Y_{1} \right) =& e_{1}^{T} \text{Cov} \left( \mathbf{X} \right) e_{1} \\ =& e_{1}^{T} \Sigma e_{1} \\ =& e_{1}^{T} \lambda_{1} e_{1} \\ =& \lambda_{1} \end{align*} $$

Quadratic Forms and Eigenvalues of Positive-Definite Matrices: Assume the eigenpairs $\left\{ \left( \lambda_{k} , e_{k} \right) \right\}_{k=1}^{p}$ of a positive-definite matrix $A \in \mathbb{R}^{p \times p}$ are ordered so that $\lambda_{1} \ge \cdots \ge \lambda_{p} \ge 0$. Then the maximum and minimum values of the quadratic form $\mathbf{x}^{T} A \mathbf{x}$ on the unit sphere are as follows. $$ \begin{align*} \max_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} A \mathbf{x} =& \lambda_{1} & \text{, attained when } \mathbf{x} = e_{1} \\ \min_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} A \mathbf{x} =& \lambda_{p} & \text{, attained when } \mathbf{x} = e_{p} \end{align*} $$ Meanwhile, for $k = 2, \cdots , p$, the following holds. $$ \max_{\substack{\left\| \mathbf{x} \right\| = 1 \\ \mathbf{x} \perp e_{1} , \cdots , e_{k-1} }} \mathbf{x}^{T} A \mathbf{x} = \lambda_{k} \quad \text{, attained when } \mathbf{x} = e_{k} $$
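Before returning to the proof, here is a small numerical spot check of this fact, assuming an arbitrary positive-definite matrix built as $B B^{T} + 4I$; it only probes the bound with random unit vectors and is not a substitute for a proof.

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary positive-definite matrix (illustrative assumption)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)

eigvals, eigvecs = np.linalg.eigh(A)
lam_max, e_max = eigvals[-1], eigvecs[:, -1]   # largest eigenvalue lambda_1 and its eigenvector e_1

# Evaluate the quadratic form x^T A x on many random unit vectors
x = rng.standard_normal((100_000, 4))
x /= np.linalg.norm(x, axis=1, keepdims=True)
q = np.einsum("ij,jk,ik->i", x, A, x)

print(q.max() <= lam_max + 1e-9)               # True: no unit vector exceeds lambda_1
print(np.isclose(e_max @ A @ e_max, lam_max))  # True: the bound is attained at e_1
```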

By this theorem, $\lambda_{1}$ is the maximum value of the quadratic form $\mathbf{x}^{T} \Sigma \mathbf{x}$ under the constraint $\left\| \mathbf{x} \right\| = 1$, and it is attained at $\mathbf{x} = e_{1}$, so $\mathbf{a}_{1} = e_{1}$. Summarizing, $$ \text{Var} \left( Y_{1} \right) = \lambda_{1} = \max_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} \Sigma \mathbf{x} $$ By the same theorem, under the constraints $\left\| \mathbf{x} \right\| = 1$ and $\mathbf{x} \perp e_{1} , \cdots , e_{k-1}$, the maximum of the quadratic form is the $k$th eigenvalue and is attained at $\mathbf{x} = e_{k}$, so $\mathbf{a}_{k} = e_{k}$ and $$ \text{Var} \left( Y_{k} \right) = \lambda_{k} $$ Note that the orthogonality constraint $\mathbf{x} \perp e_{1} , \cdots , e_{k-1}$ matches the uncorrelatedness condition in the definition, since for $l < k$ $$ \text{Cov} \left( e_{l}^{T} \mathbf{X} , \mathbf{x}^{T} \mathbf{X} \right) = e_{l}^{T} \Sigma \mathbf{x} = \lambda_{l} e_{l}^{T} \mathbf{x} $$ which vanishes exactly when $\mathbf{x} \perp e_{l}$, provided $\lambda_{l} > 0$. Finally, for $l \ne k$, the covariance of $Y_{l}, Y_{k}$ is as follows. $$ \begin{align*} \text{Cov} \left( Y_{l} , Y_{k} \right) =& e_{l}^{T} \Sigma e_{k} \\ =& e_{l}^{T} \lambda_{k} e_{k} \\ =& \lambda_{k} e_{l}^{T} e_{k} \\ =& \lambda_{k} \cdot 0 \\ =& 0 \end{align*} $$
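The theorem can also be checked numerically, again under illustrative assumptions about the distribution and sample size: the covariance matrix of the PC scores $\mathbf{Y} = P^{T} \mathbf{X}$ should be diagonal, with the ordered eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data standing in for X (illustrative assumption)
Sigma_true = np.array([[4.0, 1.5, 0.5],
                       [1.5, 2.0, 0.3],
                       [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma_true, size=5000)

Sigma_hat = np.cov(X, rowvar=False)
eigvals, P = np.linalg.eigh(Sigma_hat)
eigvals, P = eigvals[::-1], P[:, ::-1]     # order lambda_1 >= ... >= lambda_p

# Covariance matrix of the PC scores Y = P^T X
Cov_Y = np.cov(X @ P, rowvar=False)

print(np.allclose(np.diag(Cov_Y), eigvals))                   # Var(Y_k) = lambda_k
print(np.allclose(Cov_Y, np.diag(np.diag(Cov_Y)), atol=1e-8)) # Cov(Y_i, Y_j) = 0 for i != j
```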

See Also


  1. Johnson. (2013). Applied Multivariate Statistical Analysis (6th Edition): p430~432. ↩︎