Principal Component Analysis (PCA) in Mathematical Statistics
Overview
Principal Component Analysis (PCA) has many uses in statistics, such as avoiding multicollinearity in regression analysis and summarizing data, and is also important in machine learning as a dimensionality reduction technique. This post focuses on the theoretical foundations of deriving principal components rather than on practical usage.
Definition 1
Principal Component Analysis
Suppose a random vector $\mathbf{X} = \left( X_{1} , \cdots , X_{p} \right)$ is given. Consider linear combinations of the random variables $X_{1} , \cdots , X_{p}$ $$ \mathbf{a}_{k}^{T} \mathbf{X} = a_{k1} X_{1} + \cdots + a_{kp} X_{p} = \sum_{l = 1}^{p} a_{kl} X_{l} $$ where $\mathbf{a}_{k} = \left( a_{k1} , \cdots , a_{kp} \right) \in \mathbb{R}^{p}$ denotes a unit vector of length $1$. The goal is to maximize the variance of the first combination $\mathbf{a}_{1}^{T} \mathbf{X}$ $$ \operatorname{Var} \left( \mathbf{a}_{1}^{T} \mathbf{X} \right) $$ then, subject to $\operatorname{Cov} \left( \mathbf{a}_{1}^{T} \mathbf{X} , \mathbf{a}_{2}^{T} \mathbf{X} \right) = 0$, to maximize the variance of the second combination $\mathbf{a}_{2}^{T} \mathbf{X}$ $$ \operatorname{Var} \left( \mathbf{a}_{2}^{T} \mathbf{X} \right) $$ and, in general, subject to $\operatorname{Cov} \left( \mathbf{a}_{l}^{T} \mathbf{X} , \mathbf{a}_{k}^{T} \mathbf{X} \right) = 0$ for all $l < k$, to maximize the variance of the $k$th combination $\mathbf{a}_{k}^{T} \mathbf{X}$ $$ \operatorname{Var} \left( \mathbf{a}_{k}^{T} \mathbf{X} \right) $$ Analyzing the data through the vectors $\mathbf{a}_{1} , \cdots , \mathbf{a}_{p}$ that accomplish this is called Principal Component Analysis (PCA).
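The objective in the definition can be made concrete with a small numerical sketch: for a data matrix of realizations of $\mathbf{X}$, different unit vectors $\mathbf{a}$ give different variances of $\mathbf{a}^{T} \mathbf{X}$, and PCA seeks the maximizing directions. The simulated data and candidate directions below are illustrative assumptions, not part of the definition.

```python
import numpy as np

# Simulated realizations of a p = 3 random vector (an assumed example).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]],
                            size=1000)

def projection_variance(a, X):
    """Sample variance of the linear combination a^T X for a unit vector a."""
    a = a / np.linalg.norm(a)          # enforce ||a|| = 1
    return np.var(X @ a, ddof=1)

# Different unit vectors give different variances; PCA looks for the maximizer.
for a in [np.array([1.0, 0.0, 0.0]),
          np.array([1.0, 1.0, 0.0]),
          np.array([1.0, 1.0, 1.0])]:
    print(projection_variance(a, X))
```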
Principal Components
Suppose the covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ of the random vector $\mathbf{X}$ has eigenpairs $\left\{ \left( \lambda_{k} , e_{k} \right) \right\}_{k=1}^{p}$ ordered so that $\lambda_{1} \ge \cdots \ge \lambda_{p} \ge 0$, with the eigenvectors normalized so that $\left\| e_{k} \right\| = 1$. The random variable $Y_{k}$, defined as the inner product of the random vector $\mathbf{X}$ with the $k$th eigenvector $e_{k}$, is called the $k$th Principal Component. $$ Y_{k} := e_{k}^{T} \mathbf{X} $$ The realization $y_{k}$ of $Y_{k}$ is called the $k$th PC Score.
- $X^{T}$ denotes the transpose of the matrix $X$.
- $\left\| \cdot \right\|$ is the Euclidean norm.
- $\perp$ refers to the orthogonality of vectors.
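In practice the eigenpairs are computed from a (sample) covariance matrix and the PC scores are obtained by projecting the data onto the eigenvectors. The following is a minimal sketch under the same assumed simulated data as above; the ordering step mirrors the convention $\lambda_{1} \ge \cdots \ge \lambda_{p}$.

```python
import numpy as np

# Assumed simulated data, approximately mean-zero by construction.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0],
                            [[4, 2, 0], [2, 3, 1], [0, 1, 2]],
                            size=1000)

S = np.cov(X, rowvar=False)              # sample covariance matrix (p x p)
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so lambda_1 >= ... >= lambda_p
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs                     # column k holds the k-th PC scores y_k
print(eigvals)                           # estimates of lambda_1, ..., lambda_p
print(scores[:3])                        # first few PC score vectors
```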
Theorem
The unit vectors solving the variance-maximization problem in PCA are given by the eigenvectors, $\mathbf{a}_{k} = e_{k}$. For $k = 1 , \cdots , p$ and $i \ne j$, the variances and covariances of the principal components are as follows. $$ \begin{align*} \operatorname{Var} \left( Y_{k} \right) =& \lambda_{k} \\ \operatorname{Cov} \left( Y_{i}, Y_{j} \right) =& 0 \end{align*} $$
Proof
The covariance matrix $\Sigma$ can be expanded as follows. $$ \begin{align*} & \Sigma \\ =& \operatorname{Cov} \left( \mathbf{X} \right) \\ =& \begin{pmatrix} \operatorname{Var} \left( X_{1} \right) & \operatorname{Cov} \left( X_{1} , X_{2} \right) & \cdots & \operatorname{Cov} \left( X_{1} , X_{p} \right) \\ \operatorname{Cov} \left( X_{2} , X_{1} \right) & \operatorname{Var} \left( X_{2} \right) & \cdots & \operatorname{Cov} \left( X_{2} , X_{p} \right) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov} \left( X_{p} , X_{1} \right) & \operatorname{Cov} \left( X_{p} , X_{2} \right) & \cdots & \operatorname{Var} \left( X_{p} \right) \end{pmatrix} \end{align*} $$
Properties of the Covariance Matrix: Given a constant matrix $A \in \mathbb{R}^{k \times p}$ with entries $(A)_{ij} := a_{ij}$, the following holds. $$ \operatorname{Cov} ( A \mathbf{X}) = A \operatorname{Cov} \left( \mathbf{X} \right) A^{T} $$
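This property can be checked numerically before applying it: the sample covariance transforms in exactly the same way, so the two sides agree up to floating point. The particular $A$ and distribution of $\mathbf{X}$ below are assumptions for illustration only.

```python
import numpy as np

# Numerical sanity check of Cov(AX) = A Cov(X) A^T with an assumed A and Sigma.
rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=10_000)
A = np.array([[1.0, -1.0, 0.0], [0.5, 0.5, 2.0]])   # constant 2 x 3 matrix

lhs = np.cov(X @ A.T, rowvar=False)      # sample covariance of the transformed data AX
rhs = A @ np.cov(X, rowvar=False) @ A.T  # A Cov(X) A^T
print(np.allclose(lhs, rhs))             # True: the identity holds exactly for sample covariances
```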
Define the orthogonal matrix $P := \begin{bmatrix} e_{1} & \cdots & e_{p} \end{bmatrix} \in \mathbb{R}^{p \times p}$ whose columns are the eigenvectors of $\Sigma$. By this property, the covariance of the random vector $\mathbf{Y} := P^{T} \mathbf{X}$ can be written as follows. $$ \begin{align*} & \operatorname{Cov} \left( \mathbf{Y} \right) \\ =& \operatorname{Cov} \left( P^{T} \mathbf{X} \right) \\ =& P^{T} \operatorname{Cov} \left( \mathbf{X} \right) P \end{align*} $$
Expanding the first diagonal entry of this $\operatorname{Cov} \left( \mathbf{Y} \right)$ shows that it equals the first eigenvalue, since $\Sigma e_{1} = \lambda_{1} e_{1}$ and $e_{1}^{T} e_{1} = 1$. $$ \begin{align*} \operatorname{Var} \left( Y_{1} \right) =& e_{1}^{T} \operatorname{Cov} \left( \mathbf{X} \right) e_{1} \\ =& e_{1}^{T} \Sigma e_{1} \\ =& \lambda_{1} \end{align*} $$
Quadratic Forms and Eigenvalues of Positive-Definite Matrices: Assuming the eigenpairs $\left\{ \left( \lambda_{k} , e_{k} \right) \right\}_{k=1}^{p}$ of a positive-definite matrix $A \in \mathbb{R}^{p \times p}$ are ordered so that $\lambda_{1} \ge \cdots \ge \lambda_{p} \ge 0$, the maximum and minimum values of the quadratic form $\mathbf{x}^{T} A \mathbf{x}$ on the unit sphere are as follows. $$ \begin{align*} \max_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} A \mathbf{x} =& \lambda_{1} & \text{, attained when } \mathbf{x} = e_{1} \\ \min_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} A \mathbf{x} =& \lambda_{p} & \text{, attained when } \mathbf{x} = e_{p} \end{align*} $$ Meanwhile, for $k = 2, \cdots , p$, the following holds. $$ \max_{\substack{\left\| \mathbf{x} \right\| = 1 \\ \mathbf{x} \perp e_{1} , \cdots , e_{k-1} }} \mathbf{x}^{T} A \mathbf{x} = \lambda_{k} \quad \text{, attained when } \mathbf{x} = e_{k} $$
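The unconstrained part of this result is easy to check numerically: no unit vector produces a quadratic form exceeding $\lambda_{1}$, and the bound is attained at $e_{1}$. The matrix below is an assumed example.

```python
import numpy as np

# Sketch: x^T Sigma x <= lambda_1 on the unit sphere, with equality at x = e_1.
rng = np.random.default_rng(3)
Sigma = np.array([[4.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
lam1, e1 = eigvals[-1], eigvecs[:, -1]        # eigh sorts ascending, so take the last pair

x = rng.standard_normal((3, 10_000))
x /= np.linalg.norm(x, axis=0)                # columns are random unit vectors
quad = np.einsum('ij,ik,kj->j', x, Sigma, x)  # x^T Sigma x for each column

print(quad.max() <= lam1 + 1e-12)             # no random direction exceeds lambda_1
print(np.isclose(e1 @ Sigma @ e1, lam1))      # the maximum is attained at e_1
```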
Thus $\lambda_{1}$ is the maximum value of the quadratic form $\mathbf{x}^{T} \Sigma \mathbf{x}$ under the constraint $\left\| \mathbf{x} \right\| = 1$. Summarizing, $$ \operatorname{Var} \left( Y_{1} \right) = \lambda_{1} = \max_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} \Sigma \mathbf{x} $$ By the same theorem, under the constraints $\left\| \mathbf{x} \right\| = 1$ and $\mathbf{x} \perp e_{1} , \cdots , e_{k-1}$, the variance of the $k$th principal component equals the $k$th eigenvalue. In other words, $$ \operatorname{Var} \left( Y_{k} \right) = \lambda_{k} $$ Finally, for $l \ne k$, since the eigenvectors are orthonormal and therefore $e_{l}^{T} e_{k} = 0$, the covariance of $Y_{l}$ and $Y_{k}$ is as follows. $$ \begin{align*} \operatorname{Cov} \left( Y_{l} , Y_{k} \right) =& e_{l}^{T} \Sigma e_{k} \\ =& e_{l}^{T} \lambda_{k} e_{k} \\ =& \lambda_{k} e_{l}^{T} e_{k} \\ =& \lambda_{k} \cdot 0 \\ =& 0 \end{align*} $$
■
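The theorem can also be verified numerically at the population level: with $P = \begin{bmatrix} e_{1} & \cdots & e_{p} \end{bmatrix}$, the matrix $\operatorname{Cov}(\mathbf{Y}) = P^{T} \Sigma P$ is diagonal with the eigenvalues on the diagonal, so $\operatorname{Var}(Y_{k}) = \lambda_{k}$ and $\operatorname{Cov}(Y_{i}, Y_{j}) = 0$ for $i \ne j$. The $\Sigma$ below is an assumed example.

```python
import numpy as np

# Sketch: P^T Sigma P is diagonal with the (ordered) eigenvalues on the diagonal.
Sigma = np.array([[4.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]])

eigvals, P = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]            # lambda_1 >= ... >= lambda_p
eigvals, P = eigvals[order], P[:, order]

cov_Y = P.T @ Sigma @ P                      # Cov(Y) = P^T Sigma P
print(np.allclose(cov_Y, np.diag(eigvals)))  # True: diagonal entries are the lambda_k
```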
See Also
Johnson. (2013). Applied Multivariate Statistical Analysis (6th Edition): p430–432.