

Principal Component Analysis (PCA) in Mathematical Statistics

Overview

Principal Component Analysis (PCA) has many uses in statistics, such as avoiding multicollinearity in regression analysis and summarizing data, and it is also important in machine learning as a method of dimensionality reduction. This post focuses on the theoretical derivation of the principal components rather than on practical usage.

Definition 1

Principal Component Analysis

Suppose a random vector $\mathbf{X} = \left( X_{1} , \cdots , X_{p} \right)$ is given. Consider linear combinations of the random variables $X_{1} , \cdots , X_{p}$
$$ \mathbf{a}_{k}^{T} \mathbf{X} = a_{k1} X_{1} + \cdots + a_{kp} X_{p} = \sum_{l = 1}^{p} a_{kl} X_{l} $$
where $\mathbf{a}_{k} = \left( a_{k1} , \cdots , a_{kp} \right) \in \mathbb{R}^{p}$ is a unit vector of length $1$. The goal is first to maximize the variance of $\mathbf{a}_{1}^{T} \mathbf{X}$,
$$ \operatorname{Var} \left( \mathbf{a}_{1}^{T} \mathbf{X} \right) $$
then, subject to $\operatorname{Cov} \left( \mathbf{a}_{1}^{T} \mathbf{X} , \mathbf{a}_{2}^{T} \mathbf{X} \right) = 0$, to maximize the variance of $\mathbf{a}_{2}^{T} \mathbf{X}$,
$$ \operatorname{Var} \left( \mathbf{a}_{2}^{T} \mathbf{X} \right) $$
and, in general, subject to $\operatorname{Cov} \left( \mathbf{a}_{l}^{T} \mathbf{X} , \mathbf{a}_{k}^{T} \mathbf{X} \right) = 0$ for all $l < k$, to maximize the variance of the $k$th combination $\mathbf{a}_{k}^{T} \mathbf{X}$,
$$ \operatorname{Var} \left( \mathbf{a}_{k}^{T} \mathbf{X} \right) $$
Analyzing the data through the vectors $\mathbf{a}_{1} , \cdots , \mathbf{a}_{p}$ that accomplish this is known as Principal Component Analysis (PCA).
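To make the objective concrete, here is a minimal numerical sketch (assuming NumPy and an arbitrarily chosen covariance matrix, not part of the definition above; the helper name `proj_var` is purely illustrative) comparing the variance of the projection $\mathbf{a}^{T} \mathbf{X}$ across different unit vectors $\mathbf{a}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population covariance for X = (X1, X2, X3); any positive-definite matrix works.
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=100_000)

def proj_var(a):
    """Sample variance of the projection a^T X, with a rescaled to unit length."""
    a = np.asarray(a, dtype=float)
    a = a / np.linalg.norm(a)
    return np.var(X @ a)

print(proj_var([1, 0, 0]))                  # variance along a coordinate axis
print(proj_var(rng.standard_normal(3)))     # variance along a random direction
# PCA looks for the unit vector a_1 that makes this projected variance as large as possible.
```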

Principal Components

Suppose the covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$ of the random vector $\mathbf{X}$ has eigenpairs $\left\{ \left( \lambda_{k} , e_{k} \right) \right\}_{k=1}^{p}$ ordered so that $\lambda_{1} \ge \cdots \ge \lambda_{p} \ge 0$, with each eigenvector normalized so that $\left\| e_{k} \right\| = 1$. The random variable $Y_{k}$, defined as the inner product of the random vector $\mathbf{X}$ with the $k$th eigenvector $e_{k}$, is called the $k$th Principal Component.
$$ Y_{k} := e_{k}^{T} \mathbf{X} $$
The realization $y_{k}$ of $Y_{k}$ is called the $k$th PC Score.
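In practice the eigenpairs are taken from a sample covariance matrix. A minimal sketch (again assuming NumPy and simulated data; the covariance matrix below is a made-up example) of computing the PC scores $y_{k} = e_{k}^{T} \mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5_000)

S = np.cov(X, rowvar=False)                 # sample covariance matrix, an estimate of Sigma
eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # reorder so that lambda_1 >= ... >= lambda_p
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Xc = X - X.mean(axis=0)                     # center the observations
scores = Xc @ eigvecs                       # column k holds the k-th PC scores y_k = e_k^T x
print(eigvals)                              # estimated lambda_1 >= lambda_2 >= lambda_3
print(scores[:3])                           # PC scores of the first three observations
```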


Theorem

The unit vectors that solve the maximization problem above are the eigenvectors $\mathbf{a}_{k} = e_{k}$. For $k = 1 , \cdots , p$ and $i \ne j$, the variance and covariance of the principal components are as follows.
$$ \begin{align*} \operatorname{Var} \left( Y_{k} \right) =& \lambda_{k} \\ \operatorname{Cov} \left( Y_{i}, Y_{j} \right) =& 0 \end{align*} $$
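As a quick numerical check of the theorem (a sketch under the same assumed NumPy setup, not taken from the reference), the sample variances of the PC scores should match the eigenvalues of the sample covariance matrix, and their sample covariances should be essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = (X - X.mean(axis=0)) @ eigvecs          # Y_k = e_k^T X for every observation
cov_Y = np.cov(Y, rowvar=False)

print(np.diag(cov_Y))                       # matches eigvals: Var(Y_k) = lambda_k
print(eigvals)
print(np.abs(cov_Y - np.diag(np.diag(cov_Y))).max())  # ~ 0: Cov(Y_i, Y_j) = 0 for i != j
```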

Proof

The covariance matrix $\Sigma$ can be written out as follows.
$$ \begin{align*} \Sigma =& \operatorname{Cov} \left( \mathbf{X} \right) \\ =& \begin{pmatrix} \operatorname{Var} \left( X_{1} \right) & \operatorname{Cov} \left( X_{1} , X_{2} \right) & \cdots & \operatorname{Cov} \left( X_{1} , X_{p} \right) \\ \operatorname{Cov} \left( X_{2} , X_{1} \right) & \operatorname{Var} \left( X_{2} \right) & \cdots & \operatorname{Cov} \left( X_{2} , X_{p} \right) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov} \left( X_{p} , X_{1} \right) & \operatorname{Cov} \left( X_{p} , X_{2} \right) & \cdots & \operatorname{Var} \left( X_{p} \right) \end{pmatrix} \end{align*} $$

Properties of the Covariance Matrix: For a constant matrix $A \in \mathbb{R}^{k \times p}$ with entries $(A)_{ij} := a_{ij}$,
$$ \operatorname{Cov} ( A \mathbf{X}) = A \operatorname{Cov} \left( \mathbf{X} \right) A^{T} $$
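This property is easy to confirm numerically; the following sketch (assuming NumPy, with an arbitrary $2 \times 3$ matrix $A$) checks that the sample covariance of $A\mathbf{X}$ equals $A \operatorname{Cov}(\mathbf{X}) A^{T}$.

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
A = rng.standard_normal((2, 3))             # an arbitrary constant matrix A in R^{2 x 3}

lhs = np.cov(X @ A.T, rowvar=False)         # Cov(AX): each observation x is mapped to Ax
rhs = A @ np.cov(X, rowvar=False) @ A.T     # A Cov(X) A^T
print(np.allclose(lhs, rhs))                # True: the identity holds exactly for sample covariances
```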

Define the orthogonal matrix $P := \begin{bmatrix} e_{1} & \cdots & e_{p} \end{bmatrix} \in \mathbb{R}^{p \times p}$ whose columns are the eigenvectors of $\Sigma$. By the property above, the covariance of the random vector $\mathbf{Y} := P^{T} \mathbf{X}$ can be written as follows.
$$ \begin{align*} \operatorname{Cov} \left( \mathbf{Y} \right) =& \operatorname{Cov} \left( P^{T} \mathbf{X} \right) \\ =& P^{T} \operatorname{Cov} \left( \mathbf{X} \right) P \end{align*} $$
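A short numerical illustration (assuming NumPy; the population covariance is a made-up example) that $P^{T} \operatorname{Cov}(\mathbf{X}) P$ is indeed diagonal, with the eigenvalues on the diagonal:

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=50_000)

S = np.cov(X, rowvar=False)
eigvals, P = np.linalg.eigh(S)              # columns of P are orthonormal eigenvectors of S
cov_Y = P.T @ S @ P                         # Cov(Y) = P^T Cov(X) P

print(np.round(cov_Y, 6))                   # diagonal matrix
print(eigvals)                              # its diagonal entries are the eigenvalues of S
```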

Expanding the first diagonal component of this $\operatorname{Cov} \left( \mathbf{Y} \right)$ shows that it equals the first eigenvalue, since $\Sigma e_{1} = \lambda_{1} e_{1}$ and $e_{1}^{T} e_{1} = 1$.
$$ \begin{align*} \operatorname{Var} \left( Y_{1} \right) =& e_{1}^{T} \operatorname{Cov} \left( \mathbf{X} \right) e_{1} \\ =& e_{1}^{T} \Sigma e_{1} \\ =& \lambda_{1} \end{align*} $$

Quadratic Forms and Eigenvalues of Positive-Definite Matrices: Suppose the eigenpairs $\left\{ \left( \lambda_{k} , e_{k} \right) \right\}_{k=1}^{p}$ of a positive-definite matrix $A \in \mathbb{R}^{p \times p}$ are ordered so that $\lambda_{1} \ge \cdots \ge \lambda_{p} \ge 0$. Then the maximum and minimum values of the quadratic form $\mathbf{x}^{T} A \mathbf{x}$ on the unit sphere are as follows.
$$ \begin{align*} \max_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} A \mathbf{x} =& \lambda_{1} & \text{, attained when } \mathbf{x} = e_{1} \\ \min_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} A \mathbf{x} =& \lambda_{p} & \text{, attained when } \mathbf{x} = e_{p} \end{align*} $$
Meanwhile, for $k = 2, \cdots , p-1$, the following holds.
$$ \max_{\substack{\left\| \mathbf{x} \right\| = 1 \\ \mathbf{x} \perp e_{1} , \cdots , e_{k-1} }} \mathbf{x}^{T} A \mathbf{x} = \lambda_{k} \quad \text{, attained when } \mathbf{x} = e_{k} $$
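The following sketch (assuming NumPy; the positive-definite matrix is a made-up example) illustrates the first claim by evaluating the quadratic form at many random unit vectors: none exceeds $\lambda_{1}$, and the bound is attained at $e_{1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[4.0, 1.5, 0.5],
              [1.5, 2.0, 0.3],
              [0.5, 0.3, 1.0]])             # a made-up positive-definite matrix

eigvals, eigvecs = np.linalg.eigh(A)
lam_max, e_max = eigvals[-1], eigvecs[:, -1]

# Evaluate the quadratic form x^T A x at many random points on the unit sphere.
x = rng.standard_normal((100_000, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)
q = np.einsum('ij,jk,ik->i', x, A, x)

print(q.max(), "<=", lam_max)               # no random direction exceeds lambda_1
print(e_max @ A @ e_max)                    # the maximum lambda_1 is attained at x = e_1
```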

Thus this $\lambda_{1}$ is the maximum value of the quadratic form $\mathbf{x}^{T} \Sigma \mathbf{x}$ under the constraint $\left\| \mathbf{x} \right\| = 1$. Summarizing,
$$ \operatorname{Var} \left( Y_{1} \right) = \lambda_{1} = \max_{\left\| \mathbf{x} \right\| = 1} \mathbf{x}^{T} \Sigma \mathbf{x} $$
By the same theorem, under the constraints $\left\| \mathbf{x} \right\| = 1$ and $\mathbf{x} \perp e_{1} , \cdots , e_{k-1}$, the variance $\operatorname{Var} \left( Y_{k} \right)$ of the $k$th principal component equals the $k$th eigenvalue $\lambda_{k}$. In other words,
$$ \operatorname{Var} \left( Y_{k} \right) = \lambda_{k} $$
Finally, since the eigenvectors of $\Sigma$ are orthogonal, $e_{l}^{T} e_{k} = 0$ for $l \ne k$, and the covariance of $Y_{l}$ and $Y_{k}$ is as follows.
$$ \begin{align*} \operatorname{Cov} \left( Y_{l} , Y_{k} \right) =& e_{l}^{T} \Sigma e_{k} \\ =& e_{l}^{T} \lambda_{k} e_{k} \\ =& \lambda_{k} e_{l}^{T} e_{k} \\ =& \lambda_{k} \cdot 0 \\ =& 0 \end{align*} $$

See Also


  1. Johnson. (2013). Applied Multivariate Statistical Analysis (6th Edition): p430~432. ↩︎