

Principal Component Analysis in Statistics

Overview

Think about performing Multiple Regression Analysis $Y \gets X_{1}, \cdots, X_{p}$. Principal Component Analysis (PCA) is, in simple terms, a method of ‘restructuring’ the quantitative variables so that they are properly independent for analysis. From the perspective of multivariate data analysis, it is also significant as a means of ‘dimension reduction’, explaining a phenomenon with fewer variables.

To properly understand the theoretical derivation of principal component analysis, knowledge of Linear Algebra, and if possible Numerical Linear Algebra, is required. If it is all too confusing, at least try to read and understand Steps 3 and 4. If you are somewhat confident in Mathematical Statistics, reading the post on Principal Component Analysis in Mathematical Statistics is also a good idea.

Derivation 1

Step 1. Standardize the data with $p$ independent variables and $n$ samples

$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} $$

The model can be expressed as $Y = X \beta + \varepsilon$ using the design matrix. Standardizing the matrix $X$ into a matrix $Z$ means that, for the $j$th independent variable $X_{j}$ with sample mean $\overline{x_{j}}$ and sample standard deviation $s_{X_{j}}$, the $(i,j)$-component is

$$ \left( Z \right)_{ij} := {{ x_{ij} - \overline{x_{j}} } \over { s_{X_{j}} }} $$

Then, for the new vector of regression coefficients

$$ \Theta := \begin{bmatrix} \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{p} \end{bmatrix} $$

we obtain the design matrix equation $Y = Z \Theta + \varepsilon$ for a regression analysis without an intercept. Here $Z = \begin{bmatrix} Z_{1} & \cdots & Z_{p} \end{bmatrix}$ is the $( n \times p )$ matrix whose columns $Z_{1}, \cdots, Z_{p}$ are the standardized versions of the vectors $X_{1}, \cdots, X_{p}$.
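As a concrete illustration, the following is a minimal numpy sketch of this standardization; the data $X$, the sample size, and all variable names are hypothetical, and numpy is assumed to be available.

```python
import numpy as np

# Hypothetical data: n = 100 samples, p = 4 independent variables
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))

# Column-wise standardization: subtract the sample mean and divide by the
# sample standard deviation (ddof=1), producing the matrix Z of Step 1
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```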


Step 2.

Now consider the spectral decomposition. $Z^{T} Z$ is a symmetric matrix, and by definition $\displaystyle {{1} \over {n-1}} Z^{T} Z$ is the covariance matrix of $Z_{1}, \cdots, Z_{p}$; in particular, since $Z$ is standardized, it is also their correlation matrix. According to the spectral theorem,

$$ \begin{cases} Z^{T} Z = Q \Lambda Q^{T} \\ Q^{T} Q = Q Q^{T} = I \end{cases} $$

there exist an orthogonal matrix

$$ Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1p} \\ q_{21} & q_{22} & \cdots & q_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ q_{p1} & q_{p2} & \cdots & q_{pp} \end{bmatrix} $$

that satisfies it, and a diagonal matrix

$$ \Lambda = \text{diag} ( \lambda_{1} , \lambda_{2} , \cdots , \lambda_{p} ) = \begin{bmatrix} \lambda_{1} & 0 & \cdots & 0 \\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{p} \end{bmatrix} $$

composed of the eigenvalues of $Z^{T} Z$. For convenience, assume $\lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{p}$, with the eigenvectors (the columns of $Q$) rearranged accordingly.
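Continuing the hypothetical sketch above (repeated here so it runs on its own), the decomposition $Z^{T} Z = Q \Lambda Q^{T}$ with eigenvalues sorted in descending order might look like this; `numpy.linalg.eigh` is used because $Z^{T} Z$ is symmetric.

```python
import numpy as np

# Hypothetical standardized data, as in the sketch for Step 1
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Spectral decomposition of the symmetric matrix Z^T Z (eigh: ascending order)
eigenvalues, Q = np.linalg.eigh(Z.T @ Z)

# Reorder so that lambda_1 >= lambda_2 >= ... >= lambda_p
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]
Lambda = np.diag(eigenvalues)

# Sanity checks: Z^T Z = Q Lambda Q^T and Q^T Q = I
assert np.allclose(Z.T @ Z, Q @ Lambda @ Q.T)
assert np.allclose(Q.T @ Q, np.eye(p))
```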


Step 3. Composition of Principal Components

Since $I = QQ^{T}$,
$$ Y = Z \Theta + \varepsilon = Z Q Q^{T} \Theta + \varepsilon $$
Setting $C := ZQ$ and $\alpha := Q^{T} \Theta$ gives
$$ Y = C \alpha + \varepsilon $$
and the columns $C_{1}, \cdots, C_{p}$ of $C = \begin{bmatrix} C_{1} & \cdots & C_{p} \end{bmatrix}$ are the principal components. The $j$th principal component has the form

$$ C_{j} = q_{1j} Z_{1} + \cdots + q_{pj} Z_{p} = \sum_{i=1}^{p} q_{ij} Z_{i} $$

in other words, each principal component restructures the original (standardized) independent variables into a linear combination.
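In the same hypothetical numpy sketch, forming the principal components is just the matrix product $C = ZQ$, and each column is exactly the linear combination above.

```python
import numpy as np

# Hypothetical standardized data and spectral decomposition, as in the earlier sketches
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, Q = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]

# Principal components: C = ZQ, so C_j = q_{1j} Z_1 + ... + q_{pj} Z_p
C = Z @ Q
j = 0
assert np.allclose(C[:, j], sum(Q[i, j] * Z[:, i] for i in range(p)))
```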


Step 4.

The independence of the principal components can also be confirmed by the following calculation:
$$ \begin{align*} & Z^{T} Z = Q \Lambda Q^{T} \\ \implies& Q^{T} Z^{T} Z Q = \Lambda \\ \implies& \left( Z Q \right)^{T} \left( Z Q \right) = \Lambda \\ \implies& C^{T} C = \Lambda \end{align*} $$
In other words,
$$ C_{i}^{T} C_{j} = \begin{cases} \lambda_{j} & , i = j \\ 0 & , i \ne j \end{cases} $$
This means the principal components are mutually orthogonal, and if an eigenvalue $\lambda_{j}$ is close to $0$, then $\displaystyle C_{j} = \sum_{i=1}^{p} q_{ij} Z_{i}$ is close to the zero vector, indicating that $Z_{1}, \cdots, Z_{p}$ may have multicollinearity.
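The relation $C^{T} C = \Lambda$ can be checked numerically in the same hypothetical setup: the off-diagonal entries vanish up to floating-point error, and the diagonal holds the eigenvalues.

```python
import numpy as np

# Hypothetical standardized data, decomposition, and principal components
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, Q = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]
C = Z @ Q

# C^T C should equal the diagonal matrix of eigenvalues
assert np.allclose(C.T @ C, np.diag(eigenvalues))
```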

Limitations

Principal Component Regression $Y \gets C_{1}, \cdots, C_{p}$ circumvents the issue of multicollinearity by removing the principal components whose eigenvalues are problematically small. Moreover, since it uses significantly fewer variables than the original regression analysis, the dimension has been reduced.
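The following is a rough sketch of such a principal component regression under the same hypothetical setup; the response `y`, the number of retained components `k`, and the use of ordinary least squares are illustrative assumptions rather than a prescription.

```python
import numpy as np

# Hypothetical data and response
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=n)

# Standardize and form the principal components, as in the earlier sketches
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, Q = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]
C = Z @ Q

# Keep only the first k components, dropping those with the smallest eigenvalues
k = 2
alpha_hat, *_ = np.linalg.lstsq(C[:, :k], y - y.mean(), rcond=None)

# Coefficients expressed back in terms of the standardized variables: Theta = Q alpha
theta_hat = Q[:, :k] @ alpha_hat
```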

At first glance, principal component analysis may seem like a panacea, but that is not always the case. First, the fact that standardization is required to create $Z$ means there are many difficulties in handling qualitative variables or transformations, and the ‘restructuring’ itself makes the analysis harder to understand.

Considering that statistics is needed precisely by people who do not understand statistics, this point is quite fatal. For example, if principal component analysis is used to analyze the Korean economy, then instead of easily understandable figures like the unemployment rate $X_{2}$ or the average starting salary $X_{7}$, the results might be expressed in unfamiliar terms like a ‘comprehensive employment index’ $C_{4}$. Even the analyst, who may have produced a perfectly usable regression formula, might not grasp its true meaning, and that can lead to disaster. (In computer science, prediction and classification matter more than understanding the data, so this downside is not taken as seriously.)

Also, if none of the principal components are excluded and $Y \gets C_{1}, \cdots, C_{p}$ is used as is, it is no different from the original $Y \gets X_{1}, \cdots, X_{p}$; and excluding some of them amounts to giving up part of the original data itself. It may still be necessary to use it, but if not, there is no reason to. Even when using it, it is crucial to be fully aware of its disadvantages and limitations.

Condition Number 1

Meanwhile, from the eigenvalues obtained in the derivation above, the Condition Number, a numerical indicator for diagnosing multicollinearity, can be calculated as

$$ \kappa := \sqrt{ {{ \lambda_{1} } \over { \lambda_{p} }} } $$

Empirically, if $\kappa > 15$, the original data is presumed to have multicollinearity, although this criterion is not widely used.
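In the hypothetical numpy sketch, the condition number is a one-liner once the eigenvalues of $Z^{T} Z$ are available.

```python
import numpy as np

# Hypothetical standardized data, as in the earlier sketches
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvalues of Z^T Z in descending order
eigenvalues = np.sort(np.linalg.eigvalsh(Z.T @ Z))[::-1]

# kappa = sqrt(lambda_1 / lambda_p); kappa > 15 is the rule of thumb above
kappa = np.sqrt(eigenvalues[0] / eigenvalues[-1])
print(kappa)
```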



  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p255~257.