
Principal Component Analysis in Statistics

Overview

Think about performing Multiple Regression Analysis $Y \gets X_{1} , \cdots, X_{p}$. Principal Component Analysis, abbreviated as PCA, is, simply put, a method of ‘restructuring’ the quantitative variables so that they are properly independent of one another for the analysis. From the perspective of multivariate data analysis, it is also significant as ‘dimension reduction’, a means of explaining a phenomenon with fewer variables.

To properly understand the theoretical derivation of principal component analysis, knowledge of Linear Algebra, and ideally Numerical Linear Algebra, is required. If the derivation is too confusing, at least try to read and understand Steps 3 and 4. If you are somewhat comfortable with Mathematical Statistics, it is also worth reading the post on Principal Component Analysis in Mathematical Statistics.

Derivation 1

Step 1. Standardize the data with $p$ independent variables and $n$ samples

$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} $$

can be expressed as $Y = X \beta + \varepsilon$ using the design matrix. Here, standardizing the matrix $X$ into a matrix $Z$ means that, for the $j$th independent variable $X_{j}$ with sample mean $\overline{ x_{j} }$ and sample standard deviation $s_{ X_{j} }$, the $(i,j)$-component of $Z$ is

$$ \left( Z \right)_{ij} := {{ x_{ij} - \overline{ x_{j} } } \over { s_{ X_{j} } }} $$

. Then, for the new vector of regression coefficients

$$ \Theta := \begin{bmatrix} \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{p} \end{bmatrix} $$

, we can obtain the design matrix equation for a regression without an intercept as $Y = Z \Theta + \varepsilon$. Here $Z = \begin{bmatrix} Z_{1} & \cdots & Z_{p} \end{bmatrix}$ is the $( n \times p )$ matrix whose columns $Z_{1} , \cdots , Z_{p}$ are the standardized versions of $X_{1} , \cdots , X_{p}$.
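
For concreteness, here is a minimal NumPy sketch of this standardization step; the data, with $n = 5$ samples and $p = 3$ variables, is made up purely for illustration, and the array names `X` and `Z` are just mnemonic choices.

```python
import numpy as np

# Made-up data: n = 5 samples, p = 3 independent variables.
# X holds the columns X_1, ..., X_p (the intercept column of ones is dropped,
# since the standardized regression has no intercept).
X = np.array([
    [1.2, 10.0, 3.1],
    [0.7, 12.5, 2.9],
    [1.9,  9.8, 3.5],
    [1.4, 11.1, 3.0],
    [0.9, 13.0, 2.7],
])

# Column-wise sample mean and sample standard deviation (ddof=1),
# matching (Z)_{ij} = (x_{ij} - mean_j) / s_{X_j}.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0))          # each column has mean ~0
print(Z.std(axis=0, ddof=1))   # each column has sample standard deviation 1
```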


Step 2. Spectral Decomposition of $Z^{T} Z$

Since $Z^{T} Z$ is a symmetric matrix, consider its spectral decomposition. By definition, $\displaystyle {{1} \over {n-1}} Z^{T} Z$ is the covariance matrix of $Z_{1} , \cdots , Z_{p}$; in particular, since $Z$ is standardized, it is also their correlation matrix. By the Spectral Theorem

$$ \begin{cases} Z^{T} Z = Q \Lambda Q^{T} \\ Q^{T} Q = Q Q^{T} = I \end{cases} $$

, there exist an orthogonal matrix

$$ Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1p} \\ q_{21} & q_{22} & \cdots & q_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ q_{p1} & q_{p2} & \cdots & q_{pp} \end{bmatrix} $$

that satisfies it, and a diagonal matrix

$$ \Lambda = \text{diag} ( \lambda_{1} , \lambda_{2} , \cdots , \lambda_{p} ) = \begin{bmatrix} \lambda_{1} & 0 & \cdots & 0 \\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{p} \end{bmatrix} $$

composed of the eigenvalues of $Z^{T} Z$. For convenience, assume the eigenvalues are ordered so that $\lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{p}$, and that the columns of $Q$ are arranged accordingly.
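
Continuing the same sketch, the spectral decomposition of $Z^{T} Z$ can be computed numerically; `np.linalg.eigh` returns eigenvalues in ascending order, so they are flipped to match the convention $\lambda_{1} \ge \cdots \ge \lambda_{p}$ (the names `lam` and `Q` are my own).

```python
# Spectral decomposition Z^T Z = Q Λ Q^T for the standardized matrix Z above.
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]           # λ_1 ≥ λ_2 ≥ ... ≥ λ_p
Q = eigvecs[:, order]          # columns of Q rearranged to match

# Sanity checks: Q is orthogonal and Z^T Z = Q Λ Q^T.
print(np.allclose(Q.T @ Q, np.eye(Q.shape[1])))
print(np.allclose(Z.T @ Z, Q @ np.diag(lam) @ Q.T))

# (Z^T Z) / (n - 1) coincides with the correlation matrix of the columns of Z.
print(np.allclose(Z.T @ Z / (Z.shape[0] - 1), np.corrcoef(Z, rowvar=False)))
```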


Step 3. Construction of the Principal Components

Since $I = QQ^{T}$, $$ Y = Z \Theta + \varepsilon = Z Q Q^{T} \Theta + \varepsilon $$ and setting $C := ZQ $ and $\alpha := Q^{T} \Theta$ gives $$ Y = C \alpha + \varepsilon $$ The columns $C_{1} , \cdots , C_{p}$ of $C = \begin{bmatrix} C_{1} & \cdots & C_{p} \end{bmatrix}$ are called the Principal Components. The $j$th principal component has the form

$$ C_{j} = q_{1j} Z_{1} + \cdots + q_{pj} Z_{p} = \sum_{i=1}^{p} q_{ij} Z_{i} $$

, that is, a linear combination that restructures the original standardized independent variables.
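
In the same sketch, the principal components are simply the columns of $C = ZQ$, and the check below confirms that the first column matches the linear-combination form of $C_{j}$ above.

```python
# The principal components are the columns of C = ZQ.
C = Z @ Q

# Verify C_j = q_{1j} Z_1 + ... + q_{pj} Z_p for, say, j = 0 (first component).
j = 0
manual = sum(Q[i, j] * Z[:, i] for i in range(Z.shape[1]))
print(np.allclose(C[:, j], manual))
```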


Step 4. Independence of the Principal Components

The independence of the principal components can also be confirmed by the following calculation: $$ \begin{align*} & Z^{T} Z = Q \Lambda Q^{T} \\ \implies& Q^{T} Z^{T} Z Q = \Lambda \\ \implies& \left( Z Q \right) ^{T} \left( Z Q \right) = \Lambda \\ \implies& C^{T} C = \Lambda \end{align*} $$ In other words, $$ C_{i}^{T} C_{j} = \begin{cases} \lambda_{j} & , i=j \\ 0 & , i \ne j \end{cases} $$ This means the principal components are necessarily orthogonal to one another, and if an eigenvalue $\lambda_{j}$ is close to $0$, then $\displaystyle C_{j} = \sum_{i=1}^{p} q_{ij} Z_{i}$ is close to the zero vector, indicating that $Z_{1} , \cdots , Z_{p}$ may suffer from multicollinearity.
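
This orthogonality $C^{T} C = \Lambda$ can also be verified numerically in the running sketch, and the sorted eigenvalues show whether any component is close to the zero vector.

```python
# C^T C should be the diagonal matrix Λ: distinct principal components
# are orthogonal, and each C_j has squared length λ_j.
print(np.allclose(C.T @ C, np.diag(lam)))

# A λ_j near 0 would mean C_j is close to the zero vector, hinting at
# multicollinearity among Z_1, ..., Z_p.
print(lam)
```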

Limitations

Principal Component Regression Analysis $Y \gets C_{1} , \cdots , C_{p}$ circumvents the issue of multicollinearity by removing the principal components whose eigenvalues are problematically small. Moreover, since it uses far fewer variables than the original regression, the dimension can be said to have been reduced.
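
As a rough, purely illustrative continuation of the sketch (the response `y` and the cutoff `k` below are made up, not taken from the text), such a regression might keep only the leading components and fit them by least squares.

```python
# Keep only the first k principal components (largest eigenvalues)
# and regress the (centered) response on them.
y = np.array([2.3, 1.9, 3.1, 2.6, 1.7])   # made-up response, n = 5
k = 2                                     # illustrative choice of components kept
C_k = C[:, :k]

# No intercept is needed because the columns of Z (hence of C) are centered;
# centering y plays the same role for the response.
alpha_k, *_ = np.linalg.lstsq(C_k, y - y.mean(), rcond=None)
print(alpha_k)
```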

At first glance, principal component analysis may seem like a panacea, but it is not. First, the fact that standardization is required to create $Z$ makes it difficult to handle qualitative variables or variable transformations, and the ‘restructuring’ itself makes the analysis harder to interpret.

Considering that statistics must also serve people who do not understand statistics, this drawback is quite serious. For example, if principal component analysis is used to analyze the Korean economy, then instead of easily understood figures such as the unemployment rate $X_{2}$ or the average starting salary $X_{7}$, the results may be expressed in unfamiliar terms such as a ‘comprehensive employment index’ $C_{4}$. Even the analyst, who may have produced a usable regression formula, might not grasp its true meaning, which can lead to disaster. (In computer science, prediction and classification matter more than understanding the data, so this downside is not taken as seriously.)

Also, if none of the principal components are excluded and $Y \gets C_{1} , \cdots , C_{p}$ is used as is, the fit is no different from the original $Y \gets X_{1} , \cdots , X_{p}$, which amounts to giving up the original data for nothing. It may still be necessary to use it, but if not, there is no reason to. Even when using it, it is crucial to be fully aware of its disadvantages and limitations.

Condition Number 1

Meanwhile, from the eigenvalues obtained in the derivation, the Condition Number, a numerical indicator for diagnosing multicollinearity,

$$ \kappa := \sqrt{ {{ \lambda_{1} } \over { \lambda_{p} }} } $$

, can be calculated. Empirically, if $\kappa > 15$, the original data is presumed to suffer from multicollinearity, although this criterion is not widely used.
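
In the running sketch, the condition number comes straight from the sorted eigenvalues `lam`.

```python
# Condition number κ = sqrt(λ_1 / λ_p); the rule of thumb above flags κ > 15.
kappa = np.sqrt(lam[0] / lam[-1])
print(kappa, kappa > 15)
```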

See Also


  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p255~257. ↩︎ ↩︎