Principal Component Analysis in Statistics
Overview
Consider the situation of performing a multiple regression analysis. Principal Component Analysis (PCA) is, simply put, a method of 'restructuring' quantitative variables so that they are properly independent for analysis. From the perspective of multivariate data analysis, it is significant as a means of 'dimension reduction', explaining phenomena with fewer variables.
To properly understand the theoretical derivation of principal component analysis, knowledge of linear algebra, and if possible numerical linear algebra, is required. If the derivation is completely beyond you, at least read Steps 3 and 4 and take in their results. If you are somewhat confident in mathematical statistics, it is also worth reading the post on Principal Component Analysis in Mathematical Statistics.
Derivation
Step 1. Standardization
Data with $p$ independent variables and $n$ samples can be expressed using the design matrix $X \in \mathbb{R}^{n \times p}$. Standardizing $X$ to a matrix $Z$ means that, for the $j$th independent variable $x_j$ with sample mean $\overline{x_j}$ and sample standard deviation $s_j$, the $(i,j)$-component of $Z$ is
$$ z_{ij} := \frac{x_{ij} - \overline{x_j}}{s_j} $$
Then, for the new regression coefficients $\beta^{\ast} = \left( \beta_1^{\ast} , \cdots , \beta_p^{\ast} \right)$, we can obtain the design matrix equation for a regression analysis without an intercept as $y = Z \beta^{\ast} + \varepsilon$. Here $Z$ is the matrix consisting of the vectors $z_1 , \cdots , z_p$, the standardized versions of $x_1 , \cdots , x_p$.
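To make Step 1 concrete, here is a minimal sketch in NumPy. The data in X is made up for illustration; only the variable names X, Z, n, and p follow the derivation.

```python
import numpy as np

# Hypothetical data: n = 5 samples, p = 3 independent variables.
X = np.array([
    [1.0, 10.0, 200.0],
    [2.0, 12.0, 180.0],
    [3.0, 11.0, 190.0],
    [4.0, 15.0, 170.0],
    [5.0, 14.0, 175.0],
])
n, p = X.shape

# Standardize each column: z_ij = (x_ij - mean of column j) / s_j,
# using the sample standard deviation (ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0))          # each column mean is ~0
print(Z.std(axis=0, ddof=1))   # each column standard deviation is 1
```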
Step 2. Spectral Decomposition
$Z^{T} Z$ is a symmetric matrix, and, thinking about its definition, $\frac{1}{n-1} Z^{T} Z$ becomes the covariance matrix for $Z$. In particular, since $Z$ is a standardized matrix, it also serves as the correlation matrix. According to the spectral theorem, there exist an orthogonal matrix $Q$ and a diagonal matrix $\Lambda = \operatorname{diag} \left( \lambda_1 , \cdots , \lambda_p \right)$ composed of the eigenvalues of $Z^{T} Z$ satisfying
$$ Z^{T} Z = Q \Lambda Q^{T} $$
Now, for convenience, let us assume $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ and think of $Q = \begin{bmatrix} q_1 & \cdots & q_p \end{bmatrix}$ as reorganized accordingly.
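Continuing the sketch above, the spectral decomposition of $Z^{T} Z$ can be computed with np.linalg.eigh, which is meant for symmetric matrices; since it returns eigenvalues in ascending order, they are reordered so that $\lambda_1 \geq \cdots \geq \lambda_p$.

```python
# Z^T Z is symmetric, and Z^T Z / (n - 1) is the covariance matrix of Z;
# because Z is standardized, this is also the correlation matrix.
R = Z.T @ Z / (n - 1)

# Spectral decomposition Z^T Z = Q Lambda Q^T, sorted in descending order.
eigvals, Q = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]              # lambda_1 >= ... >= lambda_p
Q = Q[:, order]                   # orthogonal matrix of eigenvectors
Lam = np.diag(lam)

assert np.allclose(Z.T @ Z, Q @ Lam @ Q.T)
```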
Step 3. Composition of Principal Components
Since $Q Q^{T} = I$, we have
$$ Z \beta^{\ast} = Z Q Q^{T} \beta^{\ast} = P \theta $$
where we set $P := ZQ$ and $\theta := Q^{T} \beta^{\ast}$. The principal components for $Z$ are the columns $p_1 , \cdots , p_p$ of $P$. The form of the $k$th principal component is
$$ p_k = q_{1k} z_1 + q_{2k} z_2 + \cdots + q_{pk} z_p $$
a linear combination restructuring the original independent variables $z_1 , \cdots , z_p$.
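In code, the principal components of Step 3 are a single matrix product; the check below (continuing from the snippets above) confirms that each column of P really is the stated linear combination of the standardized variables.

```python
# Principal components: the columns of P = ZQ.
P = Z @ Q

# The k-th principal component is p_k = q_1k z_1 + ... + q_pk z_p.
k = 0
p_k = sum(Q[j, k] * Z[:, j] for j in range(p))
assert np.allclose(p_k, P[:, k])
```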
Step 4. Independence of Principal Components
The independence of the principal components can be confirmed by the following calculation:
$$ P^{T} P = \left( ZQ \right)^{T} ZQ = Q^{T} Z^{T} Z Q = Q^{T} Q \Lambda Q^{T} Q = \Lambda $$
In other words,
$$ p_i^{T} p_j = \begin{cases} \lambda_i & , i = j \\ 0 & , i \neq j \end{cases} $$
This means the principal components are necessarily independent (orthogonal), and if an eigenvalue $\lambda_k$ is close to $0$, it implies that $p_k = Z q_k$ is close to a zero vector, indicating that $Z$ may have multicollinearity.
■
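The calculation in Step 4 is easy to verify numerically with the objects from the sketches above:

```python
# P^T P = Lambda: distinct principal components are orthogonal,
# and the squared norm of p_k equals lambda_k.
assert np.allclose(P.T @ P, Lam)

# A near-zero lambda_k would mean p_k = Z q_k is nearly the zero vector,
# i.e. the columns of Z are nearly linearly dependent (multicollinearity).
print(lam)
```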
Limitations
Principal component regression circumvents the issue of multicollinearity by removing the principal components whose eigenvalues are problematic (near zero), as in the sketch below. Moreover, since it uses significantly fewer variables than the original regression analysis, the dimension can be said to have been reduced.
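As a rough sketch of this idea (not a complete principal component regression; the response y and the cutoff are made up for illustration), one keeps only the components with sufficiently large eigenvalues and regresses on those:

```python
# Discard principal components with problematically small eigenvalues.
# The relative cutoff 1e-8 is an arbitrary illustrative choice.
keep = lam > 1e-8 * lam.max()
P_reduced = P[:, keep]

# Hypothetical response vector; least-squares fit on the kept components.
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
theta_hat, *_ = np.linalg.lstsq(P_reduced, y, rcond=None)
```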
At first glance, principal component analysis may seem like a panacea, but that is not always the case. First, the fact that standardization is required to create $Z$ means there are many difficulties in handling qualitative variables or transformations, and the 'restructuring' process makes the analysis itself harder to understand.
Considering that statistics is needed even by those who do not understand statistics, this point is quite fatal. For example, if principal component analysis is used to analyze the Korean economy, instead of easily understood figures like the unemployment rate or average starting salary, the result might be expressed in strange terms like a 'comprehensive employment index'. Even the analyst, who may have produced a usable regression formula, might not grasp its true meaning, which could lead to disaster. (In computer science, prediction and classification matter more than understanding the data, so this downside is not taken as seriously.)
Also, if none of the principal components are excluded and $P$ is used as is, there is no difference from the original $Z$, so using principal component analysis amounts to giving up the original data itself. It may still be necessary to use it, but if not, there is no reason to. Even when using it, it is crucial to be fully aware of these disadvantages and limitations.
Condition Number
Meanwhile, the condition number, a numerical indicator for diagnosing multicollinearity, can be calculated from the eigenvalues obtained in the derivation:
$$ \kappa := \sqrt{ \frac{ \lambda_{\max} }{ \lambda_{\min} } } $$
Empirically, a large value of $\kappa$ (a common rule of thumb is $\kappa \geq 15$) is taken to indicate multicollinearity in the original data, although this indicator is not widely used.
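Given the eigenvalues already computed in the sketches above, the condition number takes one line:

```python
# Condition number kappa = sqrt(lambda_max / lambda_min).
kappa = np.sqrt(lam.max() / lam.min())
print(kappa)   # the rule of thumb flags kappa >= 15
```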