How to Perform Principal Component Regression in R
Overview
Principal Component Regression (PCR) combines Principal Component Analysis (PCA) with multiple regression: the principal components derived from PCA are used as new independent variables in the regression. PCA on its own rarely answers a statistical question; its practical value usually emerges when its components feed into a regression analysis.
Practice
(Following the method to detect multicollinearity)
Although generating principal components involves complex computations, including matrix decomposition, in R this can be done easily with the prcomp() function.
- (1) Rotation: the coefficients (loadings) that are multiplied with the original independent variables to form each principal component. For instance, PC1 is obtained by multiplying population by $-0.098$, white by $0.056$, …, and poverty by $-0.187$, then summing the results. Depending on the data, the rotation matrix can also show at a glance which principal component holds much of the information about which variable.
- (2) Importance of components: literally the importance of each principal component. The higher the Proportion of Variance, the larger the share of the data's variability that component explains, and hence the more important it is. This is visualized by the Scree Plot on the right; a scree plot can be created simply by passing the PCA result to the plot() function, and it helps decide how many principal components to use.
- (3) Elbow Point: the point in the Scree plot where the slope changes abruptly. Principal components beyond this point yield diminishing returns and can generally be dropped. In this example, it would be advisable to use PC1, PC2, and PC3.
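The hand multiplication described in (1) can be verified directly in R. The sketch below uses the built-in mtcars data as a stand-in for the example's variables (the MplsDemo loadings quoted above are not reproduced here); it confirms that the scores prcomp() reports are exactly the scaled variables multiplied by the rotation matrix.

```r
# Minimal sketch with stand-in data (built-in mtcars, not the
# MplsDemo example above): reconstruct PC1 by hand.
X <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Scale each variable, multiply by its PC1 loading, and sum:
pc1_manual <- as.matrix(scale(X)) %*% pca$rotation[, 1]

# The hand-computed scores match the scores prcomp() stores in pca$x.
all.equal(as.numeric(pc1_manual), as.numeric(pca$x[, 1]))
```

The same check works for any other component by swapping the column index in pca$rotation and pca$x.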
However, the real challenge lies in incorporating this into regression analysis. Building the PCs by multiplying the independent variables by the rotation matrix, running the regression on them, and then back-transforming the result into a formula in the original independent variables is quite tedious. Instead, let's use the pcr() function from the pls package.
The output of the summary() function shows how well collegeGrad is explained as the number of principal components used increases.
As the principal component analysis suggested, $3$ principal components are sufficient. Running the jack.test() function with the ncomp option set to 3 not only reports the regression coefficients in terms of the original independent variables rather than the principal components, but also performs t-tests on them at the same time.
Additionally, looking at the residual plot on the right, there don’t seem to be any significant issues.
win.graph(4,4); plot(scale(PCRout$residuals[,,3]),main="Residual")
One thing to watch when producing the residual plot: in the object returned by the pcr() function, $residuals is a three-dimensional array that stores the residuals separately for each number of principal components used. In this example, slicing with [,,3] references the PCR residuals from the fit that uses up to three principal components; if the analysis used k principal components, slice with [,,k].
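To make the [,,k] slicing concrete, the sketch below builds a toy three-dimensional array with the same layout that $residuals uses (observations × responses × number of components; the sizes are made up for illustration, not taken from the MplsDemo fit) and pulls out the residual vector for a three-component fit.

```r
# Toy stand-in for PCRout$residuals: dimensions are
# [observation, response, number of components used].
# (Sizes here are illustrative only.)
set.seed(1)
res <- array(rnorm(50 * 1 * 6), dim = c(50, 1, 6))

# Residuals from the fit that uses up to 3 principal components:
res3 <- res[, , 3]
length(res3)   # one residual per observation: 50
```

The leading two indices are left empty so that all observations (and the single response) are kept; only the component dimension is fixed.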
Code
Here is the example code.
install.packages('pls')   # provides pcr() and jack.test()
install.packages('car')   # provides the MplsDemo data set
library(pls)
library(car)
X<-MplsDemo[,-1]          # drop the neighborhood name column
Y<-X[,7]; X<-X[,-7]       # split off collegeGrad as the response
PCAout<-prcomp(X,center=T,scale.=T); PCAout; summary(PCAout)
win.graph(3,3); plot(PCAout, type='l',main="screeplot")
PCRout<-pcr(collegeGrad~.,data=MplsDemo[,-1],validation="LOO",jackknife=T)
summary(PCRout)
jack.test(PCRout,ncomp=3)  # coefficients and t-tests using 3 components
win.graph(4,4); plot(scale(PCRout$residuals[,,3]),main="Residual")