Regression Model Diagnostics
Necessity
Simple regression involves only one independent variable and one dependent variable, so the analysis is $2$-dimensional and it is easy to confirm visually whether it was carried out properly. Multiple regression, however, takes place in more than $3$ dimensions, where the data can no longer be drawn, making it hard to verify that the analysis is sound. An analysis can pass hypothesis testing even though it violates the assumptions of regression analysis, in which case the analysis is essentially wrong.
Errors in analysis arise mainly because (1) the data do not fit the linear model, or (2) the results of the analysis diverge significantly from what is actually known about the data. Model diagnostics are conducted to check whether the data fit the linear model.
Diagnostic Methods
That the data do not fit the linear model simply means the data are not aligned along a straight line. Whether they fit is judged by inspecting standardized residual plots, which is the core of model diagnostics. This method of residual analysis is an ingenious device for sidestepping the impossibility of drawing lines in higher dimensions; one could fairly say that this is the very reason residuals are computed in the first place.
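As a minimal sketch of this idea (R's built-in cars data is used here purely for illustration and is not part of the original code), the residuals are standardized and plotted against the fitted values, which collapses a regression in any number of dimensions into a single two-dimensional picture:

out <- lm(dist ~ speed, data = cars)  # any lm fit, simple or multiple, works the same way
plot(fitted(out), rstandard(out),     # standardized residuals against fitted values
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)                # the points should scatter symmetrically around 0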
If the following four conditions are met in the residual plots, the model diagnostic is considered to have passed:
- (i) Linearity: The residuals should be symmetrically distributed around $0$.
- This assumption is intrinsic to the nature of regression analysis, which aims to find a straight line; if linearity is not met, the exercise becomes pointless. In practice linearity is usually satisfied, since regression analysis is normally applied only when a linear relationship is already presumed.
- (ii) Homoscedasticity: The residuals should have a constant spread.
- If the variance suddenly shrinks or grows over certain intervals, the data may not have been generated by a single process; possibilities such as different data collectors or measurement errors need to be considered. If the variance steadily increases or decreases toward one end, variable transformation can provide a partial remedy.
- (iii) Independence: There should be no pattern in the residuals.
- A pattern in the residuals suggests that the errors are not completely random, contradicting an assumption of regression analysis. A lack of independence implies some unknown rule at work, such as autocorrelation; in that case it is better to reach for a more appropriate tool such as time series analysis than to attempt convoluted fixes. If the problem is severe it is apparent at a glance, and if not, it may not be a significant concern. Be careful not to misuse the Durbin-Watson test for checking independence: it is designed to detect autocorrelation between adjacent residuals (lag $1$), not to confirm independence in general (see the sketch after this list). One should not blindly trust independence just because the Durbin-Watson test passes when there is an apparent trend.
- (iv) Normality: The residuals should appear to follow a standard normal distribution.
- Unlike the other assumptions, normality can be diagnosed objectively with tests such as the Shapiro-Wilk test or the Jarque-Bera test (see the sketch after this list). The matter is not always straightforward, however, mainly because outliers often have a large impact on normality. If the analyst can directly observe and explain the phenomena behind the outliers, they are not a major problem, and one should not be hasty in removing them. For instance, with $300$ samples, roughly $3$ observations falling outside the $\pm 3 \sigma$ range is considered normal. Too many outliers are a problem, but too few can also be a sign that the data do not truly follow a normal distribution.
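A sketch of the objective checks mentioned above, assuming the same model as in the Code section and the lmtest and tseries packages (the packages are an assumption of this sketch, not part of the original code):

out <- lm(rating ~ ., data = attitude)       # same model as in the Code section below
shapiro.test(rstandard(out))                 # (iv) normality of the standardized residuals
tseries::jarque.bera.test(residuals(out))    # alternative normality test
lmtest::dwtest(rating ~ ., data = attitude)  # detects lag-1 autocorrelation only; passing it does not prove independence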
These four conditions are not listed in random order but in order of importance, a fact that can be gleaned from the theoretical derivation of the hypothesis tests for the regression coefficients. In actual statistical analysis the data rarely present themselves neatly, and certain conditions must sometimes be compromised; in such cases some deviation from normality, whether from a somewhat high number of outliers or a slight bias, can be tolerated.
A significant part of model diagnostics relies on visual inspection, requiring a thorough understanding of the data. Identifying the incorrect parts is the first issue; how to address them is the second. The best way to develop this skill is to engage in as many real analyses as possible, exposing oneself to a variety of cases.
Code
Below is the R code that outputs the standardized residual plot.
out <- lm(rating ~ ., data = attitude)  # regress rating on every other variable in the built-in attitude data
summary(out)
win.graph(5, 5)  # opens a 5 x 5 inch window on Windows; dev.new() works elsewhere
plot(rstudent(out), main = "Standardized Residuals")  # studentized residual plot
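For reference, R's own plot method for fitted lm objects draws the standard diagnostic panels (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage) in a single call:

par(mfrow = c(2, 2))  # arrange the four diagnostic panels in a 2 x 2 grid
plot(out)             # built-in diagnostics for the model fitted above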