
How to Interpret Multiple Regression Analysis Results in R

Data Exploration

tail(attitude)

In R, let’s load the built-in data attitude and check it using the tail() function. We are interested in performing multiple regression analysis on this data.

(Screenshot: the last six rows of the attitude data shown by tail(attitude))

We are interested in how the other independent variables affect the rating, which is our dependent variable. It’s difficult to see if there is a linear relationship between rating and the other variables just by looking at the data, so let’s draw a graph to check.

win.graph()
plot(attitude)

Simply passing the data frame to the plot() function outputs a scatterplot matrix comparing every pair of variables.

(Screenshot: scatterplot matrix produced by plot(attitude))

At first glance, rating seems to have a clear linear relationship with complaints. Others such as learning and raises also appear roughly linear, but they are more scattered than complaints. Just as in simple regression analysis, we can pass a linear model to the lm() function and view the results with the summary() function.
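If the scatterplot matrix is hard to judge by eye, the pairwise correlations can also be checked numerically. The short sketch below is an assumed follow-up, not part of the original post; it only uses the base cor() function on the same data.

# Correlation of rating with every variable (rounded for readability);
# complaints should show the strongest linear association with rating.
round(cor(attitude)["rating", ], 2)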

Results Interpretation

In the code below, the two lm() calls are exactly the same expression. The dot (.) in the model formula means ‘all other variables’. [ NOTE: It’s also possible to write rating~.-privileges to exclude only privileges; a short example of this follows the code below. ]

# Fit the full model by listing every predictor explicitly ...
out <- lm(rating ~ complaints + privileges + learning
          + raises + critical + advance, data = attitude)
# ... or, equivalently, using the dot shorthand for 'all other variables'
out <- lm(rating ~ ., data = attitude)

summary(out)
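For reference, here is a quick sketch of the exclusion syntax mentioned in the note above; out2 is a hypothetical name and this fit is not used in the rest of the post.

# Drop only privileges from 'all other variables' with the minus sign
out2 <- lm(rating ~ . - privileges, data = attitude)
summary(out2)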

(Screenshot: summary(out) output for the full model)

  • Where a section below says ‘it’s okay not to know’, that doesn’t mean you really don’t need to know it; it means these aren’t the indicators to focus on first when reading results in a hurry.

(1) Residuals

It’s fine to just check whether the quartiles appear roughly symmetric around the median. Since model diagnosis cannot be completed with these percentiles alone, they aren’t very meaningful on their own.
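The numbers in the Residuals block are simply the quantiles of the model residuals; they can be reproduced directly, as in this sketch (assuming out from above).

# Min, 1Q, Median, 3Q, Max of the residuals, as printed by summary(out)
quantile(resid(out))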

(2) Estimate

These are the regression coefficients for each variable: (Intercept) is the $y$-intercept of the regression line, and the rest are the change in the response per unit change in the corresponding variable. Unlike simple regression analysis, because there are several independent variables, which terms remain in the model depends on the significance level. At the $5\%$ significance level, the null hypothesis that the regression coefficient of learning is $0$ is not rejected, so the model reduces to $$ \text{(rating)} = 0.61319 \cdot \text{(complaints)} + \varepsilon $$ whereas at the $10\%$ significance level learning is retained and the model can be reduced as follows: $$ \text{(rating)} = 0.61319 \cdot \text{(complaints)} + 0.32033 \cdot \text{(learning)} + \varepsilon $$
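As a sketch (these refits are not in the original post), the reduced models above can be refit directly; note that refitting changes the coefficient values from the full-model estimates quoted above.

# Model kept at the 5% level: only complaints
fit_5  <- lm(rating ~ complaints, data = attitude)
# Model kept at the 10% level: complaints and learning
fit_10 <- lm(rating ~ complaints + learning, data = attitude)
coef(fit_5)
coef(fit_10)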

(3) Std. Error

It’s okay not to know. It’s the Standard Error of the estimate, which can be used to find the confidence interval of the regression coefficient.
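For example, confidence intervals built from the estimates and their standard errors can be obtained directly with confint().

# 95% confidence intervals for each regression coefficient
confint(out, level = 0.95)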

(4) t-value

It’s okay not to know. This is the value obtained by dividing the estimate by its standard error, which serves as a test statistic following the t-distribution with $n-p-1$ degrees of freedom under the null hypothesis. It is used to test whether each regression coefficient is statistically significant.
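A small sketch of reproducing the t value and its p-value by hand from the summary table (assuming out from above):

# t value = Estimate / Std. Error, compared to a t-distribution with n - p - 1 df
cf <- summary(out)$coefficients
t_val <- cf[, "Estimate"] / cf[, "Std. Error"]
2 * pt(abs(t_val), df = df.residual(out), lower.tail = FALSE)   # matches Pr(>|t|)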

(5) p-value Pr(>|t|)

If this is small, it means the regression coefficient is significant, and the smaller it is the better if you want to show there is a relationship. Depending on how small this value is, it is marked according to the Signif. codes below. Typically the significance level is set at 5%, so having at least one asterisk (*) can be taken as evidence of a relationship; a dot (.) alone only indicates significance at the 10% level. If this value is large and shows no regression relationship, it means that how the regression coefficient was determined is statistically meaningless. Since this p-value comes from the t-distribution, the expression Pr(>|t|) is appropriate.
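The Pr(>|t|) column can also be pulled out of the summary object directly, as in this short sketch, to check which coefficients fall below the usual threshold.

# p-values for each coefficient and a 5% significance check;
# in this data only complaints is below 0.05, while learning falls below 0.10
p_vals <- summary(out)$coefficients[, "Pr(>|t|)"]
p_vals < 0.05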

(6) Adjusted R-squared

This measure indicates how well the model explains the data relative to the number of variables, and higher is better. As the number of variables in a regression increases, the model has more information to fit with, and accordingly the explanatory power $R^{2}$ never decreases. The difference between the adjusted explanatory power and the ordinary explanatory power is that, unlike $\displaystyle R^2 = 1 - {{ \text{SSE} } \over { \text{SST} }}$, $$ R^{2}_{a} = 1 - {{ \text{SSE} / (n - p - 1) } \over { \text{SST} / (n - 1) }} $$ is calculated so as to reflect the number of variables. In actual applications, increasing the number of variables means increasing costs, and naturally a way to reduce this must be found. The adjusted explanatory power $R^{2}_{a}$ applies a penalty for the number of variables, thereby correcting the explanatory power, and is used to compare models. If adding variables does not meaningfully increase the adjusted explanatory power, the addition is considered a waste.
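A brief sketch reproducing both quantities from their definitions (assuming out from above), alongside the values printed by summary():

# SSE, SST, and the two versions of R-squared
sse <- sum(resid(out)^2)
sst <- sum((attitude$rating - mean(attitude$rating))^2)
n <- nrow(attitude); p <- length(coef(out)) - 1    # 6 predictors
1 - sse / sst                                      # R^2
1 - (sse / (n - p - 1)) / (sst / (n - 1))          # adjusted R^2
c(summary(out)$r.squared, summary(out)$adj.r.squared)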

(7) F-Statistic

If its p-value is small, it indicates that the regression as a whole is significant, and the smaller it is the better if you want to show there is a relationship. While the t-test examines each regression coefficient individually, the F-test examines the regression model as a whole. Leaving the p-value aside, the F-statistic itself is handy for comparing models.
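For instance, a partial F-test with anova() compares a reduced model against the full model; this sketch uses the 10%-level reduced model from above, and a large p-value would suggest the dropped variables add little.

# Does adding the remaining predictors significantly improve on the reduced model?
reduced <- lm(rating ~ complaints + learning, data = attitude)
anova(reduced, out)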

It’s important to note that just being able to view these results does not mean the end. Multiple regression analysis is much more difficult than simple regression analysis because the introduction of more independent variables brings problems that were not present in simple regression analysis. There are various verification procedures that remain, which are difficult to explain with numerical calculations alone, and being good at these is essentially what constitutes proficiency at the undergraduate level of regression analysis.

Code

Below is the full example code written in R.

# Check the last rows of the built-in data
tail(attitude)

# Scatterplot matrix of all variables
win.graph()
plot(attitude)

# Fit the full model (the two lm() calls are equivalent)
out <- lm(rating ~ complaints + privileges + learning
          + raises + critical + advance, data = attitude)
out <- lm(rating ~ ., data = attitude)

summary(out)

See Also