
How to Read Logistic Regression Results in R

Practice

Built-in Data

Let's load the turnout data.

(Screenshot: the turnout data loaded in R)

turnout contains data from the 1992 U.S. general election; it records whether each respondent voted (vote) along with their race, age, education level (educate), and income. Since the variable of interest, vote, is binary, logistic regression is a natural choice.
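The data ships with the Zelig package (as in the full code at the end) and can be loaded and inspected like this:

library(Zelig)
data(turnout)
head(turnout)   # columns: race, age, educate, income, vote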

Unlike ordinary regression analysis, logistic regression is fitted with the glm() function. Passing family=binomial() as an option makes glm() perform logistic regression automatically (the binomial family's default link is the logit). The name 'binomial' appears here because the dependent variable is binary, taking only the values 0 and 1.

(Screenshots: the glm() call and its summary() output)
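For reference, the fit shown in the screenshots can be reproduced with the same call as in the full code at the end:

out0 <- glm(vote ~ ., family = binomial(), data = turnout)
summary(out0)   # coefficients, z-tests, null and residual deviance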

Result Interpretation

The output shows both differences from and similarities to ordinary multiple regression analysis. The regression coefficient for each variable is read the same way, but the F-test is gone and something called deviance has appeared. Deviance can be seen as a measure of how poorly the logistic regression model explains the data. The null deviance is the deviance of a model containing only an intercept and no explanatory variables, representing the worst case in which the predictors contribute nothing. The residual deviance of the fitted model is better the smaller it is, and since it approximately follows a chi-squared distribution, the model's fit can be checked with a chi-squared goodness-of-fit test.

Although the name of the glm() function stands for 'generalized linear model', it is still regression at heart, so multicollinearity cannot be overlooked. Fortunately, the VIF is easy to compute with the vif() function from the car package. The example appears to have no multicollinearity problem.
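As a minimal sketch using the out0 fit and the components stored in every glm object:

library(car)
vif(out0)   # values near 1 suggest no multicollinearity problem

# p-value for the drop from the null deviance to the residual deviance
with(out0, pchisq(null.deviance - deviance,
                  df.null - df.residual, lower.tail = FALSE))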

Model Diagnostics

On the other hand, the residual plot looks quite alarming, but fortunately the raw residual plot carries no diagnostic meaning in logistic regression: with a 0/1 response, the residuals necessarily split into two bands, one for each outcome. There is therefore no need to check it, and no cause for worry if it looks strange. The goodness-of-fit tests discussed below are sufficient.
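If you are curious what it looks like, a minimal sketch using the out0 fit:

plot(fitted(out0), residuals(out0, type = "response"),
     xlab = "fitted probability", ylab = "raw residual")
# the points fall into two bands, one for vote = 1 and one for vote = 0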

Now, let's actually check if the model fits using a few methods.

(Screenshot: chi-squared critical value computed with qchisq())

Conducting a chi-squared goodness-of-fit test is very simple. To test at significance level $\alpha$, compute the critical value by plugging $(1 - \alpha)$ and the degrees of freedom of the residual deviance into the qchisq() function. If the residual deviance is smaller than this critical value, as above, the model is judged to fit.
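A sketch at significance level $\alpha = 0.05$ (this model has 1995 residual degrees of freedom):

alpha <- 0.05
qchisq(1 - alpha, df = out0$df.residual)   # critical value
deviance(out0)   # residual deviance; smaller than the critical value means the model fits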

(Screenshot: hoslem.test() output)

The Hosmer-Lemeshow goodness-of-fit test is a classic goodness-of-fit test for logistic regression, and it too assesses fit through a chi-squared statistic. The hoslem.test() function from the ResourceSelection package runs the test given the model's observed dependent variable and its fitted values. The null hypothesis is 'the model fits', so when the p-value is large and the null hypothesis cannot be rejected, as above, the model is judged to fit. However, according to Frank Harrell, the test has several weaknesses and is no longer recommended.¹
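The call, as in the full code at the end:

library(ResourceSelection)
hoslem.test(out0$y, fitted(out0))   # null hypothesis: the model fits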

(Screenshot: lrm() output showing the model likelihood ratio test)

The lrm() function from the rms package reports a likelihood ratio test for the model. Unlike the previous two methods, where the null hypothesis is 'the model fits', here the null hypothesis is that the predictors explain nothing beyond the intercept, so rejecting it counts in the model's favor. If you are already using the rms package, you need not fit a logistic regression model with glm() and pass its output to lrm(); you can give lrm() the formula directly. For example, entering lrm(vote~.,data=turnout) performs the whole process at once.
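Concretely, skipping glm() altogether:

library(rms)
lrm(vote ~ ., data = turnout)   # fits the model and reports the likelihood ratio test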

Prediction

Finally, let's compute probabilities directly from the logistic regression model. The probabilities are obtained simply by adding the type='response' option to the predict() function.

(Screenshot: predicted probabilities from predict())

Of course, since we already have the data and know the true values, such in-sample prediction does not mean much. Hence it is worth supplying new data via the newdata option and checking whether the model still predicts well.
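As an illustration, here is a hypothetical new respondent; the values below are made up, and I am assuming race is a factor with a 'white' level as in the Zelig turnout data:

newcase <- data.frame(race = "white", age = 30, educate = 12, income = 4)   # hypothetical values
predict(out0, newdata = newcase, type = "response")   # predicted probability of voting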

Code

Here is the example code.

install.packages("Zelig")
install.packages("car")
install.packages("ResourceSelection")
install.packages('rms')
 
library(Zelig)
data(turnout); head(turnout)
out0<-glm(vote~.,family=binomial(),data=turnout); summary(out0)
 
library(car)
vif(out0)
 
win.graph(4,4); plot(out0$residuals,main="잔차그림")
 
qchisq(0.95,df=1995)
 
library(ResourceSelection)
hoslem.test(out0$y,fitted(out0))
 
library(rms)
lrm(out0)
lrm(vote~.,data=turnout)
 
predict(out0,type='response\')

See Also