Resolving 'not defined because of singularities' in R Regression Analysis
If you are majoring in statistics or mathematics, it is strongly recommended not to stop at roughly identifying the cause and patching the problem at hand, but also to understand the mathematical reason behind it.
Error
Diagnosis
Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5723     0.1064   5.381 4.98e-05 ***
최고기온      -0.3528     0.1490  -2.368    0.030 *  
최저기온       0.2982     0.1955   1.525    0.146    
일교차             NA         NA      NA       NA    
When performing regression analysis in R, coefficient estimation can fail with the message `not defined because of singularities`.
Cause
- The design matrix $X$ does not have full rank, so the inverse of $X^{T} X$ needed for the least squares solution does not exist. Hence the error message mentioning 'singularities'.
- Simply put, it means there is multicollinearity.
- Even more simply put, the independent variables are not independent.
For a matrix $X \in \mathbb{R}^{m \times n}$ with $m \ge n$, the necessary and sufficient condition for the inverse of $X^{T} X$ to exist is that $X$ has full column rank: $$ \exists \left( X^{T} X \right)^{-1} \iff \operatorname{rank} X = n $$
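This rank condition is easy to verify numerically. A minimal sketch in R (the variable names and the seed are illustrative assumptions, not the original data):

```r
# A design matrix whose last column is an exact linear combination
# of two others, mirroring 일교차 = 최고기온 - 최저기온.
set.seed(1)                      # illustrative seed, only for reproducibility
max_t <- runif(20)               # stands in for 최고기온
min_t <- runif(20)               # stands in for 최저기온
X <- cbind(intercept = 1, max_t, min_t, range_t = max_t - min_t)

qr(X)$rank                       # 3, although ncol(X) is 4: rank deficient
```

Since $\operatorname{rank} X < n$, the matrix $X^{T} X$ is singular and has no inverse.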
This is a common mistake among beginners who, as in the example below, run into it while trying to craft derived variables with some ingenuity. Once the reason is understood it is mathematically obvious, but such errors are only natural for students whose statistical intuition is still taking shape.
You are not the only one to have made this mistake; others have, and so have I. What matters is that, once the issue is solved, you come to appreciate the importance of mathematics, matrix algebra and linear algebra in particular, and take this as motivation for thorough theoretical study.
A typical scenario can be imagined as follows:
> data = as.data.frame(matrix(runif(60),20,3))
> names(data) <- c("감기확률", "최고기온", "최저기온")
> lm(감기확률 ~ 최고기온 + 최저기온, data = data)
Call:
lm(formula = 감기확률 ~ 최고기온 + 최저기온, data = data)
Coefficients:
(Intercept)     최고기온     최저기온  
     0.5723      -0.3528       0.2982  
Suppose, for example, that the data describe the probability of catching a cold in terms of temperature.
> data$일교차 <- (data$최고기온 - data$최저기온)
> out <- lm(감기확률 ~ 최고기온 + 최저기온 + 일교차, data = data)
However, as is commonly known, the season when colds are prevalent has more to do with a large day-night temperature difference than with low temperatures alone. So, adding the daily range as a derived variable and refitting gives the following results.
> summary(out)
Call:
lm(formula = 감기확률 ~ 최고기온 + 최저기온 + 일교차, data = data)
Residuals:
Min 1Q Median 3Q Max
-0.41562 -0.03316 0.00506 0.10834 0.35714
Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5723     0.1064   5.381 4.98e-05 ***
최고기온      -0.3528     0.1490  -2.368    0.030 *  
최저기온       0.2982     0.1955   1.525    0.146    
일교차             NA         NA      NA       NA    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1988 on 17 degrees of freedom
Multiple R-squared: 0.2779, Adjusted R-squared: 0.193
F-statistic: 3.271 on 2 and 17 DF, p-value: 0.06281
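Before removing anything, you can ask R exactly which term is aliased: base R's `alias()` reports the linear dependency in an `lm` fit. A sketch reconstructing the example above (the seed is an assumption, used only for reproducibility):

```r
set.seed(1)                                   # illustrative seed
data <- as.data.frame(matrix(runif(60), 20, 3))
names(data) <- c("감기확률", "최고기온", "최저기온")
data$일교차 <- data$최고기온 - data$최저기온
out <- lm(감기확률 ~ 최고기온 + 최저기온 + 일교차, data = data)

# The Complete component lists each aliased coefficient as a linear
# combination of the others: here 일교차 = 최고기온 - 최저기온.
alias(out)$Complete
```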
Solution
If possible, drop the derived variable; if you definitely want to keep it, drop the original independent variables used to construct it instead. When it makes intuitive sense, you can also change how the derived variable is constructed, for example by applying a nonlinear function, so that it is no longer a linear combination of the originals.
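As a sketch on the simulated data above (seed assumed only for reproducibility): keeping the derived variable 일교차 but dropping 최고기온 restores a full-rank design, so every coefficient is estimated.

```r
set.seed(1)                                   # illustrative seed
data <- as.data.frame(matrix(runif(60), 20, 3))
names(data) <- c("감기확률", "최고기온", "최저기온")
data$일교차 <- data$최고기온 - data$최저기온

# Keep the derived variable, drop one of its components instead:
out2 <- lm(감기확률 ~ 최저기온 + 일교차, data = data)
coef(out2)                                    # no NA entries now
```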
Code
# Simulate 20 observations of 3 uniform variables
data = as.data.frame(matrix(runif(60), 20, 3))
names(data) <- c("감기확률", "최고기온", "최저기온")
# Two independent regressors: this fit has no problem
lm(감기확률 ~ 최고기온 + 최저기온, data = data)
# Derived variable: an exact linear combination of the other two
data$일교차 <- (data$최고기온 - data$최저기온)
out <- lm(감기확률 ~ 최고기온 + 최저기온 + 일교차, data = data)
summary(out)
Environment
- OS: Windows 11
- R: v4.1.1