Resolving not defined because of singularities in R Regression Analysis

If you major in statistics or mathematics, do not stop at roughly identifying the cause and fixing the immediate problem; it is strongly recommended that you also understand the mathematical proof.

Error

Diagnosis

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5723     0.1064   5.381 4.98e-05 ***
최고기온     -0.3528     0.1490  -2.368    0.030 *
최저기온      0.2982     0.1955   1.525    0.146
일교차            NA         NA      NA       NA

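When you run a regression in R, some coefficients may fail to be estimated: they are reported as NA, and the coefficient table carries the note not defined because of singularities. In the output above, 최고기온 is the daily high temperature, 최저기온 is the daily low temperature, and 일교차 is the daily temperature range.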

Cause

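For a matrix $X \in \mathbb{R}^{m \times n}$ with $m \ge n$, the inverse of $X^T X$ exists if and only if $X$ has full column rank. $$ \exists \left( X^{T} X \right)^{-1} \iff \text{rank} X = n $$ In regression, $X$ is the design matrix. If one independent variable is a linear combination of the others, then $\text{rank} X < n$, so $X^T X$ is singular and the coefficients cannot be determined uniquely.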
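
As a supplementary sketch (not the full proof recommended above), the singularity can be checked directly for the example later in this post. The notation is an assumption made here: $\mathbf{1}$ is the intercept column and $x_1, x_2, x_3$ are the columns 최고기온, 최저기온, and 일교차, with $x_3 = x_1 - x_2$.

$$ X = \begin{bmatrix} \mathbf{1} & x_1 & x_2 & x_3 \end{bmatrix}, \qquad v = \begin{bmatrix} 0 & 1 & -1 & -1 \end{bmatrix}^{T} \implies X v = x_1 - x_2 - x_3 = \mathbf{0} $$

$$ X^{T} X v = X^{T} \mathbf{0} = \mathbf{0}, \quad v \ne \mathbf{0} \implies \det \left( X^{T} X \right) = 0 $$

Hence $\text{rank} X \le 3 < 4 = n$, the inverse $\left( X^{T} X \right)^{-1}$ does not exist, and the least squares formula $\hat{\beta} = \left( X^{T} X \right)^{-1} X^{T} y$ cannot be evaluated.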

This is a common mistake for beginners, who, as in the example, typically run into it when they get creative and build derived variables. Once you understand the reason it is mathematically obvious, but it is a natural mistake for students whose statistical intuition is only beginning to form.

You are not the only one to make this mistake; others have made it, and so have I. What matters is that, once you have solved the problem, you come to appreciate the importance of mathematics, especially matrix algebra and linear algebra, and use that as motivation to study the theory thoroughly.

A typical scenario can be imagined as follows:

> data = as.data.frame(matrix(runif(60),20,3))
> names(data) <- c("감기확률", "최고기온", "최저기온")
> lm(감기확률 ~ 최고기온 + 최저기온, data = data)

Call:
lm(formula = 감기확률 ~ 최고기온 + 최저기온, data = data)

Coefficients:
(Intercept)     최고기온     최저기온
     0.5723      -0.3528       0.2982

For example, suppose you want to explain the probability of catching a cold (감기확률) using temperature data: the daily high (최고기온) and the daily low (최저기온).

> data$일교차 <- (data$최고기온 - data$최저기온)
> out <- lm(감기확률 ~ 최고기온 + 최저기온 + 일교차, data = data)

However, it is commonly held that cold season has more to do with a large gap between daytime and nighttime temperatures than with low temperatures alone. So suppose you add the daily temperature range (일교차) as a derived variable, as in the code above. You then get the following results.

> summary(out)

Call:
lm(formula = 감기확률 ~ 최고기온 + 최저기온 + 일교차, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.41562 -0.03316  0.00506  0.10834  0.35714

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5723     0.1064   5.381 4.98e-05 ***
최고기온     -0.3528     0.1490  -2.368    0.030 *
최저기온      0.2982     0.1955   1.525    0.146
일교차            NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1988 on 17 degrees of freedom
Multiple R-squared:  0.2779,    Adjusted R-squared:  0.193
F-statistic: 3.271 on 2 and 17 DF,  p-value: 0.06281
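
The NA row can be traced back to the design matrix with a few standard calls. The following is a minimal sketch, not the article's own code: the English column names (`cold`, `high`, `low`, `temp_range`) are hypothetical stand-ins for 감기확률, 최고기온, 최저기온, and 일교차, and the seed is an added assumption, so the numbers will not match the output above.

```r
# Hypothetical English stand-ins for the article's columns
set.seed(1)  # assumption: seed added so the sketch is repeatable
df <- data.frame(cold = runif(20), high = runif(20), low = runif(20))
df$temp_range <- df$high - df$low   # derived variable, exactly collinear

fit <- lm(cold ~ high + low + temp_range, data = df)

# 1. Which coefficients were dropped? They are NA in coef().
dropped <- names(coef(fit))[is.na(coef(fit))]

# 2. Rank of the design matrix versus its number of columns.
X <- model.matrix(fit)
rank_X <- qr(X)$rank
n_cols <- ncol(X)

# 3. alias() expresses each dropped term through the retained ones.
dependencies <- alias(fit)$Complete

print(dropped)
print(c(rank = rank_X, columns = n_cols))
print(dependencies)
```

A rank smaller than the number of columns is exactly the failed condition $\text{rank} X = n$ from the Cause section, and the `alias()` result shows which retained variables the dropped one depends on.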

Solution

If possible, remove the new derived variable. If you definitely want to keep it, remove at least one of the original independent variables used to create it, so that the remaining columns are linearly independent again. If it makes intuitive sense, you can also change how the derived variable is built, for example by applying a non-linear function, so that it is no longer a linear combination of the others.
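
The three options can be sketched as follows. This is an illustration under assumptions rather than a recommended model: the English column names are hypothetical stand-ins for the Korean ones, the seed is added for repeatability, and the squared range is only one arbitrary choice of non-linear function.

```r
# Hypothetical English stand-ins for the article's columns
set.seed(1)  # assumption: seed added so the sketch is repeatable
df <- data.frame(cold = runif(20), high = runif(20), low = runif(20))
df$temp_range <- df$high - df$low

# Option 1: drop the derived variable and keep the originals.
fit_original <- lm(cold ~ high + low, data = df)

# Option 2: keep the derived variable and drop one original used to build it.
fit_derived <- lm(cold ~ low + temp_range, data = df)

# Option 3: use a non-linear derived variable (squared range, for illustration).
df$range_sq <- df$temp_range^2
fit_nonlinear <- lm(cold ~ high + low + range_sq, data = df)

# Check that no coefficient is NA in any of the three fits.
fits <- list(original = fit_original, derived = fit_derived,
             nonlinear = fit_nonlinear)
has_na <- sapply(fits, function(f) anyNA(coef(f)))
print(has_na)
```

Options 1 and 2 span the same column space, so they produce the same fitted values; only the interpretation of the coefficients changes. Option 3 is a genuinely different model and should be used only when the transformed variable has a sensible interpretation.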

Code

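# Simulated data: 20 rows of uniform random numbers in three columns
data <- as.data.frame(matrix(runif(60),20,3))
# 감기확률: cold probability, 최고기온: daily high, 최저기온: daily low
names(data) <- c("감기확률", "최고기온", "최저기온")

# Full-rank model: every coefficient is estimated
lm(감기확률 ~ 최고기온 + 최저기온, data = data)

# 일교차 (daily range) is an exact linear combination of the two predictors
data$일교차 <- (data$최고기온 - data$최저기온)
# Rank-deficient model: the 일교차 coefficient comes out as NA
out <- lm(감기확률 ~ 최고기온 + 최저기온 + 일교차, data = data)
summary(out)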

Environment

  • OS: Windows 11
  • R: v4.1.1