Remove NA in R

Overview

In R, NA stands for "Not Available" and marks a missing value. Note that its meaning and usage differ from null in other programming languages: NA is a placeholder for a value that should exist but is unknown, not the absence of an object. Textbook examples are usually tidy and ready for analysis, but real-world data rarely is, so handling it frequently means dealing with missing values. These are either filled in by imputation, that is, estimated from observations that show a similar trend, or removed outright.
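To make the difference between NA and a null value concrete, here is a minimal sketch. The vectors x and y are made up for illustration and are not part of the example file.

```r
# NA occupies a position in a vector: the value exists but is unknown
x <- c(1, NA, 3)
length(x)   # 3, the NA still counts as an element
is.na(x)    # FALSE TRUE FALSE

# NULL is the absence of an object and simply vanishes inside c()
y <- c(1, NULL, 3)
length(y)   # 2

# NA propagates through calculations unless it is explicitly dropped
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 2
```

Because NA propagates like this, a single missing cell can turn a whole summary into NA, which is why missing values have to be handled before the analysis starts.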

As always in statistics, there is no one-size-fits-all method, and the methods themselves are an active research topic. This post cannot cover all of that, but it introduces a few simple ways to handle missing values in R.

Tips

Below is how the example file example526.csv looks after loading.

20180413\_135239.png 20180413\_135244.png

Cells with no value entered are displayed as the missing value NA without any special settings. (read.csv() does this for blank fields in logical, integer, numeric, and complex columns.)

Inspecting the structure of the data shows that X1, which looks like continuous data, has been read as categorical (a factor; R 4.0.0 and later read it as character by default). This is because X1 contains the string “?”, a cell that is not empty but still represents a missing value.
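Since example526.csv itself is not reproduced here, the behaviour can be sketched with a small made-up CSV passed through the text argument. The contents of csv_text are an assumption for illustration only, and stringsAsFactors = TRUE is set explicitly to mimic the pre-4.0 default that the post was written under.

```r
# Hypothetical stand-in for the example file (not the real example526.csv)
csv_text <- "X1,X2
1.5,10
?,20
3.1,
4.2,40"

df <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(df)

# The blank cell in the numeric column X2 is already NA
is.na(df$X2)

# The "?" forces X1 to be read as text, so it ends up as a factor
class(df$X1)
```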

na.strings

20180413\_140116.png 20180413\_140107.png

In such cases, the na.strings argument can be used when loading the file to specify a vector of strings to be treated as missing values. Once “?” is designated as a missing value in the code, the structure of the data looks normal. Meanwhile, at the bottom of X2 there is a -99, a strikingly odd value. Compared with the other observations, it is less likely a genuine outlier than a code for a missing value.
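The effect of na.strings can be sketched on a made-up CSV. The data in csv_text is hypothetical and only mirrors the situation described above.

```r
# Hypothetical data: "?" marks a missing X1, and -99 sits at the bottom of X2
csv_text <- "X1,X2
1.5,10
?,20
3.1,30
4.2,-99"

raw   <- read.csv(text = csv_text)                    # "?" kept as text
clean <- read.csv(text = csv_text, na.strings = "?")  # "?" read as NA

class(raw$X1)     # not numeric: factor or character, depending on R version
class(clean$X1)   # "numeric"
clean$X1          # the second element is now NA
clean$X2          # -99 is still there, untouched
```

One detail to keep in mind: the default for na.strings is "NA", and supplying a new vector replaces that default rather than extending it. If the file may also contain the literal text NA, c("?", "NA") keeps both.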

Of course, one could add it to na.strings as before, but that becomes inconvenient when only a handful of values are affected or when there are many different codes to list.
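For comparison, extending na.strings would look roughly like this. Again, csv_text is made-up data and not the real file.

```r
# Hypothetical data with two kinds of missing-value codes
csv_text <- "X1,X2
1.5,10
?,20
3.1,30
4.2,-99"

# Every code has to be listed as a string, and the rule applies to all columns
df <- read.csv(text = csv_text, na.strings = c("?", "-99", "NA"))

df$X1   # "?" became NA
df$X2   # "-99" became NA
```

The match is made against the text of each field, so a cell written as -99.0 would not be caught by "-99", and a genuine -99 in any other column would be turned into NA as well.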

20180413\_141017.png

Therefore, one can directly overwrite values with NA under a given condition, as shown above. Since the data contained no negative values other than -99, the approach taken was to change every value less than 0 to NA. It might look like a stopgap, but seen another way, it is a precise adjustment that touches only what needs to change.
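The conditional overwrite can be sketched on a small made-up data frame. The narrower, column-specific variant is an alternative added here for comparison, not something the original code uses.

```r
# Hypothetical data frame standing in for the loaded file
df <- data.frame(X1 = c(1.5, NA, 3.1, 4.2),
                 X2 = c(10, 20, 30, -99))

# Blanket rule from the post: every negative cell in any column becomes NA
by_rule <- df
by_rule[by_rule < 0] <- NA

# Narrower alternative: only the code -99, and only in column X2
by_code <- df
by_code$X2[by_code$X2 %in% -99] <- NA

by_rule
by_code
```

The blanket rule is shorter, but it is only safe when no column contains legitimate negative values. The targeted form is the more cautious choice when that is not guaranteed.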

Being able to use such small, seemingly trivial techniques freely is a skill in itself.

is.na

20180413\_141747.png

Checking with the is.na() function now confirms that the odd values have all been changed to NA.
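On a larger data set the full TRUE/FALSE matrix is hard to read, so it is often summarised. A minimal sketch on a made-up data frame:

```r
# Hypothetical data frame with one NA in each column
df <- data.frame(X1 = c(1.5, NA, 3.1, 4.2),
                 X2 = c(10, 20, 30, NA))

is.na(df)                    # logical matrix, TRUE where a cell is missing
any(is.na(df))               # is anything missing at all?
colSums(is.na(df))           # number of missing cells per column
which(!complete.cases(df))   # rows that contain at least one NA
```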

In most cases, observations containing missing values are not needed, so what remains is to remove them.

There are various creative ways to do this, but R conveniently provides a base function for exactly this purpose.

na.omit

20180413\_142327.png 20180413\_142040.png

Passing the data to the na.omit() function removes every observation that contains at least one NA. Rows 6, 9, and 10 disappear. These are not particularly astonishing tips, but not knowing them, or using them incorrectly, can cost a tremendous amount of time before the actual analysis even begins. It is worth making them one’s own and using them effectively.
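A minimal sketch of what na.omit() does, using a made-up data frame rather than the example file:

```r
# Hypothetical data frame: rows 2 and 4 each contain an NA
df <- data.frame(X1 = c(1.5, NA, 3.1, 4.2, 5.0),
                 X2 = c(10, 20, 30, NA, 50))

cleaned <- na.omit(df)
cleaned                      # only the complete rows remain
rownames(cleaned)            # original row labels are kept, not renumbered
attr(cleaned, "na.action")   # records which rows were dropped

# Equivalent row selection with complete.cases()
same_rows <- df[complete.cases(df), ]
```

Because the row labels are preserved, gaps in the row numbers of the result show at a glance which observations were removed.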

Code

Below is the example code.

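# Load the raw file and inspect it; the "?" keeps X1 from being read as numeric
example526 <- read.csv(file = file.choose()); example526
str(example526)

# Reload the file, treating "?" as a missing value
example526 <- read.csv(file = file.choose(), na.strings = c("?")); example526
str(example526)

# Overwrite every negative value (here only -99) with NA
example526[example526 < 0] <- NA; example526

# TRUE marks the missing cells
is.na(example526)

# Compare the data before and after dropping the rows that contain NA
example526
na.omit(example526)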