Remove NA in R
Overview
In R, NA stands for Not Available and marks a missing value. Note that its meaning and usage are quite different from null in other programming languages. Textbook data sets are usually tidy and ready for analysis, but real-world data rarely is, and handling it frequently involves dealing with missing values. Missing values are typically either imputed, that is, filled in with plausible values based on similar observations, or removed outright.
As always in statistics, there is no one-size-fits-all method, and the methods themselves are subjects of ongoing research. This post does not attempt to cover all of that; it introduces a few simple ways to handle missing values in R.
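To make the difference between NA and an empty object concrete, here is a minimal sketch using only base R. The vector x is made up for illustration and is not part of the example file.

```r
# NA is a placeholder of length one; NULL is an empty object of length zero.
length(NA)      # 1
length(NULL)    # 0

# NA propagates through arithmetic and comparisons.
NA + 1          # NA
NA == NA        # NA, so use is.na() to test for missingness
is.na(NA)       # TRUE

# Summary functions return NA unless told to drop missing values.
x <- c(1, 2, NA, 4)     # hypothetical data
mean(x)                 # NA
mean(x, na.rm = TRUE)   # mean of 1, 2 and 4
```

Because a single NA is enough to turn a summary into NA, missing values usually have to be dealt with before any real analysis starts.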
Tips
Below is how the example file example526.csv looks after loading.
It can be observed that cells with no input values are displayed as the missing value NA without any special settings.
Inspecting the structure of the data shows that X1, which looks like continuous data, is read as categorical (a factor, or a character column in R 4.0 and later). This is because X1 contains the string “?”, which is effectively “not empty but missing”.
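The effect can be reproduced without the original file. The sketch below assumes a small made-up CSV with the same kinds of problems (an empty cell, a “?”, and a -99); csv_text and its values are hypothetical and are not the contents of example526.csv.

```r
# Hypothetical stand-in for example526.csv (the real file is not reproduced here).
csv_text <- "X1,X2
1.5,10
?,20
3.2,
4.1,-99"

raw <- read.csv(text = csv_text)
str(raw)

class(raw$X1)       # "character" in R 4.0 and later, "factor" in older versions
is.numeric(raw$X1)  # FALSE: a single "?" keeps the column from being numeric
is.na(raw$X2[3])    # TRUE: the empty cell in a numeric column is read as NA
```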
na.strings
In such cases, the na.strings option of read.csv() lets one specify a vector of strings to be treated as missing values when loading the file. Once “?” is designated as a missing value in the code, the structure of the data looks normal.
Meanwhile, at the bottom of X2 there is -99, a strikingly odd value. Compared with the other observations, it is less likely to be an outlier than a code for a missing value.
Of course, one could extend na.strings with another entry as before, but that becomes inconvenient when only a few such values occur or when there are many different kinds of them.
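As a rough illustration of na.strings, the sketch below reads a made-up inline CSV twice: once with “?” alone, and once with -99 added to the list. The data in csv_text is an assumption for demonstration, not the actual example file.

```r
# Hypothetical stand-in for example526.csv.
csv_text <- "X1,X2
1.5,10
?,20
3.2,
4.1,-99"

# Treat "?" as missing: X1 is now read as a numeric column.
fixed <- read.csv(text = csv_text, na.strings = "?")
str(fixed)

# Extend the list so that -99 is also read as NA.
# Note: supplying na.strings replaces the default "NA", so add "NA" back
# to the vector if the file also contains literal NA strings.
both <- read.csv(text = csv_text, na.strings = c("?", "-99"))
both
```

The strings are matched against whole fields, so a cell written as -99.0 would not be caught by "-99"; this is one reason the list can become tedious to maintain.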
Instead, one can directly overwrite the values that meet a certain condition with NA, as in the code at the end of this post. Since there were no other large negative values besides -99, all values less than 0 were changed to NA. It might seem like a stopgap, but looked at another way, it is a precise adjustment that touches only the necessary parts.
Being able to use such minor, seemingly trivial techniques freely is a skill in itself.
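The conditional overwrite can be sketched on a small made-up data frame. Both the data frame df and the column-specific variant are illustrative assumptions; the post itself only uses the whole-frame version.

```r
# Hypothetical data with the same flavour as the example file.
df <- data.frame(X1 = c(1.5, NA, 3.2, 4.1),
                 X2 = c(10, 20, NA, -99))

# Whole-frame version: every value below 0 becomes NA.
all_cols <- df
all_cols[all_cols < 0] <- NA

# Narrower variant: only the sentinel -99 in column X2 becomes NA.
# which() drops the NA comparisons, leaving only the matching positions.
one_col <- df
one_col$X2[which(one_col$X2 == -99)] <- NA

all_cols
one_col
```

The narrower variant is safer when legitimate negative values may appear elsewhere in the data, at the cost of having to name each column and sentinel explicitly.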
is.na
Checking with the is.na() function confirms that the odd values have all been changed to NA.
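On anything larger than a toy data set, the full logical matrix from is.na() is hard to read, so it is often summarised. A minimal sketch on made-up data follows; df is hypothetical, and colSums() and complete.cases() are base R functions that the post itself does not use.

```r
# Hypothetical data after the odd values have been turned into NA.
df <- data.frame(X1 = c(1.5, NA, 3.2, 4.1),
                 X2 = c(10, 20, NA, NA))

is.na(df)                    # logical matrix, TRUE where a value is missing
any(is.na(df))               # is anything missing at all?
colSums(is.na(df))           # number of missing values per column
which(!complete.cases(df))   # row numbers that contain at least one NA
```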
In most cases, there’s no need for observations containing these missing values, so what remains is their removal.
There are various creative ways to do this, but R kindly provides a basic function for this purpose.
na.omit
Simply passing the data to the na.omit() function removes every observation that contains even one NA. Here, rows 6, 9, and 10 disappear. These aren’t particularly astonishing tips, but if they are not well known or are used improperly, one can end up losing a tremendous amount of time even before starting the actual analysis. It’s worth making them one’s own and using them effectively.
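A small sketch of na.omit() on made-up data is shown below. The data frame df is hypothetical; the comparison with complete.cases() is added only to show what na.omit() amounts to for a data frame.

```r
# Hypothetical data: rows 2, 3 and 4 each contain a missing value.
df <- data.frame(X1 = c(1.5, NA, 3.2, 4.1, 5.0),
                 X2 = c(10, 20, NA, NA, 50))

# na.omit() returns a new data frame; assign the result to keep it.
clean <- na.omit(df)
clean
rownames(clean)            # original row labels are kept
attr(clean, "na.action")   # records which rows were dropped

# Equivalent row selection with complete.cases().
same <- df[complete.cases(df), ]
```

Note that na.omit() does not modify its argument, so calling it without assigning the result only prints the cleaned data.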
Code
Below is the example code.
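# Load the file as-is; the "?" in X1 keeps that column from being read as numeric
example526 <- read.csv(file = file.choose()); example526
str(example526)
# Reload the same file, treating "?" as a missing value
example526 <- read.csv(file = file.choose(), na.strings = c("?")); example526
str(example526)
# Overwrite negative values such as -99 with NA
example526[example526 < 0] <- NA; example526
# Check which cells are missing
is.na(example526)
example526
# Drop every row that contains at least one NA (the result is printed, not stored)
na.omit(example526)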