Missing Values - NA

packages

library(tidyr)

NA/NULL

One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e. missing data).

Recall that is.na(x) yields a vector of logical values the same length as x

with TRUEs corresponding to NA values in x
FALSEs corresponding to non-NA values in x

Data

Create vector

If you remember how to create a vector

x <- c(44, NA, 5, NA)
x * 3

[1] 132  NA  15  NA

rnorm

The random distribution funtion used to create a sample. rnorm and sample are used here to generate data, will explain them in another post.

#lets create a vector containing 1000 draws from a standard normal distribution with
y <- rnorm(1000)

# let's create a vector containing 1000 NAs with
z <- rep(NA, 1000) 

#let's select 100 elements at random from these 2000 values
my_data <- sample(c(y, z), 100)
my_data

  [1]          NA          NA  2.76552461          NA          NA  1.21716498
  [7]  0.80532026          NA          NA -0.28116520  1.53568712  2.76295041
 [13]          NA -0.25878755  2.36316431  0.54279804 -1.10242431          NA
 [19]          NA          NA          NA          NA          NA -1.06500181
 [25]  1.75205988          NA -1.72808148  1.16218401          NA          NA
 [31]          NA          NA          NA  0.24552302 -0.14939510 -0.52561624
 [37]  0.37860627  1.64047423          NA  1.43886132 -0.60944337 -1.23574959
 [43]  0.94679565          NA          NA  2.52863837 -0.31885970          NA
 [49]          NA -0.20846389 -0.21922585          NA          NA          NA
 [55] -0.60070487 -0.06217002          NA          NA -0.09281883  0.82695007
 [61]          NA          NA  1.63159043 -0.09032553  1.29332147  1.46937981
 [67]  2.04020180 -1.05453297 -0.09270116          NA          NA          NA
 [73]  1.82470402          NA          NA          NA          NA          NA
 [79]          NA  0.38708450  2.66791134          NA          NA          NA
 [85]          NA          NA          NA          NA          NA  0.42807450
 [91]          NA  0.40769939 -0.05880791 -0.71321360  0.68451529          NA
 [97]  1.19420064          NA -0.97583876  0.78379838

Look for NA

is.na

Where are our NAs located in our data? The is.na() function tells us whether each element of a vector is NA

my_na <- is.na(my_data)
my_na

  [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
 [13]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
 [25] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
 [37] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE
 [49]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [61]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
 [73] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
 [85]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [97] FALSE  TRUE FALSE FALSE

count

Another way of counting how many rows have NA values is by using the sum() function. Since TRUE and FALSE are represented by 1 and 0 using the sum(of true values) would give us the count.
Underneath the surface, R represents TRUE as the number 1 and FALSE as the number 0.
Therefore, if we take the sum of a bunch of TRUEs and FALSEs, we get the total number of TRUEs or the COUNT of NA or non-NA values/rows based on how it’s used

sum(my_na)

[1] 51

Extract

!is.na

Just as we learned to use is.na we can use !is.na which is the negativity of it to extract what IS NOT NA from the sample and store it in non_na

non_na <- my_data[!is.na(my_data)]
non_na

 [1]  2.76552461  1.21716498  0.80532026 -0.28116520  1.53568712  2.76295041
 [7] -0.25878755  2.36316431  0.54279804 -1.10242431 -1.06500181  1.75205988
[13] -1.72808148  1.16218401  0.24552302 -0.14939510 -0.52561624  0.37860627
[19]  1.64047423  1.43886132 -0.60944337 -1.23574959  0.94679565  2.52863837
[25] -0.31885970 -0.20846389 -0.21922585 -0.60070487 -0.06217002 -0.09281883
[31]  0.82695007  1.63159043 -0.09032553  1.29332147  1.46937981  2.04020180
[37] -1.05453297 -0.09270116  1.82470402  0.38708450  2.66791134  0.42807450
[43]  0.40769939 -0.05880791 -0.71321360  0.68451529  1.19420064 -0.97583876
[49]  0.78379838

Remove

As part of tidyr: “drop_na() drops rows where any column specified by … contains a missing value” ### drop_na

drop_na(data, ...)

data is data frame
… Columns to inspect for missing values. If empty, all columns are used.
In other words it only keeps the COMPLETE ROWS (where no rows contain missing values)
Similar in a sense to complete.cases()

library(tidyr)
df <- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
#__________This will drop NA from all columns
df1 <- df |> drop_na() 
df1

  x y
1 1 a

#__________To remove NA from column x
df2 <- df |> drop_na(x)
df2

  x    y
1 1    a
2 2 <NA>

You have to be cautious as to when to use an operation on df if you are using any of the functions that remove NA. For example we might want to calculate the mean of column (x). In that case we don’t use drop_na() because it will remove rows from x that have values. See example below in na.omit()

na.omit()

If df is oz, self explanatory really, this function just omits all rows that have NA as value. So similar to drop_na it omits the entire rows that have at least one NA

oz <- na.omit(oz)

Here is another way to use na.omit(), inside another function.

Let’s say we want to bind two dfs together and then
we want to calculate the mean of column (x)
If my dataset has millions of rows, I prefer to remove the NA from the column I will use to calculate the mean on before I bind them together.
If you have a few thousands rows and a couple columns you might want to consider the time element of running the operation is all I’m saying.

# Let's say we have 2 dfs and want to calculate the sum() of column x, we also need to remove NAs from it prior
df1 <- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
df2 <- data.frame(x = c(24,NA, -12), y = c(NA, 4, "z"))
df1

   x    y
1  1    a
2  2 <NA>
3 NA    b

df2

    x    y
1  24 <NA>
2  NA    4
3 -12    z

# Let's rbind the two while we remove NA values and take the sum of all what's left in x
bound <- na.omit(rbind(df1,df2))
bound

    x y
1   1 a
6 -12 z

sum(bound$x)

[1] -11

Look in theHow To page for a real project using both na.omit() [mean function example] and complete.cases() in the [complete case example] ### na.rm = TRUE

This once again is another way to remove NA from your data. This could be used inside another function, just like you can use na.omit() inside another function.. Let’s say we want to get the sum() or mean() or whatever of column (x), omitting NA values in that specific column ONLY, column (x). Here is a short way to do it. As you see the value is not -11.

#___________Same dfs as before, we'll rbind them first, then we'll sum column(x) after na.rm=TRUE is executed
df1 <- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
df2 <- data.frame(x = c(24,NA, -12), y = c(NA, 4, "z"))
boundall <- rbind(df1,df2) 
boundall

    x    y
1   1    a
2   2 <NA>
3  NA    b
4  24 <NA>
5  NA    4
6 -12    z

# see the value of boundall compared to bound above
sum(boundall$x, na.rm = TRUE)

[1] 15

complete.cases

complete.cases(…) returns a logical vector indicating which cases are complete, i.e., have no missing values. Args: a sequence of vectors, matrices and data frames

#_________Use same data as above
df1 <- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
df2 <- data.frame(x = c(24,NA, -12), y = c(NA, 4, "z"))
df1

   x    y
1  1    a
2  2 <NA>
3 NA    b

df2

    x    y
1  24 <NA>
2  NA    4
3 -12    z

#__Let's look at what we get for each if we run complete.cases() 
ok1 <- complete.cases(df1)
ok2 <- complete.cases(df2)

#__________So in df1 only 1 complete case, and same in df2
ok1 + ok2

[1] 1 0 1

#____________if we bind them we get 
bind_them <- rbind(ok1,ok2)
bind_them

     [,1]  [,2]  [,3]
ok1  TRUE FALSE FALSE
ok2 FALSE FALSE  TRUE

We only want rows that are not missing values?
We did it above with na.omit can we do it with complete cases?
So let’s go back to the beginning and bind df1 & df2 into boundall, then
We use df[complete.case(df), ] that tells R we want to scan through the complete.cases rows and all columns

 boundall <- rbind(df1,df2)
boundall

    x    y
1   1    a
2   2 <NA>
3  NA    b
4  24 <NA>
5  NA    4
6 -12    z

clean_bind <- boundall[complete.cases(boundall), ]
clean_bind

    x y
1   1 a
6 -12 z

na.fail

Returns the object if it does not contain any missing values, and signals an error otherwise. Just as na.omit returns the object with incomplete cases removed. na.pass returns the object unchanged.