library(tidyr)
Missing Values - NA
packages
NA/NULL
One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e. missing data).
Recall that is.na(x) yields a vector of logical values the same length as x
- with TRUEs corresponding to NA values in x
- FALSEs corresponding to non-NA values in x
Data
Create vector
If you remember how to create a vector
<- c(44, NA, 5, NA)
x * 3 x
[1] 132 NA 15 NA
rnorm
The random distribution funtion used to create a sample. rnorm and sample are used here to generate data, will explain them in another post.
#lets create a vector containing 1000 draws from a standard normal distribution with
<- rnorm(1000)
y
# let's create a vector containing 1000 NAs with
<- rep(NA, 1000)
z
#let's select 100 elements at random from these 2000 values
<- sample(c(y, z), 100)
my_data my_data
[1] NA NA 2.76552461 NA NA 1.21716498
[7] 0.80532026 NA NA -0.28116520 1.53568712 2.76295041
[13] NA -0.25878755 2.36316431 0.54279804 -1.10242431 NA
[19] NA NA NA NA NA -1.06500181
[25] 1.75205988 NA -1.72808148 1.16218401 NA NA
[31] NA NA NA 0.24552302 -0.14939510 -0.52561624
[37] 0.37860627 1.64047423 NA 1.43886132 -0.60944337 -1.23574959
[43] 0.94679565 NA NA 2.52863837 -0.31885970 NA
[49] NA -0.20846389 -0.21922585 NA NA NA
[55] -0.60070487 -0.06217002 NA NA -0.09281883 0.82695007
[61] NA NA 1.63159043 -0.09032553 1.29332147 1.46937981
[67] 2.04020180 -1.05453297 -0.09270116 NA NA NA
[73] 1.82470402 NA NA NA NA NA
[79] NA 0.38708450 2.66791134 NA NA NA
[85] NA NA NA NA NA 0.42807450
[91] NA 0.40769939 -0.05880791 -0.71321360 0.68451529 NA
[97] 1.19420064 NA -0.97583876 0.78379838
Look for NA
is.na
Where are our NAs located in our data? The is.na() function tells us whether each element of a vector is NA
<- is.na(my_data)
my_na my_na
[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[13] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
[25] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
[37] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
[49] TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
[61] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[73] FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
[97] FALSE TRUE FALSE FALSE
count
- Another way of counting how many rows have NA values is by using the sum() function. Since TRUE and FALSE are represented by 1 and 0 using the sum(of true values) would give us the count.
- Underneath the surface, R represents TRUE as the number 1 and FALSE as the number 0.
- Therefore, if we take the sum of a bunch of TRUEs and FALSEs, we get the total number of TRUEs or the COUNT of NA or non-NA values/rows based on how it’s used
sum(my_na)
[1] 51
Extract
!is.na
- Just as we learned to use is.na we can use !is.na which is the negativity of it to extract what IS NOT NA from the sample and store it in non_na
<- my_data[!is.na(my_data)]
non_na non_na
[1] 2.76552461 1.21716498 0.80532026 -0.28116520 1.53568712 2.76295041
[7] -0.25878755 2.36316431 0.54279804 -1.10242431 -1.06500181 1.75205988
[13] -1.72808148 1.16218401 0.24552302 -0.14939510 -0.52561624 0.37860627
[19] 1.64047423 1.43886132 -0.60944337 -1.23574959 0.94679565 2.52863837
[25] -0.31885970 -0.20846389 -0.21922585 -0.60070487 -0.06217002 -0.09281883
[31] 0.82695007 1.63159043 -0.09032553 1.29332147 1.46937981 2.04020180
[37] -1.05453297 -0.09270116 1.82470402 0.38708450 2.66791134 0.42807450
[43] 0.40769939 -0.05880791 -0.71321360 0.68451529 1.19420064 -0.97583876
[49] 0.78379838
Remove
As part of tidyr: “drop_na() drops rows where any column specified by … contains a missing value” ### drop_na
drop_na(data, ...)
- data is data frame
- … Columns to inspect for missing values. If empty, all columns are used.
- In other words it only keeps the COMPLETE ROWS (where no rows contain missing values)
- Similar in a sense to complete.cases()
library(tidyr)
<- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
df #__________This will drop NA from all columns
<- df |> drop_na()
df1 df1
x y
1 1 a
#__________To remove NA from column x
<- df |> drop_na(x)
df2 df2
x y
1 1 a
2 2 <NA>
You have to be cautious as to when to use an operation on df if you are using any of the functions that remove NA. For example we might want to calculate the mean of column (x). In that case we don’t use drop_na() because it will remove rows from x that have values. See example below in na.omit()
na.omit()
If df is oz, self explanatory really, this function just omits all rows that have NA as value. So similar to drop_na it omits the entire rows that have at least one NA
oz <- na.omit(oz)
Here is another way to use na.omit(), inside another function.
- Let’s say we want to bind two dfs together and then
- we want to calculate the mean of column (x)
- If my dataset has millions of rows, I prefer to remove the NA from the column I will use to calculate the mean on before I bind them together.
- If you have a few thousands rows and a couple columns you might want to consider the time element of running the operation is all I’m saying.
# Let's say we have 2 dfs and want to calculate the sum() of column x, we also need to remove NAs from it prior
<- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
df1 <- data.frame(x = c(24,NA, -12), y = c(NA, 4, "z"))
df2 df1
x y
1 1 a
2 2 <NA>
3 NA b
df2
x y
1 24 <NA>
2 NA 4
3 -12 z
# Let's rbind the two while we remove NA values and take the sum of all what's left in x
<- na.omit(rbind(df1,df2))
bound bound
x y
1 1 a
6 -12 z
sum(bound$x)
[1] -11
Look in theHow To page for a real project using both na.omit() [mean function example] and complete.cases() in the [complete case example] ### na.rm = TRUE
This once again is another way to remove NA from your data. This could be used inside another function, just like you can use na.omit() inside another function.. Let’s say we want to get the sum() or mean() or whatever of column (x), omitting NA values in that specific column ONLY, column (x). Here is a short way to do it. As you see the value is not -11.
#___________Same dfs as before, we'll rbind them first, then we'll sum column(x) after na.rm=TRUE is executed
<- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
df1 <- data.frame(x = c(24,NA, -12), y = c(NA, 4, "z"))
df2 <- rbind(df1,df2)
boundall boundall
x y
1 1 a
2 2 <NA>
3 NA b
4 24 <NA>
5 NA 4
6 -12 z
# see the value of boundall compared to bound above
sum(boundall$x, na.rm = TRUE)
[1] 15
complete.cases
complete.cases(…) returns a logical vector indicating which cases are complete, i.e., have no missing values. Args: a sequence of vectors, matrices and data frames
#_________Use same data as above
<- data.frame(x = c(1,2, NA), y = c("a", NA, "b"))
df1 <- data.frame(x = c(24,NA, -12), y = c(NA, 4, "z"))
df2 df1
x y
1 1 a
2 2 <NA>
3 NA b
df2
x y
1 24 <NA>
2 NA 4
3 -12 z
#__Let's look at what we get for each if we run complete.cases()
<- complete.cases(df1)
ok1 <- complete.cases(df2)
ok2
#__________So in df1 only 1 complete case, and same in df2
+ ok2 ok1
[1] 1 0 1
#____________if we bind them we get
<- rbind(ok1,ok2)
bind_them bind_them
[,1] [,2] [,3]
ok1 TRUE FALSE FALSE
ok2 FALSE FALSE TRUE
- We only want rows that are not missing values?
- We did it above with na.omit can we do it with complete cases?
- So let’s go back to the beginning and bind df1 & df2 into boundall, then
- We use df[complete.case(df), ] that tells R we want to scan through the complete.cases rows and all columns
<- rbind(df1,df2)
boundall boundall
x y
1 1 a
2 2 <NA>
3 NA b
4 24 <NA>
5 NA 4
6 -12 z
<- boundall[complete.cases(boundall), ]
clean_bind clean_bind
x y
1 1 a
6 -12 z
na.fail
Returns the object if it does not contain any missing values, and signals an error otherwise. Just as na.omit returns the object with incomplete cases removed. na.pass returns the object unchanged.