Filter - Subset


This page covers some of the most used functions in R when you are preparing, cleaning, processing, or wrangling your data (whichever verb you prefer).

A function appearing in one paragraph or subsection doesn’t exclude it from belonging in another. The headings are a general way of grouping these functions; most can be used interchangeably, but some cannot.

Many other functions that can be used to filter, extract, or subset your data can be found in the posts following this one.

Filter


library(plyr)   # for match_df; load plyr before dplyr so it doesn't mask dplyr's verbs
library(dplyr)

Selecting a smaller set from a larger set is considered filtering; it is also considered subsetting, which we’ll cover later on this page.

filter

Filtering your data before you plot allows you to focus on specific subsets of your data and gain more targeted insights. To do this, just pipe your data through dplyr’s filter() before handing it to ggplot():

data |> filter(variable1 == "DS") |> 
        ggplot()
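For a self-contained sketch, here is the same pattern on dplyr’s built-in starwars data (the variable and value above are stand-ins for your own):

library(dplyr)
library(ggplot2)

starwars |> 
        filter(species == "Human") |> 
        ggplot(aes(x = height, y = mass)) +
        geom_point()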

multiple conditions

Here is an example where we filter using multiple conditions:

onlineta_city_hotels <- filter(hotel_bookings,
                               hotel == "City Hotel" &
                               market_segment == "Online TA")

#______________________________can be written this way as well
onlineta_city_hotels_v2 <- hotel_bookings |> 
        filter(hotel=="City Hotel") |> 
        filter(market_segment=="Online TA")
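Note that filter() also accepts multiple conditions separated by commas, which it combines with &:

onlineta_city_hotels_v3 <- hotel_bookings |> 
        filter(hotel == "City Hotel", market_segment == "Online TA")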

filter & mean

Here we filter rows on two conditions, then calculate the mean:

cond_all <- hw1 |>
        filter(Ozone > 31 & Temp > 90)
mean(cond_all$Solar.R)
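If Solar.R happens to contain NAs in your copy of the data, mean() will return NA; the usual remedy is na.rm:

mean(cond_all$Solar.R, na.rm = TRUE)   # drop NAs before averaging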

filter & subset

Create a subset for only the state given by the user:

stateonly <- filter(datain, State == state)
# note: state is the argument passed into the function; subset(), covered later,
# does the same job
stateonly <- subset(datain, State == state)

multiple col

Let’s trim the stateonly subset down to State, Hospital Name, and the outcome column chosen by the user (usethis holds that column’s name, and all_of() selects columns named in a character vector):

subdata <- select(stateonly, c("State", "Hospital.Name", all_of(usethis)))  

delete rows

I will not delete any rows; instead I’ll create a new df excluding the rows with length <= 0. If you want to go further and add another logical condition with OR (|) for station names equal to "HQ QR", be my guest. I won’t, though: I’ll check the new df afterwards for station names == "HQ QR" and deal with them then if they exist, since including that condition in this step might cause results I am not ready for yet. We know for sure we only have 117 + 93 = 210 such rows, so let’s focus on that now.

all_trips19_20_v2 <- all_trips19_20[all_trips19_20$ride_length > 0, ]
dim(all_trips19_20_v2) 

delete multiple vars

  • Here we’ll remove rows when conditions from multiple columns are met
  • We use dim to see how many rows were deleted from the original df
all_trips19_20_v3 <- all_trips19_20_v2[
        !(all_trips19_20_v2$start_station_name == "HQ QR" |
          all_trips19_20_v2$end_station_name == "HQ QR"), ]
dim(all_trips19_20_v3)
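The same deletion can be written with dplyr’s filter(), which reads a little more cleanly:

all_trips19_20_v3 <- all_trips19_20_v2 |> 
        filter(!(start_station_name == "HQ QR" | end_station_name == "HQ QR"))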

exclude

What if we want all elements except a certain one or ones? We use the same indexing as above, with a minus sign (-):

  • Having the (-) sign outside or inside the brackets will work
x[-c(2,10)]
x[c(-2,-10)]

extract

  • We could use filter to create a new df based on the conditions we want met, which in essence is extracting.
  • We can also use slice()

top N - slice

We can choose the top N values by processing our calculations: group, arrange in whatever order we want, then slice out the part we want (we can even take out a middle piece by slicing a middle range, e.g. slice(6:10)). If we want to extract the 5 stations with the highest number of rides, the top 5 so to speak, there are other ways to do this, but here we get to learn all the steps together:

  • Group by station
  • summarise() to show the count of rides per station
  • Arrange the result in descending order
  • Slice out the top N (5 in this case)

top_stations <- trips_19_20 |> 
        group_by(start_station_id) |> 
        summarise(number_of_rides =n()) |> 
        arrange(desc(number_of_rides)) 
top_5 <- top_stations |> 
        slice(1:5)
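dplyr also has slice helpers that fold the arranging in for you; slice_max() takes the top N by a variable directly (slice_min() is its mirror):

top_5 <- top_stations |> 
        slice_max(number_of_rides, n = 5)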

select

  • Let’s say you are primarily interested in the following variables: ‘hotel’, ‘is_canceled’, and ‘lead_time’
  • Use select(..,..,..) and list all the wanted column names
  • Create a new data frame with just those columns, calling it `trimmed_df` by adding the variable names with bookings_df being the original data frame
trimmed_df <- bookings_df |> 
        select(hotel, is_canceled, lead_time)
        
oz <- hw1 |>      #we have 153 rows and we just select one column 
        select(Ozone)

select(-c())

  • We can use select() with -c(), the negation of select, to choose all columns except those listed inside -c(…)
  • You can see this function at work in the drop column and Subset sections, and in the quick sketch below
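A quick sketch (df and its columns a, b, c, d are hypothetical stand-ins):

df |> 
        select(-c(b, d))   # keeps columns a and c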

Subset


To extract a subset of objects in R, there are several operators you can use. We’ll see how to extract elements from a vector based on some conditions that we specify:

  • For example, we may only be interested in the first 20 elements of a vector, only the elements that are not NA, only those that are positive, or only those that correspond to a specific variable of interest.
  • x has been created and contains a random ordering of 20 numbers (from a standard normal distribution) and 20 NAs.

  • See the select section above as well. When the list of wanted columns is too long to spell out in select(), one easy way is to create a subset using a - to exclude the columns we DON’T WANT.
  • The code below creates an entire subset EXCLUDING the 6 columns listed inside the -c()
  • Or you can use the [] selection to do the same thing as you see in the code chunk next

multiple columns

trim_trips19_20 <- subset(trips19_20, select = -c(start_lat, start_lng, end_lat, end_lng, gender, birthyear)) 

# OR
outlist <- almostlist[c("State","Hospital.Name.y")] 

subset = filter

  • Create a subset for only the state given by the user, this time using subset()
stateonly <- filter(datain, State == state)

#   OR
stateonly <- subset(datain, State == state)

negation

  • We can subset data by excluding rows that contain an undesirable value
  • In this case we don’t want rows that have the string value “Not Available”
gooddata <- subset(subdata, subdata[[3]] != "Not Available")   # [[3]] extracts the column as a vector

subset & as

  • Here is a use of as.character() together with subset().
  • The second line starts with the inner subset(): using r as the data, we pull the row where the column State equals “HI”.
  • That row is now a subset with however many columns r has.
  • From that subset we pull the value in the Hospital column.
  • Then we output that value as character.
r <- rankall("heart attack", 4)
as.character(subset(r, State == "HI")$Hospital)

complete.cases

Return a logical vector indicating which cases are complete, i.e., have no missing values.

x<-c(1,2,NA,4,NA,5)
y<- c("a","b",NA,"d",NA,"f")
good<- complete.cases(x,y)
good
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
x[good]
[1] 1 2 4 5
y[good]
[1] "a" "b" "d" "f"
x<-c(1,2,NA,4,NA,5)
y<- c("a","b",NA,"d",NA,NA)
good<- complete.cases(x,y)
good
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
x[good]
[1] 1 2 4
y[good]
[1] "a" "b" "d"

drop column

This is the same idea as the subset example listed above; you can use -c() inside select() to drop the unwanted column(s):

trim_trips19_20 <- trips19_20 |> 
        select(-c(start_lat, start_lng, end_lat, end_lng, gender, birthyear))

Here is another example to drop one column and use the same dataset name:

all_trips19_20 <- all_trips19_20 |> 
        select(-c(tripduration))

extract rows

  • So let’s create a new df with just those rows where the ride_length < 0
neg_length <- all_trips19_20[all_trips19_20$ride_length < 0, ]

extract certain values

zero_length <- all_trips19_20[all_trips19_20$ride_length == 0, ]

in

If we want to flag the restaurants that are in two neighborhoods, Roland Park and Homeland, and assign the result to the variable rest$nearme:

rest$nearme = rest$neighborhood %in% c("Roland Park", "Homeland")
table(rest$nearme) 
FALSE     TRUE
1314       13

Another example

cnt0 <- subset(named_pm0, State.Code == 36 & county.site %in% both)
cnt1 <- subset(named_pm1, State.Code == 36 & county.site %in% both)

across

if_any

if_all

R code in dplyr verbs is generally evaluated once per group. Inside across(), however, code is evaluated once for each combination of columns and groups. The signatures:

across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)
if_any(.cols, .fns, ..., .names = NULL)
if_all(.cols, .fns, ..., .names = NULL)

gdf <-
  tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23) |> 
  group_by(g)

set.seed(1)

# Outside: 1 normal variate
n <- rnorm(1)
gdf |>  mutate(across(v1:v2, ~ .x + n))
# A tibble: 4 × 3
# Groups:   g [3]
      g    v1    v2
  <dbl> <dbl> <dbl>
1     1  9.37  19.4
2     1 10.4   20.4
3     2 11.4   21.4
4     3 12.4   22.4
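The if_any() and if_all() headings above deserve a quick example. A hedged sketch reusing gdf from above: if_any() keeps rows where the condition holds for at least one selected column, if_all() where it holds for all of them.

gdf |> filter(if_any(v1:v2, ~ .x > 21))   # rows where v1 OR v2 exceeds 21
gdf |> filter(if_all(v1:v2, ~ .x > 11))   # rows where v1 AND v2 exceed 11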

Binary variables

If I want to count how many observations meet a logical condition, I can get the results as TRUE/FALSE values.

ifelse

  • I can create a binary variable with ifelse()
  • What I get is a table of whether each zip code is < 0
  • And a count of how many zip code values are wrong

restData$zipWrong = ifelse(restData$zipCode < 0, TRUE, FALSE)
table(restData$zipWrong, restData$zipCode < 0)
        FALSE    TRUE
FALSE    1326     0 
TRUE       0      1

Categorical variables

cut

  • I can break up a numeric variable using cut(), which breaks it at a list of values I provide.
  • Let’s say I want to break down the zip codes by quartiles and assign the result to zipGroups
  • group 1: 0-25%, group 2: 25-50%, and so on. It turns a quantitative variable into a categorical variable
restData$zipGroups = cut(restData$zipCode, breaks = quantile(restData$zipCode))
table(restData$zipGroups)

(-2.123e+04, 2.12e+04]  (2.12e+04, 2.122e+04]  (2.122e+04, 2.123e+04]  (2.123e+04, 2.129e+04]
          337                    375                    282                      332
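One caveat worth knowing about cut() with quantile breaks: the intervals are open on the left, so the minimum zip code falls outside the first interval and becomes NA (notice the four counts above sum to 1326, one short of the 1327 restaurants in the earlier table). Adding include.lowest = TRUE fixes that:

restData$zipGroups = cut(restData$zipCode,
                         breaks = quantile(restData$zipCode),
                         include.lowest = TRUE)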

table

table() will use the first argument as the rows and the second as the columns, and it’ll break them down that way:

table(restData$zipGroups,restData$zipCode)

              -21226        21201   21205   21206    21207 ....
(-2.123e+04, 2.12e+04]      0         136     201      0       0    ...
(2.12e+04, 2.122e+04]       0           0       0      27      30    ..
(2.122e+04, 2.123e+04]     ....
(2.123e+04, 2.129e+04]    ...

cut2

An easier way to do that, without specifying all the cuts manually like I did with cut(), is to use cut2() from the Hmisc package:

  • If you go into the Hmisc package and use the cut2 function,
  • you can specify that you want to break the zip code up into four different groups, and
  • that you want to break them up according to the quantiles.
  • It will find the quantiles for you and break the data into four groups accordingly.

library(Hmisc)
restData$zipGroups = cut2(restData$zipCode, g=4)
table(restData$zipGroups)

(-21226,21205)    (21205,21220)     (21220,21227)   (21227,21287)
      338              375               300            314

Create Factor Variables

factor

  • You might want to create factor variables. In other words, it’s not clear that incrementing the zip code by one means anything quantitatively.
  • So you might want to turn it into a factor variable, and the way you do that is with the factor() command
  • It takes the integer variable as input and turns it into a factor variable
  • So if I look at the first ten values of zcf, they look just like they did before
  • It tells me how many different zip code levels there are
  • There are 32 different zip code levels, and if I look at the class of this variable it’s a factor variable
restData$zcf <- factor(restData$zipCode)
restData$zcf[1:10] 
[1] 21206  21231  21224  21223  21228  21205  21211  21205  21231
32 levels: -21226  21201  21202  21205  21206  21207  21208  21209....
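To confirm the last two bullets:

class(restData$zcf)
[1] "factor"
nlevels(restData$zcf)
[1] 32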

relevel

A couple of other things that you might want to know about factor variables:

  • Create a dummy vector so that we can see some other properties of factor variables
  • I’m going to randomly create a size-ten vector of yeses and nos
  • Then I’ll turn that into a factor variable; by default it treats the value that comes first alphabetically ("no") as the first level
  • Suppose I want to treat the "yes" values as the lowest value instead
  • Then I can relevel that variable and make the reference level "yes"
  • Then I’ll change the type; remember it starts with the lowest level and calls it 1, the next level 2, and so on

yesno <- sample(c("yes", "no"), size=10, replace=TRUE)
yesnofac = factor(yesno, levels = c("yes", "no"))
relevel(yesnofac, ref = "yes")
 [1] yes yes no  yes yes yes no  no  yes yes
Levels: yes no
as.numeric(yesnofac)
 [1] 1 1 2 1 1 1 2 2 1 1

Check values

any==0

any(all_trips19_20$ride_length == 0)

any<0

any(all_trips19_20$ride_length < 0)

all

all(all_trips19_20$ride_length > 0)

is.na is.nan

Note that is.na() treats NaN as missing too, so is.na() finds both NA and NaN values, while is.nan() only finds NaN values.

any(is.na(all_trips19_20$ride_length))

!is.na

any(!is.na(all_trips19_20$ride_length))

which

Another way to omit NA values, or to apply a logical condition, is to use which(). It returns the indices where the condition is TRUE; comparisons that evaluate to NA are not TRUE, so which() drops them, unlike a plain logical subset, which keeps NA rows. Below:

  • We have a df
  • In the first block we filter the df for rows where var3 > 11
  • We can do the same to filter out NA rows, using the logical comparison var2 > 0

x[which(x$var3 > 11), ]
x[which(x$var2 > 0), ]

any combination

any(!is.na(all_trips19_20$ride_length) & all_trips19_20$ride_length < 0)
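A simpler equivalent is to let any() drop the NAs itself with na.rm:

any(all_trips19_20$ride_length < 0, na.rm = TRUE)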

check_for_negative_values

If we want to know which rows have a negative value in a specific column, one option is to install another package, “fixr”.

Its check_for_negative_values() function checks if a data frame (or a single column) contains negative values and returns their indices if any are found.

install.packages("fixr")
library(fixr)
check_for_negative_values(all_trips19_20$ride_length)

Data frame contains negative values.
[1] 520402 648667 651963 652290 653845 655288 655654 656362 657167 658842 660684 663630
[19] 669752 678262 678499 678889 679182 679252 680301 680757 682576 682662 682853 684254
[37] 694476 694635 696693 697774 698938 699115 699204 699450 699619 699855 702399 702914
[55] 714612 714687 714761 715171 715474 719275 719680 725559 726496 730499 731284 731789
[73] 742697 743604 743619 745898 747194 747233 748686 754006 754853 755032 755536 756753
[91] 763358 766631 767985 768102 768693 768833 772284 772868 774788 775777 776578 777009
[109] 779170 779496 781519 781699 786503 786687 788671 788946 789252

identify ALL <0 values

Let’s see if any other NEGATIVE values exist in the entire dataset. The output displays all the rows where a NEGATIVE value appears in a specific column. As you can see we have 117 rows, and the only column with negative values is column 10 which happens to be the ride_length column.

check_for_negative_values(all_trips19_20)

Data frame contains negative values.
row col
[1,] 520402  10
[2,] 648667  10
[3,] 651963  10 
[4,] 652290  10 ...
[115,] 788671  10 
[116,] 788946  10 
[117,] 789252  10

Identical


identical() tests whether two objects are exactly equal (vect and vect2 are created in the names section below):

identical(vect, vect2)
[1] TRUE

unique

unique() returns a vector with all duplicate elements removed. This mirrors SELECT DISTINCT when viewing a data set for the first time in SQL, and comes up when we want to group a data set by IDs or some unique variable.

unique(c(3, 4, 5, 5, 5, 6, 6))
[1] 3 4 5 6

sapply

We can count how many unique values each column has by applying length to each element of a list with sapply (the output below is from the swirl flags dataset):
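Here unique_vals is assumed to hold one vector of unique values per column; a minimal sketch of how it might have been built:

unique_vals <- lapply(flags, unique)   # list: unique values of each column of flags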

sapply(unique_vals,length)
name   landmass       zone       area population   language   religion       bars   
194          6          4        136         48         10          8          5    
blue       gold      white      black     orange    mainhue    circles    crosses   
2          2          2          2          2          8          4          3         
triangle       icon    animate       text    topleft   botright  
2          2          2          2          7          8 

index vectors

x[1:10]
 [1]  1  2 NA  4 NA  5 NA NA NA NA

Index vectors come in four different flavors – logical vectors, vectors of positive integers, vectors of negative integers, and vectors of character strings

na

One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e. missing data). Recall that is.na(x) yields a vector of logical values the same length as x, with TRUEs corresponding to NA values in x and FALSEs corresponding to non-NA values in x. 

So if we want to find all the NA values in a vector we can use x[is.na(x)], and we can extract all the non-NA values as well by creating a vector y:

x[is.na(x)]
[1] NA NA
y <- x[!is.na(x)]
y
[1] 1 2 4 5

sub positive values

y[y>0]
[1] 1 2 4 5
x[x>0]
[1]  1  2 NA  4 NA  5
x[!is.na(x) & x>0 ]
[1] 1 2 4 5

zero-based indexing

Many programming languages use what’s called ‘zero-based indexing’, which means that the first element of a vector is considered element 0.

R uses ‘one-based indexing’, which (you guessed it!) means the first element of a vector is considered element 1.

So how will we subset the 3rd, 5th, and 7th elements of x? Both approaches below work, but the second one is the idiomatic way to go.

c(x[3],x[5],x[7])
[1] NA NA NA
x[c(3,5,7)] 
[1] NA NA NA
#______________________what happens when we ask for out-of-range values? let's see
x[0]
numeric(0)

exclude

What if we want all elements except a certain one or ones? We use the same approach as above, with a minus sign (-):

#________________________both of these will do the trick
x[-c(2,10)]
[1]  1 NA  4 NA  5
x[c(-2,-10)]
[1]  1 NA  4 NA  5

names

# Create a numeric vector with three named elements
vect <- c(foo = 11, bar = 2, norf = NA)
vect 
 foo  bar norf 
  11    2   NA 
# To get the names of the vector we can pass vect as an argument to names()
names(vect)
[1] "foo"  "bar"  "norf"
# We can reverse the process if we want to name a vector after the fact 
vect2 <- c(11,2,NA)
names(vect2) <- c("foo","bar","norf")

# To subset the second element we can use
vect["bar"]
bar 
  2 
# To subset a vector instead of just one element we can do 
vect[c("foo","bar")]
foo bar 
 11   2 

[ ]

Single brackets always return an object of the same class as the original. So if you are extracting from a vector you’ll get a vector; if you subset a list you get a list. They can be used to select more than one element of an object; in fact, only [ ] can be used to extract multiple elements of a list.

x <- c("a", "b", "c", "c", "d", "a")       #subsetting a vector
x[1]           #let's use a numeric index
[1] "a"
x[2]
[1] "b"
x[1:4]
[1] "a" "b" "c" "c"
x[x>"a"]      #let's use a logical index
[1] "b" "c" "c" "d"
u <- x>"a"
u
[1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
x[u] 
[1] "b" "c" "c" "d"

multiple elements

x<- list(foo=1:4, bar=0.6, baz="shaya")
x[c(1,3)]
$foo
[1] 1 2 3 4

$baz
[1] "shaya"

matrix

x<- matrix(1:6,2,3)
x 
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
x[1,2]
[1] 3
x[1,2, drop=FALSE]   # drop=FALSE keeps the matrix class instead of returning a vector
     [,1]
[1,]    3
x[1,]
[1] 1 3 5

[[ ]]

Double brackets are used to extract a SINGLE element from a list or a data frame, and the returned object will not necessarily be a list or a data frame. Let’s subset a list whose first element, named foo, is the sequence from 1 to 4, and whose second element, named bar, is the number 0.6.

x<- list(foo = 1:4, bar = 0.6)
x[1]
$foo
[1] 1 2 3 4
x[2]                 # single brackets still return a list
$bar
[1] 0.6
x[[1]]              # with [[ ]] we get back just the sequence 1 to 4
[1] 1 2 3 4
x[[2]]
[1] 0.6

calculated name

[[ ]] has to be used to extract an element whose name is calculated or stored in a variable. Below I assign “foo” to name, then use name to extract the subset; $ can’t do this, because x$name looks for an element literally called “name”.

x<- list(foo=1:4, bar=0.6, baz="shaya")
name <- "foo"
x[[name]] 
[1] 1 2 3 4
x$name
NULL
x$foo
[1] 1 2 3 4

nested elements

x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
x[[c(1,3)]]          # third element of the first element
[1] 14
x[[1]][[3]]          # same extraction written in two steps
[1] 14
x[[c(2,1)]]          # first element of the second element
[1] 3.14
x[[2]][[1]]
[1] 3.14

$

$ is used like [[ ]], but extracts an element by name.

x<- list(foo = 1:4, bar = 0.6)
x$bar
[1] 0.6
x$foo
[1] 1 2 3 4

partial matching

You can use $ for partial matching because $ matches names by their beginning, and if only one element matches it will work. [[ ]], on the other hand, looks for an exact match by default, so it will not work, but we can allow partial matching by setting exact = FALSE.

x<- list(aamresedd = 1:5)
x
$aamresedd
[1] 1 2 3 4 5
x$a
[1] 1 2 3 4 5
x[["a"]]
NULL
x[["a", exact = FALSE]]
[1] 1 2 3 4 5

removing missing values

We just create a logical vector that tells us where the missing values are, then we exclude them:

x<-c(1,2,NA,4,NA,5)
bad <- is.na(x)

x[!bad]
[1] 1 2 4 5

which

Another way to omit NA values is to use which(). As noted in the which section above, it returns the indices where the condition is TRUE, and comparisons that evaluate to NA are dropped rather than kept as NA rows. Below:

  • We have a df
  • In the first block we filter the df for rows where var3 > 11
  • We can do the same to filter out NA rows, using the logical comparison var2 > 0

x[which(x$var3>11),]

x[which(x$var2>0),]

Match_df


set.seed(13435)
x <- data.frame("var1"= c(1,3,4,9,1,17,4), "var2"=sample(6:12), "var3"=sample(11:17))
x
  var1 var2 var3
1    1   12   13
2    3   11   12
3    4    8   15
4    9    6   11
5    1   10   14
6   17    9   16
7    4    7   17
y <- data.frame("var1"= c(1,2,3,4,5), "var3"=sample(8:12), "var5"=sample(13:17))
y
  var1 var3 var5
1    1   12   17
2    2   11   16
3    3    8   13
4    4    9   15
5    5   10   14
w <- data.frame("var4"= c(1,2,7,41), "var6"=sample(12:15))
z <- data.frame("var1"= c(1,12,17,4), "var2"=sample(3:6))

match_df() is a great function to use if you want to extract the rows in one df that share a common value with some rows in another df.

It is very similar to a left join / common merge(x, y, by = "var") from above, without the all.x parameter being used.

For example, the code below will give us:

  • All the rows in x where x$var1 is found in y$var1

  • match_df does not merge or bind any columns from df=y

  • It only searches y$var1 looking for a match to the value of x$var1 in the row it is on

  • If the var1 value is a match, it extracts that specific row from df=x

  • It does this in order from top to bottom and appends to the result df in that same order

  • Unlike merge, which brings over the extra columns from df=y and arranges the merged df in order of x$var1

matched <- match_df(x, y, on = "var1")
matched
  var1 var2 var3
1    1   12   13
2    3   11   12
3    4    8   15
5    1   10   14
7    4    7   17

Here is another example

matched_coal <- match_df(NEI, coal_filter, on="SCC")
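If you would rather stay in dplyr, semi_join() does the same filtering-by-match: it returns the rows of x that have a match in y and brings over no columns from y.

matched <- semi_join(x, y, by = "var1")
matched_coal <- semi_join(NEI, coal_filter, by = "SCC")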

Skimr


skim_without_charts()

Some packages contain more advanced functions for summarizing and exploring your data. One example is the `skimr` package, which has a number of functions for this purpose. For example, the `skim_without_charts()` function provides a detailed summary of the data.

skim_without_charts(df)   # replace df with the name of your data frame
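A runnable sketch on a built-in dataset:

library(skimr)
skim_without_charts(iris)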

Dplyr


“The dplyr philosophy is to have small functions that each do one thing well.” Specifically, dplyr supplies five ‘verbs’ that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().
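As a closing sketch, here are all five verbs chained on the built-in mtcars data (column choices are just for illustration):

library(dplyr)

mtcars |>
        select(mpg, cyl, wt) |>            # keep three columns
        filter(cyl > 4) |>                 # drop the 4-cylinder cars
        mutate(wt_tons = wt / 2) |>        # wt is in 1000s of lbs; halve for short tons
        arrange(desc(mpg)) |>              # highest mpg first
        summarize(avg_mpg = mean(mpg))     # collapse to a single summary row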