library(plyr) # for match_df; load plyr before dplyr so plyr doesn't mask dplyr's verbs
library(dplyr)
Filter - Subset
This page covers some of the most used operations in R when you are preparing, cleaning, processing, or wrangling your data (whichever verb you prefer).
Just because a function appears in one paragraph or subsection doesn't exclude it from appearing in another. The headings are a general way of grouping these functions; most can be used interchangeably, but some cannot.
Many other functions that can be used to filter, extract, or subset your data can be found in the other posts following this one.
Filter
Selecting a smaller set from a larger set is considered filtering; it is also considered subsetting, which we'll cover later in the page.
filter
Filtering your data before you plot allows you to focus on specific subsets of your data and gain more targeted insights. To do this, just include the dplyr filter() function in your ggplot syntax.
data |>
  filter(variable1 == "DS") |>
  ggplot()
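For instance, here is a minimal runnable sketch using the built-in mtcars dataset (the dataset, columns, and plot choices are stand-ins for your own):
library(ggplot2)
# filter first, then hand the smaller data frame to ggplot
mtcars |>
  filter(cyl == 4) |>
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()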
multiple conditions
Here is an example where we filter using multiple conditions:
onlineta_city_hotels <- filter(hotel_bookings,
  hotel == "City Hotel" &
  market_segment == "Online TA")
#______________________________can be written this way as well
onlineta_city_hotels_v2 <- hotel_bookings |>
  filter(hotel == "City Hotel") |>
  filter(market_segment == "Online TA")
filter & mean
Here we keep the rows that meet both conditions, then calculate the mean:
cond_all <- hw1 |>
  filter(Ozone > 31 & Temp > 90)
mean(cond_all$Solar.R)
filter & subset
Create a subset for only the state given by the user:
stateonly <- filter(datain, State == state)
# note: state is the argument passed into the function; subset(), which we'll cover later
stateonly <- subset(datain, State == state)
multiple col
Let's trim the stateonly subset to include State, Hospital Name, and the outcome column given by the user:
subdata <- select(stateonly, c("State", "Hospital.Name", all_of(usethis)))
delete rows
I will not delete any rows; instead I'll create a new df excluding the rows with ride_length <= 0. If you want to go further and include another logical condition (OR, |) for station names == "HQ QR", be my guest. I will not do it; instead I will check the new df afterwards for station names == "HQ QR" and deal with them at that time if they exist. I believe including that condition in this step might cause results I am not ready for yet. We know for sure we only have 117 + 93 = 210 rows to drop, so let's focus on that now.
all_trips19_20_v2 <- all_trips19_20[all_trips19_20$ride_length > 0, ]
dim(all_trips19_20_v2)
delete multiple vars
- Here we’ll remove rows when conditions from multiple columns are met
- We use dim to see how many rows were deleted from the original df
all_trips19_20_v3 <- all_trips19_20_v2[
  !(all_trips19_20_v2$start_station_name == "HQ QR" |
    all_trips19_20_v2$end_station_name == "HQ QR"), ]
dim(all_trips19_20_v3)
exclude
What if we want all elements except a certain one or more? We use the same bracket indexing as above with the minus sign (-).
- Having the (-) sign outside or inside c() will work
x[-c(2, 10)]
x[c(-2, -10)]
extract
- We could use filter to create a new df based on the conditions we want met, which in essence is extracting
- We can also use slice
top N - slice
We can choose the top N values by processing our calculations, grouping, arranging in whatever order we want, then slicing out the part we want. We can actually take out a middle piece too (look into offset, or see the sketch after this code).
- If we want to extract the 5 stations with the highest number of rides - the top 5 so to speak
- There are other ways to do this, but here we get to learn all the steps together
- We can group_by() the station
- summarise() to show the count of rides per station
- arrange() the result in descending order
- slice() out the top N (5 in this case)
top_stations <- trips_19_20 |>
  group_by(start_station_id) |>
  summarise(number_of_rides = n()) |>
  arrange(desc(number_of_rides))

top_5 <- top_stations |>
  slice(1:5)
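As a hedged aside, newer dplyr versions (1.0+) ship slice_max(), which collapses the arrange-then-slice steps into one, and slice() can also pull a middle piece:
# one-step alternative: keep the 5 largest values of number_of_rides
top_5 <- top_stations |>
  slice_max(number_of_rides, n = 5)
# a middle piece works too, e.g. ranks 3 through 7
middle_5 <- top_stations |> slice(3:7)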
select
- Let’s say you are primarily interested in the following variables: ‘hotel’, ‘is_canceled’, and ‘lead_time’
- Use select(..,..,..) and list all the wanted column names
- Create a new data frame with just those columns, calling it `trimmed_df`, by listing the variable names, with bookings_df being the original data frame
trimmed_df <- bookings_df |>
  select(hotel, is_canceled, lead_time)

# we have 153 rows and we just select one column
oz <- hw1 |>
  select(Ozone)
select(-c())
- You can see the use of this function in the code below
- We can use select() with -c(), the negation of select, to select everything except what's inside -c(...)
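A minimal sketch, reusing the bookings_df data frame from the select example above (trimmed_df2 and the dropped columns are arbitrary picks for illustration):
# keep every column of bookings_df EXCEPT the two listed inside -c()
trimmed_df2 <- bookings_df |>
  select(-c(is_canceled, lead_time))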
Subset
To extract a subset of objects in R, there are several operators you can use. We'll see how to extract elements from a vector based on conditions that we specify.
- For example, we may only be interested in the first 20 elements of a vector, or only the elements that are not NA, or only those that are positive or correspond to a specific variable of interest.
- x has been created and contains a random ordering of 20 numbers (from a standard normal distribution) and 20 NAs.
- See the select section above as well. When the list of columns we want is too long to name in select(), one easy way is to create a subset using select() with a - to exclude the columns we DON'T WANT.
- The code below creates an entire subset EXCLUDING the 6 columns listed inside -c()
- Or you can use [] selection to do the same thing, as you see in the next code chunk
multiple columns
trim_trips19_20 <- subset(trips19_20,
  select = -c(start_lat, start_lng, end_lat, end_lng, gender, birthyear))
# OR
outlist <- almostlist[c("State", "Hospital.Name.y")]
subset = filter
- Create a subset for the only state given by the user using subset
stateonly <- filter(datain, State == state)
# OR
stateonly <- subset(datain, State == state)
negation
- We can subset data by excluding rows that contain an undesirable value
- In this case we don't want rows that have the string value "Not Available"
gooddata <- subset(subdata, subdata[[3]] != "Not Available")
subset & as
- Here is a use of as.character() together with subset().
- The second line starts with the inner subset(): using r as the data, we pull the row where the column State equals "HI".
- That row is now a subset with however many columns it has.
- From that subset we pull the value in the Hospital column.
- Then we output that value as character.
r <- rankall("heart attack", 4)
as.character(subset(r, State == "HI")$Hospital)
complete.cases
Return a logical vector indicating which cases are complete, i.e., have no missing values.
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)
good
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
x[good]
[1] 1 2 4 5
y[good]
[1] "a" "b" "d" "f"
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, NA)
good <- complete.cases(x, y)
good
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE
x[good]
[1] 1 2 4
y[good]
[1] "a" "b" "d"
drop column
This is the same as the subset example listed above; you can use -c() to drop the unwanted column(s):
trim_trips19_20 <- trips19_20 |>
  select(-c(start_lat, start_lng, end_lat, end_lng, gender, birthyear))
Here is another example to drop one column and use the same dataset name:
all_trips19_20 <- all_trips19_20 |>
  select(-c(tripduration))
extract rows
- So let’s create a new df with just those rows where the ride_length < 0
neg_length <- all_trips19_20[all_trips19_20$ride_length < 0, ]
extract certain values
zero_length <- all_trips19_20[all_trips19_20$ride_length == 0, ]
%in%
If we want to pull out the restaurants that are in two neighborhoods, Roland Park and Homeland, and assign the result to the variable rest$nearme:
rest$nearme <- rest$neighborhood %in% c("Roland Park", "Homeland")
table(rest$nearme)
FALSE  TRUE
 1314    13
Another example
cnt0 <- subset(named_pm0, State.Code == 36 & county.site %in% both)
cnt1 <- subset(named_pm1, State.Code == 36 & county.site %in% both)
across
if_any
if_all
R code in dplyr verbs is generally evaluated once per group. Inside across(), however, code is evaluated once for each combination of columns and groups. The three signatures are:
across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)
if_any(.cols, .fns, ..., .names = NULL)
if_all(.cols, .fns, ..., .names = NULL)
gdf <-
  tibble(g = c(1, 1, 2, 3), v1 = 10:13, v2 = 20:23) |>
  group_by(g)
set.seed(1)
# Outside: 1 normal variate
n <- rnorm(1)
gdf |> mutate(across(v1:v2, ~ .x + n))
# A tibble: 4 × 3
# Groups: g [3]
      g    v1    v2
  <dbl> <dbl> <dbl>
1     1  9.37  19.4
2     1 10.4   20.4
3     2 11.4   21.4
4     3 12.4   22.4
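To round out the pair, here is a small sketch of if_any() and if_all() on the same gdf tibble; both return one logical per row, which makes them natural partners for filter():
gdf |> filter(if_any(v1:v2, ~ .x > 11)) # keep rows where AT LEAST ONE column passes
gdf |> filter(if_all(v1:v2, ~ .x > 11)) # keep rows where EVERY column passes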
Binary variables
If I want to count how many observations meet a logical condition, I can have the results as TRUE/FALSE values.
ifelse
- I can create a binary variable with ifelse()
- What I get back is a table of whether the zipcode is < 0
- A count of how many zip code values are wrong
restData$zipWrong <- ifelse(restData$zipcode < 0, TRUE, FALSE)
table(restData$zipWrong, restData$zipcode < 0)
        FALSE TRUE
  FALSE  1326    0
  TRUE      0    1
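When you need more than two outcomes, dplyr's case_when() generalizes nested ifelse() calls; a hedged sketch (zipSign is a made-up column name):
restData$zipSign <- case_when(
  restData$zipcode < 0 ~ "negative", # wrong/mistyped zip codes
  restData$zipcode == 0 ~ "zero",
  TRUE ~ "positive" # catch-all for everything else
)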
Categorical variables
cut
- I can break up a numeric variable using cut(), which breaks it up based on a list of values I provide
- Let's say I want to break down the zip codes by quartiles/quantiles and assign the result to zipGroups
- group 1: 0-25%, group 2: 25-50%, and so on. It breaks a quantitative variable into a categorical variable
restData$zipGroups <- cut(restData$zipCode, breaks = quantile(restData$zipCode))
table(restData$zipGroups)
(-2.123e+04,2.12e+04]  (2.12e+04,2.122e+04] (2.122e+04,2.123e+04] (2.123e+04,2.129e+04]
                  337                   375                   282                   332
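One gotcha worth knowing: with breaks = quantile(...), the minimum value itself falls outside the first interval and becomes NA (notice the counts above sum to 1326, one short) unless you add include.lowest = TRUE:
restData$zipGroups2 <- cut(restData$zipCode, # zipGroups2 is a made-up name
                           breaks = quantile(restData$zipCode),
                           include.lowest = TRUE)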
table
table() will use the first argument as the rows and the second as the columns, and it'll break them down that way
table(restData$zipGroups, restData$zipCode)
                        -21226 21201 21205 21206 21207 ...
(-2.123e+04,2.12e+04]        0   136   201     0     0 ...
(2.12e+04,2.122e+04]         0     0     0    27    30 ...
(2.122e+04,2.123e+04]      ...
(2.123e+04,2.129e+04]      ...
cut2
An easier way to do that, without specifying all the cuts manually like I did with cut(), is to use cut2() from the Hmisc package.
- With cut2() you can specify that you want to break the zip code up into four different groups
- and that you want to break them up according to the quantiles
- It will find the quantiles for you and break the variable into four groups accordingly
library(Hmisc)
restData$zipGroups <- cut2(restData$zipCode, g = 4)
table(restData$zipGroups)
[-21226,21205) [21205,21220) [21220,21227) [21227,21287]
           338           375           300           314
Create Factor Variables
factor
- You might want to create factor variables. In other words, it's not clear that incrementing a zip code by one means anything quantitative.
- So you might want to turn it into a factor variable, and the way you do that is with the factor() command
- It takes the integer variable as input and turns it into a factor variable
- If I look at the first ten values of zcf, they look just like they did before
- It also tells me how many different zip code levels there are
- There are 32 different zip code levels, and if I look at the class of this variable it's a factor variable
restData$zcf <- factor(restData$zipCode)
restData$zcf[1:10]
 [1] 21206 21231 21224 21223 21228 21205 21211 21205 21231
32 Levels: -21226 21201 21202 21205 21206 21207 21208 21209 ...
relevel
A couple of other things that you might want to know about factor variables:
- Create a dummy vector so that we can see some other properties of factor variables
- I'm going to randomly create a size-ten vector of yeses and nos
- And turn that into a factor variable; by default it treats the value that comes first alphabetically as the first level
- Suppose I want to treat the yes values as the lowest value
- Then I can relevel that variable and make the reference be the yes value
- When I change the type, remember it starts with the lowest level and calls it 1, the next lowest 2, and so on
<- sample(c("yes", "no"), size=10, replace=TRUE)
yesno = factor(yesno, levels = c("yes", "no"))
yesnofac relevel(yesnofac, ref = "yes")
[1] yes yes no yes yes yes no no yes yes
Levels: yes no
as.numeric(yesnofac)
[1] 1 1 2 1 1 1 2 2 1 1
Check values
any==0
any(all_trips19_20$ride_length == 0)
any<0
any(all_trips19_20$ride_length < 0)
all
all(all_trips19_20$ride_length > 0)
is.na is.nan
Note that is.na() treats NaN as missing too, so is.na() will find NaN as well as NA, but is.nan() will only find NaN values
any(is.na(all_trips19_20$ride_length))
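A quick contrast between the two:
v <- c(1, NA, NaN)
is.na(v)  # [1] FALSE  TRUE  TRUE  -- catches NA and NaN
is.nan(v) # [1] FALSE FALSE  TRUE  -- catches only NaN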
!is.na
any(!is.na(all_trips19_20$ride_length))
which
Another way to omit NA values while using a logical condition is which(). It returns the integer indices where the condition is TRUE; an NA comparison is neither TRUE nor FALSE, so those rows get dropped instead of coming back as rows of NAs. Below:
- We have a df
- In the first block we filter the df for rows where var3 > 11
- We can do the same for a column containing NAs, e.g. keep the rows where var2 > 0
x[which(x$var3 > 11), ]
x[which(x$var2 > 0), ]
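Here is a self-contained sketch (df and its columns are made up to mirror the x used above):
df <- data.frame(var2 = c(NA, 10, 6), var3 = c(5, NA, 14))
df[which(df$var3 > 11), ] # rows where var3 > 11; the NA row is dropped, not returned as an NA row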
any combination
any(!is.na(all_trips19_20$ride_length) & all_trips19_20$ride_length < 0)
check_for_negative_values
If we wanted to know which rows have a negative value in a specific column, one way is to install another package, "fixr", which can identify the rows that contain the negative ride_length.
This function checks if a data frame contains negative values and returns their indices if any are found.
install.packages("fixr")
library(fixr)
check_for_negative_values(all_trips19_20$ride_length)
Data frame contains negative values.
  [1] 520402 648667 651963 652290 653845 655288 655654 656362 657167 658842 660684 663630
 [19] 669752 678262 678499 678889 679182 679252 680301 680757 682576 682662 682853 684254
 [37] 694476 694635 696693 697774 698938 699115 699204 699450 699619 699855 702399 702914
 [55] 714612 714687 714761 715171 715474 719275 719680 725559 726496 730499 731284 731789
 [73] 742697 743604 743619 745898 747194 747233 748686 754006 754853 755032 755536 756753
 [91] 763358 766631 767985 768102 768693 768833 772284 772868 774788 775777 776578 777009
[109] 779170 779496 781519 781699 786503 786687 788671 788946 789252
identify ALL <0 values
Let’s see if any other NEGATIVE values exist in the entire dataset. The output displays all the rows where a NEGATIVE value appears in a specific column. As you can see we have 117 rows, and the only column with negative values is column 10 which happens to be the ride_length column.
check_for_negative_values(all_trips19_20)
Data frame contains negative values.
          row col
  [1,] 520402  10
  [2,] 648667  10
  [3,] 651963  10
  [4,] 652290  10
  ...
[115,] 788671  10
[116,] 788946  10
[117,] 789252  10
Identical
identical(vect, vect2)
[1] TRUE
unique
unique() returns a vector with all duplicate elements removed. This is used quite a bit when viewing a data set for the first time (much like DISTINCT in SQL), or when we want to group a data set by IDs or some other unique variable.
unique(c(3, 4, 5, 5, 5, 6, 6))
[1] 3 4 5 6
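dplyr also ships n_distinct(), a convenient shorthand for length(unique(...)):
n_distinct(c(3, 4, 5, 5, 5, 6, 6))
[1] 4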
sapply
We can count how many unique values each column has using sapply(). (Here unique_vals is assumed to have been created beforehand with unique_vals <- lapply(flags, unique), one list element per column of the flags dataset.)
sapply(unique_vals, length)
    name landmass     zone     area population language religion     bars
     194        6        4      136        48       10        8        5
    blue     gold    white    black   orange  mainhue  circles  crosses
       2        2        2        2        2        8        4        3
triangle     icon  animate     text  topleft botright
       2        2        2        2        7        8
section
x[1:10]
 [1]  1  2 NA  4 NA  5 NA NA NA NA
Index vectors come in four different flavors – logical vectors, vectors of positive integers, vectors of negative integers, and vectors of character strings
na
One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e. missing data). Recall that is.na(x) yields a vector of logical values the same length as x, with TRUEs corresponding to NA values in x and FALSEs corresponding to non-NA values in x.
So if we want to find all the NA values in a vector we can use x[is.na(x)], and we can collect all the non-NA values as well by creating a vector y:
x[is.na(x)]
[1] NA NA
y <- x[!is.na(x)]
y
[1] 1 2 4 5
sub positive values
y[y > 0]
[1] 1 2 4 5
x[x > 0]
[1]  1  2 NA  4 NA  5
x[!is.na(x) & x > 0]
[1] 1 2 4 5
zero-based indexing
Many programming languages use what’s called ‘zero-based indexing’, which means that the first element of a vector is considered element 0.
R uses ‘one-based indexing’, which (you guessed it!) means the first element of a vector is considered element 1.
So how do we subset the 3rd, 5th, and 7th elements of x? Both approaches below work, but it appears that the second one is the way to go.
c(x[3], x[5], x[7])
[1] NA NA NA
x[c(3, 5, 7)]
[1] NA NA NA
#______________________what happens when we ask for out of range values? let's see
x[0]
numeric(0)
exclude
What if we want all elements except a certain one or more? We use the same bracket indexing as above with -
#________________________both of these will do the trick
x[-c(2, 10)]
[1]  1 NA  4 NA  5
x[c(-2, -10)]
[1]  1 NA  4 NA  5
names
# Create a numeric vector with three named elements
vect <- c(foo = 11, bar = 2, norf = NA)
vect
 foo  bar norf
  11    2   NA
# To get the names of the vector we can pass vect as an argument to names()
names(vect)
[1] "foo" "bar" "norf"
# We can reverse the process if we want to name a vector after the fact
vect2 <- c(11, 2, NA)
names(vect2) <- c("foo", "bar", "norf")
# To subset the second element we can use
vect["bar"]
bar
  2
# To subset multiple elements instead of just one we can do
vect[c("foo", "bar")]
foo bar
 11   2
[ ]
The single bracket always returns an object of the same class as the original. So if you are extracting from a vector you'll get a vector; if you subset a list you get a list. It can be used to select more than one element of an object. Only [ ] can be used to extract multiple elements of a list.
<- c("a", "b", "c", "c", "d", "a") #subsetting a vector
x 1] #let's use a numeric index x[
[1] "a"
2] x[
[1] "b"
1:4] x[
[1] "a" "b" "c" "c"
>"a"] #let's use a logical index x[x
[1] "b" "c" "c" "d"
<- x>"a"
u u
[1] FALSE TRUE TRUE TRUE TRUE FALSE
x[u]
[1] "b" "c" "c" "d"
multiple elements
x <- list(foo = 1:4, bar = 0.6, baz = "shaya")
x[c(1, 3)]
$foo
[1] 1 2 3 4

$baz
[1] "shaya"
matrix
x <- matrix(1:6, 2, 3)
x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
x[1, 2]
[1] 3
x[1, 2, drop = FALSE]
     [,1]
[1,]    3
x[1, ]
[1] 1 3 5
[[ ]]
Double brackets are used to extract a SINGLE element from a list or a data frame (though it doesn't necessarily have to be a list or a data frame). Let's subset a list whose first element, named foo, is the sequence 1 to 4, and whose second element, named bar, is the number 0.6.
x <- list(foo = 1:4, bar = 0.6)
x[1]
$foo
[1] 1 2 3 4
x[2] # so far we're familiar with all of it
$bar
[1] 0.6
x[[1]] # with [[ ]] I get back just the sequence 1 to 4
[1] 1 2 3 4
x[[2]]
[1] 0.6
calculated name
[[ ]] has to be used to extract calculated or assigned names; below I assign "foo" to name, then use name to extract the subset
x <- list(foo = 1:4, bar = 0.6, baz = "shaya")
name <- "foo"
x[[name]]
[1] 1 2 3 4
x$name
NULL
x$foo
[1] 1 2 3 4
nested elements
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
x[[c(1, 3)]]
[1] 14
x[[1]][[3]]
[1] 14
x[[c(2, 1)]]
[1] 3.14
x[[2]][[1]]
[1] 3.14
$
Is used like [[ ]] but extracts by name.
x <- list(foo = 1:4, bar = 0.6)
x$bar
[1] 0.6
x$foo
[1] 1 2 3 4
partial matching
You can use $ for partial matching because $ matches names partially (here the first letter is enough since the list has only one element), but [[ ]] looks for an exact match, so it will not work. We can force [[ ]] to partially match by setting exact = FALSE.
x <- list(aamresedd = 1:5)
x
$aamresedd
[1] 1 2 3 4 5
x$a
[1] 1 2 3 4 5
x[["a"]]
NULL
x[["a", exact = FALSE]]
[1] 1 2 3 4 5
removing missing values
We just create a logical vector that tells us where the missing values are, then we exclude them
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
x[!bad]
[1] 1 2 4 5
which
Another way to omit NA values is to use which(). It returns the integer indices where the condition is TRUE; an NA comparison is neither TRUE nor FALSE, so those rows get dropped instead of coming back as rows of NAs. Below:
- We have a df
- In the first block we filter the df for rows where var3 > 11
- We can do the same for a column containing NAs, e.g. keep the rows where var2 > 0
x[which(x$var3 > 11), ]
x[which(x$var2 > 0), ]
Match_df
set.seed(13435)
<- data.frame("var1"= c(1,3,4,9,1,17,4), "var2"=sample(6:12), "var3"=sample(11:17))
x x
var1 var2 var3
1 1 12 13
2 3 11 12
3 4 8 15
4 9 6 11
5 1 10 14
6 17 9 16
7 4 7 17
<- data.frame("var1"= c(1,2,3,4,5), "var3"=sample(8:12), "var5"=sample(13:17))
y y
var1 var3 var5
1 1 12 17
2 2 11 16
3 3 8 13
4 4 9 15
5 5 10 14
<- data.frame("var4"= c(1,2,7,41), "var6"=sample(12:15))
w <- data.frame("var1"= c(1,12,17,4), "var2"=sample(3:6)) z
match_df is a great function to use if you want to extract the rows in one df that share a common value with some rows in another df. It is very similar to a left join / common merge(x, y, by = "var") from above, without the all.x parameter being used. For example, the code below will give us all the rows in x where x$var1 is found in y$var1.
- match_df does not merge or bind any columns from df = y
- It only searches y$var1 looking for a match to the value of x$var1 in the row it is on
- If the var1 value is a match, it extracts that specific row from df = x
- It works in order from top to bottom and appends to the result df in that same order
- Unlike merge, which brings over the extra columns from df = y and arranges the merged df in order of x$var1
matched <- match_df(x, y, on = "var1")
matched
var1 var2 var3
1 1 12 13
2 3 11 12
3 4 8 15
5 1 10 14
7 4 7 17
Here is another example
matched_coal <- match_df(NEI, coal_filter, on = "SCC")
Skimr
skim_without_charts()
Some packages contain more advanced functions for summarizing and exploring your data. One example is the `skimr` package, which has a number of functions for this purpose. For example, the `skim_without_charts()` function provides a detailed summary of the data.
skim_without_charts(bookings_df) # pass in your own data frame name
Dplyr
“The dplyr philosophy is to have small functions that each do one thing well.” Specifically, dplyr supplies five ‘verbs’ that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().
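As a closing sketch, here are all five verbs chained on the built-in mtcars dataset (the columns and thresholds are arbitrary picks):
mtcars |>
  select(mpg, cyl, wt) |>        # keep three columns
  filter(cyl > 4) |>             # keep 6- and 8-cylinder cars
  mutate(heavy = wt > 3) |>      # add a logical column
  arrange(desc(mpg)) |>          # sort by mpg, highest first
  summarize(avg_mpg = mean(mpg)) # collapse to one summary row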