library(tidyverse)
library(plyr) # for match_df
Join - Merge
This is a unique page as it covers topics that will be used often, such as
- Join and Merge, Intersect, match_df
- Flip from Long to Short
Packages
Data
set.seed(13435)
<- data.frame("var1"= c(1,3,4,9,1,17,4), "var2"=sample(6:12), "var3"=sample(11:17))
x x
var1 var2 var3
1 1 12 13
2 3 11 12
3 4 8 15
4 9 6 11
5 1 10 14
6 17 9 16
7 4 7 17
<- data.frame("var1"= c(1,2,3,4,5), "var3"=sample(8:12), "var5"=sample(13:17))
y y
var1 var3 var5
1 1 12 17
2 2 11 16
3 3 8 13
4 4 9 15
5 5 10 14
<- data.frame("var4"= c(1,2,7,41), "var6"=sample(12:15))
w w
var4 var6
1 1 14
2 2 13
3 7 12
4 41 15
<- data.frame("var1"= c(1,12,17,4), "var2"=sample(3:6))
z z
var1 var2
1 1 6
2 12 5
3 17 4
4 4 3
Merge
Left outer join
The `merge()` function in R is a versatile tool for combining datasets. To perform a left join, you use the following syntax if you want to use the left df as the starting point and extract common rows with a second df=y or add to the left df=x any columns that are in the second df and not in the starting df=x that share the same rows as the starting df=x
Here:
- `x` is the left dataset.
- `y` is the right dataset.
- `by` specifies the common column(s) to join on.
- `all.x = TRUE` ensures that all rows from the left dataset are retained and if x has more rows than y, then those extra rows are filled in with NA
all.x=TRUE
- With merge you have the option to preserve all the rows that already exist in the left df by using: all.x = TRUE
- Means all rows in the left df are guaranteed to be in the merged df
- Includes the additional columns if any exist for the rows that share the by=var1
<- merge(x,y,by="var1", all.x=TRUE)
left_outer left_outer
var1 var2 var3.x var3.y var5
1 1 12 13 12 17
2 1 10 14 12 17
3 3 11 12 8 13
4 4 8 15 9 15
5 4 7 17 9 15
6 9 6 11 NA NA
7 17 9 16 NA NA
Inner join
common only
- If you only want the rows in the left df that share the variable var1 in their rows with the df=y on the right
- Omit all.x argument from merge above
- As you see below it only yields the rows from df= x that share a common “var1” with df=y
- Notice the additional columns from df=y that were added
- merge() arranges the rows in ascending order based on the value of by=var1
<- merge(x,y,by="var1")
inner_common inner_common
var1 var2 var3.x var3.y var5
1 1 12 13 12 17
2 1 10 14 12 17
3 3 11 12 8 13
4 4 8 15 9 15
5 4 7 17 9 15
Right outer join
Opposite of left outer join, where in addition to including the common rows, we ensure that all rows from df=y are included
<- merge(x,y,by="var1",all.y=TRUE)
right_outer right_outer
var1 var2 var3.x var3.y var5
1 1 12 13 12 17
2 1 10 14 12 17
3 2 NA NA 11 16
4 3 11 12 8 13
5 4 8 15 9 15
6 4 7 17 9 15
7 5 NA NA 10 14
Outer Join
Will merge all rows from both x and y
<- merge(x,y, by="var1", all=TRUE)
outer_join outer_join
var1 var2 var3.x var3.y var5
1 1 12 13 12 17
2 1 10 14 12 17
3 2 NA NA 11 16
4 3 11 12 8 13
5 4 8 15 9 15
6 4 7 17 9 15
7 5 NA NA 10 14
8 9 6 11 NA NA
9 17 9 16 NA NA
By.X & By.Y
Sometime, you might want to merge two dfs with colnames that don’t match. Therefore you have to specify which index column you want to use from each df.
For the data here we can have:
- As you see it finds the common rows
- Arranges them in order
<- merge(x, y, by.x = "var2", by.y = "var3" )
byx_byy byx_byy
var2 var1.x var3 var1.y var5
1 8 4 15 3 13
2 9 17 16 4 15
3 10 1 14 5 14
4 11 3 12 2 16
5 12 1 13 1 17
Cross join
Is not actually a merge but a cross multiplication.
It will return a cartesian product, the multiplication across two tables so the result would be large. In essence every row from the left df will get multiplied by the rows in the right df
<- merge(x,y,by=NULL) cross_join
Multiple columns
If you want to use multiple columns to merge, meaning by=c(x,x,x,) instead of by=“var1 you want to use multiple variables
<- merge(x,y, by=c("var1", "var3"))
multiple_cols multiple_cols
[1] var1 var3 var2 var5
<0 rows> (or 0-length row.names)
X_join
Left_join
x_join is similar to merge, except that merge is from base R , while x_join is from dplyr package
Look at the similarity with merge/inner join
<- left_join(x,y, by ="var1")
leftjoin leftjoin
var1 var2 var3.x var3.y var5
1 1 12 13 12 17
2 3 11 12 8 13
3 4 8 15 9 15
4 9 6 11 NA NA
5 1 10 14 12 17
6 17 9 16 NA NA
7 4 7 17 9 15
Intersect
So how is intersect different from the merge with common only approach?
- Well intersect returns the rows that have common “by” value.
- What does that really mean, it means that if the df has multiple rows that have 4 as the value of var1, and it happens to exists in y$var1 then the intersect function will return the value that var1 is, in this case 4
- Regardless of how many times it matches in y df.
- So if we only want to extract the common value of var1 from both dfs this is the function to use
<- intersect(x$var1, y$var1)
intersected intersected
[1] 1 3 4
Match_df
Is a great function to use if you want to extract the rows in one df that share a common value with some rows in another df.
It is very similar to left join/common merge(x,y, by=“var”) from above without the all.x parameter being used.
For example the code below will give us
- All the rows in x where
x$var1
is found iny$var1
- match_df does not merge, or bind any columns from df=y
- It only searches y$var1 looking for a match to the row it is on and the value of x$var1 in that row
- If the var1 value is a match it will extract that specific row from df=x
- It does in order from top to bottom and appends to the result df in that same order
- Unlike merge which brings/merges over the extra columns from df=y and arranges the merged df in order of x$var1
<- match_df(x,y,on="var1" )
matched matched
var1 var2 var3
1 1 12 13
2 3 11 12
3 4 8 15
5 1 10 14
7 4 7 17
Here is another example
<- match_df(NEI, coal_filter, on="SCC") matched_coal
Concatenate
Unlike merge, concatenate doesn’t seek common variables. It simply looks to attach one df to another.
There are two ways you can merge datasets; you can add information by adding more rows or by adding more columns to your dataset. In general, when you have datasets that have the same set of columns or have the same set of observations, you can concatenate them vertically or horizontally, respectively using rbind and cbind. If they are not of equal number of columns then you can use bind_rows, and bind_cols both from dplyr
Vertically
rbind
When you have multiple datasets that have the same set of columns, you can concatenate one dataset to another, vertically. That is, keeping the columns of your dataset, you can add more rows to it. Having such information in one file will make it easier for you to aggregate and see the bigger picture without the hassle of switching back and forth between multiple files and losing track of them.
In our case below, we know that x,y don’t have the same columns so it will not execute and produce an error so I will not run it, the code is to show what the syntax is.
<- rbind(x,y) rbound
bind_rows
In the event that you have multiple datasets that have different set of columns with some common ones, you can use bind_rows().
<- bind_rows(x,z)
rowsbound rowsbound
var1 var2 var3
1 1 12 13
2 3 11 12
3 4 8 15
4 9 6 11
5 1 10 14
6 17 9 16
7 4 7 17
8 1 6 NA
9 12 5 NA
10 17 4 NA
11 4 3 NA
Horizontally
cbind
Here instead of growing the first df vertically (adding rows to it), we expand it horizontally which means we add the columns from the second df to its right side. Note: both dfs have to have the same set of columns
<- cbind(x,y) columnbound
bind_cols
Here since we know both dfs have different columns but same number of rows
<- bind_cols(w,z)
bindcolumns bindcolumns
var4 var6 var1 var2
1 1 14 1 6
2 2 13 12 5
3 7 12 17 4
4 41 15 4 3