Join - Merge


This is a unique page as it covers topics that will be used often, such as

  • Join and Merge, Intersect, match_df
  • Flip from Long to Short

Packages

library(tidyverse)
library(plyr)  # for match_df

Data


set.seed(13435)
x <- data.frame("var1"= c(1,3,4,9,1,17,4), "var2"=sample(6:12), "var3"=sample(11:17))
x
  var1 var2 var3
1    1   12   13
2    3   11   12
3    4    8   15
4    9    6   11
5    1   10   14
6   17    9   16
7    4    7   17
y <- data.frame("var1"= c(1,2,3,4,5), "var3"=sample(8:12), "var5"=sample(13:17))
y
  var1 var3 var5
1    1   12   17
2    2   11   16
3    3    8   13
4    4    9   15
5    5   10   14
w <- data.frame("var4"= c(1,2,7,41), "var6"=sample(12:15))
w
  var4 var6
1    1   14
2    2   13
3    7   12
4   41   15
z <- data.frame("var1"= c(1,12,17,4), "var2"=sample(3:6))
z
  var1 var2
1    1    6
2   12    5
3   17    4
4    4    3

Merge


Left outer join

The `merge()` function in R is a versatile tool for combining datasets. To perform a left join, you use the following syntax if you want to use the left df as the starting point and extract common rows with a second df=y or add to the left df=x any columns that are in the second df and not in the starting df=x that share the same rows as the starting df=x

Here:

  • `x` is the left dataset.
  • `y` is the right dataset.
  • `by` specifies the common column(s) to join on.
  • `all.x = TRUE` ensures that all rows from the left dataset are retained and if x has more rows than y, then those extra rows are filled in with NA

all.x=TRUE

  • With merge you have the option to preserve all the rows that already exist in the left df by using: all.x = TRUE
  • Means all rows in the left df are guaranteed to be in the merged df
  • Includes the additional columns if any exist for the rows that share the by=var1
left_outer <- merge(x,y,by="var1", all.x=TRUE)
left_outer
  var1 var2 var3.x var3.y var5
1    1   12     13     12   17
2    1   10     14     12   17
3    3   11     12      8   13
4    4    8     15      9   15
5    4    7     17      9   15
6    9    6     11     NA   NA
7   17    9     16     NA   NA

Inner join

common only

  • If you only want the rows in the left df that share the variable var1 in their rows with the df=y on the right
  • Omit all.x argument from merge above
  • As you see below it only yields the rows from df= x that share a common “var1” with df=y
  • Notice the additional columns from df=y that were added
  • merge() arranges the rows in ascending order based on the value of by=var1
inner_common <- merge(x,y,by="var1")
inner_common
  var1 var2 var3.x var3.y var5
1    1   12     13     12   17
2    1   10     14     12   17
3    3   11     12      8   13
4    4    8     15      9   15
5    4    7     17      9   15

Right outer join

Opposite of left outer join, where in addition to including the common rows, we ensure that all rows from df=y are included

right_outer <- merge(x,y,by="var1",all.y=TRUE)
right_outer
  var1 var2 var3.x var3.y var5
1    1   12     13     12   17
2    1   10     14     12   17
3    2   NA     NA     11   16
4    3   11     12      8   13
5    4    8     15      9   15
6    4    7     17      9   15
7    5   NA     NA     10   14

Outer Join

Will merge all rows from both x and y

outer_join <- merge(x,y, by="var1", all=TRUE)
outer_join
  var1 var2 var3.x var3.y var5
1    1   12     13     12   17
2    1   10     14     12   17
3    2   NA     NA     11   16
4    3   11     12      8   13
5    4    8     15      9   15
6    4    7     17      9   15
7    5   NA     NA     10   14
8    9    6     11     NA   NA
9   17    9     16     NA   NA

By.X & By.Y

Sometime, you might want to merge two dfs with colnames that don’t match. Therefore you have to specify which index column you want to use from each df.

For the data here we can have:

  • As you see it finds the common rows
  • Arranges them in order
byx_byy <- merge(x, y, by.x = "var2", by.y = "var3" )
byx_byy
  var2 var1.x var3 var1.y var5
1    8      4   15      3   13
2    9     17   16      4   15
3   10      1   14      5   14
4   11      3   12      2   16
5   12      1   13      1   17

Cross join

Is not actually a merge but a cross multiplication.

It will return a cartesian product, the multiplication across two tables so the result would be large. In essence every row from the left df will get multiplied by the rows in the right df

cross_join <- merge(x,y,by=NULL)

Multiple columns

If you want to use multiple columns to merge, meaning by=c(x,x,x,) instead of by=“var1 you want to use multiple variables

multiple_cols <- merge(x,y, by=c("var1", "var3"))
multiple_cols
[1] var1 var3 var2 var5
<0 rows> (or 0-length row.names)

X_join


Left_join

x_join is similar to merge, except that merge is from base R , while x_join is from dplyr package

Look at the similarity with merge/inner join

leftjoin <- left_join(x,y, by ="var1")
leftjoin
  var1 var2 var3.x var3.y var5
1    1   12     13     12   17
2    3   11     12      8   13
3    4    8     15      9   15
4    9    6     11     NA   NA
5    1   10     14     12   17
6   17    9     16     NA   NA
7    4    7     17      9   15

Intersect


So how is intersect different from the merge with common only approach?

  • Well intersect returns the rows that have common “by” value.
  • What does that really mean, it means that if the df has multiple rows that have 4 as the value of var1, and it happens to exists in y$var1 then the intersect function will return the value that var1 is, in this case 4
  • Regardless of how many times it matches in y df.
  • So if we only want to extract the common value of var1 from both dfs this is the function to use
intersected <- intersect(x$var1, y$var1)
intersected
[1] 1 3 4

Match_df


Is a great function to use if you want to extract the rows in one df that share a common value with some rows in another df.

It is very similar to left join/common merge(x,y, by=“var”) from above without the all.x parameter being used.

For example the code below will give us

  • All the rows in x where x$var1 is found in y$var1
  • match_df does not merge, or bind any columns from df=y
  • It only searches y$var1 looking for a match to the row it is on and the value of x$var1 in that row
  • If the var1 value is a match it will extract that specific row from df=x
  • It does in order from top to bottom and appends to the result df in that same order
  • Unlike merge which brings/merges over the extra columns from df=y and arranges the merged df in order of x$var1
matched <- match_df(x,y,on="var1" )
matched
  var1 var2 var3
1    1   12   13
2    3   11   12
3    4    8   15
5    1   10   14
7    4    7   17

Here is another example

matched_coal <- match_df(NEI, coal_filter, on="SCC")

Concatenate


Unlike merge, concatenate doesn’t seek common variables. It simply looks to attach one df to another.

There are two ways you can merge datasets; you can add information by adding more rows or by adding more columns to your dataset. In general, when you have datasets that have the same set of columns or have the same set of observations, you can concatenate them vertically or horizontally, respectively using rbind and cbind. If they are not of equal number of columns then you can use bind_rows, and bind_cols both from dplyr

Vertically

rbind

When you have multiple datasets that have the same set of columns, you can concatenate one dataset to another, vertically. That is, keeping the columns of your dataset, you can add more rows to it. Having such information in one file will make it easier for you to aggregate and see the bigger picture without the hassle of switching back and forth between multiple files and losing track of them.

In our case below, we know that x,y don’t have the same columns so it will not execute and produce an error so I will not run it, the code is to show what the syntax is.

rbound <- rbind(x,y)

bind_rows

In the event that you have multiple datasets that have different set of columns with some common ones, you can use bind_rows().

rowsbound <- bind_rows(x,z)
rowsbound
   var1 var2 var3
1     1   12   13
2     3   11   12
3     4    8   15
4     9    6   11
5     1   10   14
6    17    9   16
7     4    7   17
8     1    6   NA
9    12    5   NA
10   17    4   NA
11    4    3   NA

Horizontally

cbind

Here instead of growing the first df vertically (adding rows to it), we expand it horizontally which means we add the columns from the second df to its right side. Note: both dfs have to have the same set of columns

columnbound <- cbind(x,y)

bind_cols

Here since we know both dfs have different columns but same number of rows

bindcolumns <- bind_cols(w,z)
bindcolumns
  var4 var6 var1 var2
1    1   14    1    6
2    2   13   12    5
3    7   12   17    4
4   41   15    4    3