The merge() function in R is a versatile tool for combining datasets.
To perform a left join, you use the following syntax if you want to use the df_left as the starting point and extract common rows with a second df_right - use all.x=FALSE (known as inner join/common only join)
add to the df_left any columns that are in the second df_right and not in the starting df_left that share the same by=“variable” rows as the starting df_left use all.x=TRUE (known as left outer join)
Syntax: merge(df_left, df_right, by=“common column” to join on, all.x = TRUE/FALSE)
Inner join/common only
If you only want the rows in the left df that share the variable var1 in their rows with the df=y on the right
Omit all.x argument from merge above
As you see below it only yields the rows from df_left that share a common “var” with df_right
If df_right has additional columns that didn’t exist in df_left they will be added
merge() arranges the rows in ascending order based on the value of by=var1
Well intersect returns the rows that have common “by” value.
Regardless of how many times it matches in y df.
Case Study
site0_ and site1_ are all the active sensors in NY state for the years of 1999 & 2012 respectively
We need to find the common sensors that were used in both years so we can analyze the difference
data: yearly PM25
Common Rows
Per the definition of intersect, we are not looking for all the sensor rows we are just looking for the sensors that exist in both sets of data. We don’t care how often they occur just which ones occur at least ONCE
# could use: gt_preview(both, top_n = 15)both <-intersect(site0_, site1_)gt(data.frame(both)) |>opt_table_outline()
both
1.5
1.12
5.80
13.11
29.5
31.3
63.2008
67.1015
85.55
101.3
Case Study
So if we only want to extract the common value of var1 from both dfs this is the function to use
data: EPA PM2.5
If you remember from above: nrow(coal_rows) = 99
When we used match_df to check how many rows in NEI shared the same rows with coal_rows we had: nrow(coal_emissions) = 28480
So intersect by definition CANNOT have more values than the smallest of the two.
Meaning since our coal_rows = 99 we cannot have intersect yield a higher value than 99, because intersect counts as one if a value is shared and then skips the other times it might share again, and goes on to the next value in the coal_rows till it does it 99 times
Let’s see: well as it turns out that not all coal related combustions were measured in NEI