Match - Intersect


  • There are times we need to find the rows/columns in a large dataset that match the rows/columns of a smaller control dataset.
  • In some cases we can use join, merge, intersect, and other functions if applicable, in this case we’ll use match_df
  • match_df is covered in both Join - Merge & Filter - Subset pages, it is also covered in Extract Rows
  • I’ll cover it here because I’ll have other ways to accomplish the same detailed as well

data: EPA PM2.5

  • Same data used in Group - Tidy
  • Additional information on the data and file (summarySCC_PM25.rds):
  • This data was used in case study EPA - EDA 1
  • fips: A five-digit number (represented as a string) indicating the U.S. county
  • SCCSCC: The name of the source as indicated by a digit string (see source code classification table)
  • Pollutant: A string indicating the pollutant
  • Emissions: Amount of PM2.5 emitted, in tons
  • type: The type of source (point, non-point, on-road, or non-road)
  • year: The year of emissions recorded

Source Classification Code Table (Source_Classification_Code.rds):

  • This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source.
library(tidyverse)
library(plyr)   #for match_df
library(gt)

con1 <- file("D:/Education/R/Data/EPA/summarySCC_PM25.rds")
con2 <- file("D:/Education/R/Data/EPA/Source_Classification_Code.rds")
NEI <- readRDS(con1)
SCC <- readRDS(con2)
head(NEI)
    fips      SCC Pollutant Emissions  type year
4  09001 10100401  PM25-PRI    15.714 POINT 1999
8  09001 10100404  PM25-PRI   234.178 POINT 1999
12 09001 10100501  PM25-PRI     0.128 POINT 1999
16 09001 10200401  PM25-PRI     2.036 POINT 1999
20 09001 10200504  PM25-PRI     0.388 POINT 1999
24 09001 10200602  PM25-PRI     1.490 POINT 1999
unique(NEI$year)
[1] 1999 2002 2005 2008

case study

  • How much yearly emissions from coal combustion-related sources were there from 1999-2008
  • All the sources are in SCC$EI.Sector

extract conditional coal_rows

  • So we need to scan through all the rows of EI.Sector variable and pull out all the rows related to coal combustion
  • From visually scanning the column it appears that any coal related readings would be labeled with “- Coal”
  • So as we did before for columns we can use grep/select or grepl/filter combination to extract the rows we need
  • As you see below each row has an SCC code associated with it

Extract conditional coal-related rows

coal_rows <- SCC |> filter(grepl("- Coal", EI.Sector))
nrow(coal_rows)
[1] 99
#head(coal_rows[,c(1,3)], 5)
# can use this as well
coal_rows |> 
        select(c(1,3)) |> 
        gt_preview(top_n = 5)
gt(coal_rows[1:5,c(1,3)]) |> opt_table_outline()
SCC Short.Name
10100101 Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal
10100102 Ext Comb /Electric Gen /Anthracite Coal /Traveling Grate (Overfeed) Stoker
10100201 Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Wet Bottom
10100202 Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Dry Bottom
10100203 Ext Comb /Electric Gen /Bituminous Coal /Cyclone Furnace

Match_df


  • Now that we have all the rows that specify coal related emissions with an SCC code for each measurement
  • We need to extract all the rows from the original dataset NEI that are measurements of coal related emissions
  • In order to do that we can use match_df to extract all the rows in NEI that match with all the 99 rows in coal_rows
  • Similar to merge: we specify (data, vector to be matched, on = common column to look for the match)
  • As you see in the results: the 99 rows matched with 28480 rows in the original dataset

Match extracted coal specific rows with all the rows with the same SCC code in the original dataset, save into coal_emissions for analysis

coal_emissions <- match_df(NEI, coal_rows, on="SCC")
nrow(coal_emissions)
[1] 28480

aggregate sum of coal emissions per year

coal_emm_yearly <- coal_emissions |> 
        aggregate(Emissions ~ year, sum)
coal_emm_yearly
  year Emissions
1 1999  572126.5
2 2002  546789.2
3 2005  552881.5
4 2008  343432.2

Merge


The merge() function in R is a versatile tool for combining datasets.

  • To perform a left join, you use the following syntax if you want to use the df_left as the starting point and extract common rows with a second df_right - use all.x=FALSE (known as inner join/common only join)
  • add to the df_left any columns that are in the second df_right and not in the starting df_left that share the same by=“variable” rows as the starting df_left use all.x=TRUE (known as left outer join)

Syntax: merge(df_left, df_right, by=“common column” to join on, all.x = TRUE/FALSE)

Inner join/common only

  • If you only want the rows in the left df that share the variable var1 in their rows with the df=y on the right
  • Omit all.x argument from merge above
  • As you see below it only yields the rows from df_left that share a common “var” with df_right
  • If df_right has additional columns that didn’t exist in df_left they will be added
  • merge() arranges the rows in ascending order based on the value of by=var1
merged_coal_emissions <- merge(NEI, coal_rows, by="SCC", all.x=FALSE)
nrow(merged_coal_emissions)
[1] 28480

Aggregate sum of coal emissions per year as we did above to see if we come up with the same results using merge

merged_coal_emissions |> aggregate(Emissions ~ year, sum)
  year Emissions
1 1999  572126.5
2 2002  546789.2
3 2005  552881.5
4 2008  343432.2

Intersect


How is intersect different?

  • Well intersect returns the rows that have common “by” value.
  • Regardless of how many times it matches in y df.

Case Study

  • site0_ and site1_ are all the active sensors in NY state for the years of 1999 & 2012 respectively
  • We need to find the common sensors that were used in both years so we can analyze the difference

data: yearly PM25

Common Rows

Per the definition of intersect, we are not looking for all the sensor rows we are just looking for the sensors that exist in both sets of data. We don’t care how often they occur just which ones occur at least ONCE

# could use: gt_preview(both, top_n = 15)
both <-  intersect(site0_, site1_)
gt(data.frame(both)) |> opt_table_outline()  
both
1.5
1.12
5.80
13.11
29.5
31.3
63.2008
67.1015
85.55
101.3

Case Study

  • So if we only want to extract the common value of var1 from both dfs this is the function to use

data: EPA PM2.5

  • If you remember from above: nrow(coal_rows) = 99
  • When we used match_df to check how many rows in NEI shared the same rows with coal_rows we had: nrow(coal_emissions) = 28480
  • So intersect by definition CANNOT have more values than the smallest of the two.
  • Meaning since our coal_rows = 99 we cannot have intersect yield a higher value than 99, because intersect counts as one if a value is shared and then skips the other times it might share again, and goes on to the next value in the coal_rows till it does it 99 times
  • Let’s see: well as it turns out that not all coal related combustions were measured in NEI

Common Columns

intersected <- intersect(NEI$SCC, coal_rows$SCC)
length(intersected)
[1] 80