Extract Rows


1 var Extraction


Case Study

  • How much yearly emissions from coal combustion-related sources were there from 1999-2008
  • All the sources are in SCC$EI.Sector

data: EPA PM2.5

  • Same data used in Group - Tidy
  • Additional information on the data and file (summarySCC_PM25.rds):
  • This data was used in case study EPA - EDA 1
  • fips: A five-digit number (represented as a string) indicating the U.S. county
  • SCCSCC: The name of the source as indicated by a digit string (see source code classification table)
  • Pollutant: A string indicating the pollutant
  • Emissions: Amount of PM2.5 emitted, in tons
  • type: The type of source (point, non-point, on-road, or non-road)
  • year: The year of emissions recorded

Source Classification Code Table (Source_Classification_Code.rds):

  • This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source.
library(tidyverse)
library(plyr)
library(plyr)   #for match_df

con1 <- file("D:/Education/R/Data/EPA/summarySCC_PM25.rds")
con2 <- file("D:/Education/R/Data/EPA/Source_Classification_Code.rds")
NEI <- readRDS(con1)
SCC <- readRDS(con2)
head(NEI)
    fips      SCC Pollutant Emissions  type year
4  09001 10100401  PM25-PRI    15.714 POINT 1999
8  09001 10100404  PM25-PRI   234.178 POINT 1999
12 09001 10100501  PM25-PRI     0.128 POINT 1999
16 09001 10200401  PM25-PRI     2.036 POINT 1999
20 09001 10200504  PM25-PRI     0.388 POINT 1999
24 09001 10200602  PM25-PRI     1.490 POINT 1999
unique(NEI$year)
[1] 1999 2002 2005 2008

Extract Rows from df1

  • So we need to scan through all the rows of EI.Sector variable and pull out all the rows related to coal combustion
  • From visually scanning the column it appears that any coal related readings would be labeled with “- Coal”
  • So as we did before for columns we can use grep/select or grepl/filter combination to extract the rows we need
  • As you see below each row has an SCC code associated with it

Extract conditional coal-related rows

coal_rows <- SCC |> filter(grepl("- Coal", EI.Sector))
nrow(coal_rows)
[1] 99
head(coal_rows[,c(1,3)], 5)
       SCC
1 10100101
2 10100102
3 10100201
4 10100202
5 10100203
                                                                  Short.Name
1                   Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal
2 Ext Comb /Electric Gen /Anthracite Coal /Traveling Grate (Overfeed) Stoker
3       Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Wet Bottom
4       Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Dry Bottom
5                   Ext Comb /Electric Gen /Bituminous Coal /Cyclone Furnace

Extract Matched Rows from df2

  • Now that we have all the rows that specify coal related emissions with an SCC code for each measurement
  • We need to extract all the rows from the original dataset NEI that are measurements of coal related emissions
  • In order to do that we can use match_df to extract all the rows in NEI that match with all the 99 rows in coal_rows
  • Similar to merge: we specify (data, vector to be matched, on = common column to look for the match)
  • As you see in the results: the 99 rows matched with 28480 rows in the original dataset

Match extracted coal specific rows with all the rows with the same SCC code in the original dataset, save into coal_emissions for analysis

coal_emissions <- match_df(NEI, coal_rows, on="SCC")
nrow(coal_emissions)
[1] 28480

aggregate sum per year

coal_emm_yearly <- coal_emissions |> 
        aggregate(Emissions ~ year, sum)
coal_emm_yearly
  year Emissions
1 1999  572126.5
2 2002  546789.2
3 2005  552881.5
4 2008  343432.2

2 var Extraction


Case Study

  • Var 1: Extract the rows/observations from sensors used in 1999 and 2012 - use the file both that was created in other pages which contains a list of common sensors
  • Var 2: Look for NY State code = 36 only

data

We’ll be using another version of the EPA PM2.5 data used in the the previous EPA - EDA pages. This version has been modified and saved locally in several csv files.

  • pm0 contains data from 1999

  • pm1 contains data from2012

  • cnames.txt is a variable containing list of characters for column names

  • wcol_df is a df holds the indices of the 5 columns we

  • site0 and site1 are for sensors in the NY state area for 1999 and 2012 respectively described as County.Code and Site.ID concatenated together with “.” as seperator

  • both contains the list of sensors that were in use for NY State in both years 1999 and 2012

library(gt)
library(dplyr)
library(tidyverse)
library(stats)   # for quantile

Extract Rows

Let’s extract the observations for those sensors that are in common for both time dfs - remember State.code for NY is == 36 and - the common sensors are in both

%in%

  • we’ll use %in% which means exactly what it sounds like
cnt0 <- subset(named_pm0, State.Code == 36 & county.site %in% both)
cnt1 <- subset(named_pm1, State.Code == 36 & county.site %in% both)
head(cnt0)
      State.Code County.Code Site.ID     Date Sample.Value county.site
65873         36           1       5 19990702           NA         1.5
65874         36           1       5 19990705           NA         1.5
65875         36           1       5 19990708           NA         1.5
65876         36           1       5 19990711           NA         1.5
65877         36           1       5 19990714         11.8         1.5
65878         36           1       5 19990717         49.4         1.5
head(cnt1)
       State.Code County.Code Site.ID     Date Sample.Value county.site
835337         36           1       5 20120101         7.04         1.5
835338         36           1       5 20120102         2.08         1.5
835339         36           1       5 20120103         3.58         1.5
835340         36           1       5 20120104         6.45         1.5
835341         36           1       5 20120107        17.29         1.5
835342         36           1       5 20120110           NA         1.5