library(gt)
library(dplyr)
library(tidyverse)
library(stats) # for quantile
Percent & Summary
Data
PM25
- One of the datasets we’ll be using is another version of the EPA PM2.5 data, brought over from Case Study - EPA - EDA 3.
preview
Let’s look at pm0
tab_pm0 <- named_pm0[1:10,] |>
  gt() |>
  tab_options(table.align = "center", table.width = pct(50)) |>
  fmt_number(suffixing = FALSE) |>
  opt_table_outline()
tab_pm0
State.Code | County.Code | Site.ID | Date | Sample.Value |
---|---|---|---|---|
1.00 | 27.00 | 1.00 | 19,990,103.00 | NA |
1.00 | 27.00 | 1.00 | 19,990,106.00 | NA |
1.00 | 27.00 | 1.00 | 19,990,109.00 | NA |
1.00 | 27.00 | 1.00 | 19,990,112.00 | 8.84 |
1.00 | 27.00 | 1.00 | 19,990,115.00 | 14.92 |
1.00 | 27.00 | 1.00 | 19,990,118.00 | 3.88 |
1.00 | 27.00 | 1.00 | 19,990,121.00 | 9.04 |
1.00 | 27.00 | 1.00 | 19,990,124.00 | 5.46 |
1.00 | 27.00 | 1.00 | 19,990,127.00 | 20.17 |
1.00 | 27.00 | 1.00 | 19,990,130.00 | 11.56 |
Let’s look at x0
gt_preview(x0, top_n = 10)
data | |
---|---|
1 | NA |
2 | NA |
3 | NA |
4 | 8.841 |
5 | 14.920 |
6 | 3.878 |
7 | 9.042 |
8 | 5.464 |
9 | 20.170 |
10 | 11.560 |
11..117420 | |
117421 | 4.900 |
Percent of
PM25
NAs
- is.na() returns TRUE for every NA, and mean() treats TRUE as 1 and FALSE as 0, so mean(is.na(x)) gives the proportion of NAs
- Note: x0 & x1 are sensor readings for 1999 & 2012 (the dates in the pm0 preview above are from January 1999)
What percentage of the x0 & x1 data are NAs? 11.26% and 5.61%
mean(is.na(x0))
[1] 0.1125608
mean(is.na(x1))
[1] 0.05607125
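To see why mean(is.na(x)) is a proportion, here is a minimal sketch on a made-up vector. The name toy and its values are hypothetical and are not part of the EPA data.

```r
# Hypothetical toy vector, not part of the EPA data
toy <- c(NA, 8.8, NA, 3.9, 9.0)

is.na(toy)                      # logical vector: TRUE where the value is missing
na_count <- sum(is.na(toy))     # TRUE counts as 1, so this is the number of NAs
na_share <- mean(is.na(toy))    # number of NAs divided by length(toy)
round(100 * na_share, 2)        # the same share expressed as a percentage
```

The same two calls, sum() for the count and mean() for the share, are used on x0 and x1 throughout this section.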
negative Values
- Sensor readings should not have negative values, so before we decide what to do with those readings, let’s find out how many there are
- Note: we have over 1.3M rows of data
- Let’s count them first
- Then calculate the percentage (with na.rm = TRUE this is the share of non-missing readings)
What percentage of the x0 & x1 sensor readings are negative? 0% and 2.15%
negative_x0 <- x0 < 0
sum(negative_x0, na.rm = TRUE)
[1] 0
mean(negative_x0, na.rm = TRUE)
[1] 0
negative_x1 <- x1 < 0
sum(negative_x1, na.rm = TRUE)
[1] 26474
mean(negative_x1, na.rm=TRUE)
[1] 0.0215034
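One detail worth spelling out is the denominator. With na.rm = TRUE, mean() divides by the number of non-missing readings, not by the total number of rows. A small sketch with a hypothetical vector (toy and its values are made up for illustration) shows the difference:

```r
# Hypothetical toy vector, not part of the EPA data
toy <- c(NA, -1.5, 4.0, 7.2, NA, -0.3, 12.1, 5.5)
negative_toy <- toy < 0          # comparing NA with 0 yields NA, not FALSE

neg_count <- sum(negative_toy, na.rm = TRUE)         # number of negative readings
share_of_valid <- mean(negative_toy, na.rm = TRUE)   # negatives / non-missing readings
share_of_all <- neg_count / length(toy)              # negatives / all rows, NAs included

c(neg_count = neg_count, share_of_valid = share_of_valid, share_of_all = share_of_all)
```

Which share to report depends on the question being asked; the 2.15% above is the share among non-missing readings.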
Summary
summary() gives us
- the minimum and maximum (the range)
- the 1st and 3rd quartiles
- mean
- median
- count of NA’s
PM25
summary(x0) # we could have used summary(named_pm0$Sample.Value)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 7.20 11.50 13.74 17.90 157.10 13217
summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-10.00 4.00 7.63 9.14 12.00 908.97 73133
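Observations:
- Both mean and median decreased
- The count of NA’s rose from 13217 to 73133, but x1 is a much larger dataset, so the share of NAs actually fell from about 11.3% to about 5.6%
- Negative values appeared in x1 (the minimum is -10)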
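Because summary() reports only the count of NA's, it helps to put the count next to the share when comparing datasets of different sizes. The sketch below assumes a small hypothetical helper, na_profile(), and two made-up vectors; none of these names come from the original analysis.

```r
# Hypothetical toy vectors standing in for a small and a large dataset
toy0 <- c(NA, 8.8, 14.9, 3.9)
toy1 <- c(NA, NA, -2.0, 4.0, 7.6, 12.0, 9.1, 5.2, 3.3, 6.4)

# Hypothetical helper: length, NA count, and NA share of a numeric vector
na_profile <- function(v) {
  c(n = length(v), na_count = sum(is.na(v)), na_share = mean(is.na(v)))
}

# One row per vector, so count and share can be read side by side
profile <- rbind(toy0 = na_profile(toy0), toy1 = na_profile(toy1))
profile
```

In this toy case the larger vector has more NAs but a smaller share of them, which is the same pattern as x0 versus x1.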
Quantile
quantile() produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
Syntax: see Summarize - Arrange for more
Type:
- Discontinuous sample quantile types 1, 2, and 3
- Continuous sample quantile types 4 through 9
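- I’ll use type = 9, whose estimates are approximately unbiased for the expected order statistics when the data are normally distributed
- Notice that the values match the summary() output above (which uses the default type = 7): the quartiles fall at the same points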
quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, …)
quantile(x1, probs = seq(0, 1, 0.25), na.rm = TRUE, names = TRUE, type = 9)
0% 25% 50% 75% 100%
-10.00000 4.00000 7.63333 12.00000 908.97000
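The type argument only changes how quantile() interpolates between neighbouring observations. On a large vector such as x1 the choice makes little practical difference, but on a small sample the quartiles can differ. A minimal sketch on a hypothetical ten-value vector (toy is made up for illustration):

```r
# Hypothetical toy vector, not part of the EPA data
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Same probabilities, two interpolation rules
q7 <- quantile(toy, probs = seq(0, 1, 0.25), type = 7)   # R's default
q9 <- quantile(toy, probs = seq(0, 1, 0.25), type = 9)   # the type used in this section

# One row per type; the 0%, 50% and 100% columns agree, the 25% and 75% columns do not
rbind(type7 = q7, type9 = q9)
```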
thirds
Let’s say we want to see the distribution in thirds instead of quarters.
quantile(x1, probs = seq(0, 1, 1/3), na.rm = TRUE, names = TRUE, type = 9)
0% 33.33333% 66.66667% 100%
-10.00 5.20 10.10 908.97
The same call with a step of 1/5 gives fifths:
quantile(x1, probs = seq(0, 1, 1/5), na.rm = TRUE, names = TRUE, type = 9)
0% 20% 40% 60% 80% 100%
-10.00 3.50 6.00 9.00 13.60 908.97
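The thirds and fifths calls differ only in the step passed to seq(), so they can be wrapped in one function. This is a sketch under stated assumptions: n_tiles() is a hypothetical helper name, and the toy vector is made up. Using length.out = n + 1 instead of a fractional step avoids relying on floating-point accumulation to land exactly on 1.

```r
# Hypothetical helper: split a numeric vector into n equal-probability groups
n_tiles <- function(v, n, type = 9) {
  quantile(v, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE, type = type)
}

# Hypothetical toy vector, not part of the EPA data
toy <- c(NA, 2, 4, 6, 8, 10, 12)

thirds <- n_tiles(toy, 3)   # 4 cut points: 0%, 33.3%, 66.7%, 100%
fifths <- n_tiles(toy, 5)   # 6 cut points: 0%, 20%, ..., 100%
thirds
fifths
```

With the real data, n_tiles(x1, 3) would be expected to reproduce the thirds call above, since it passes the same probabilities and type.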
Range
range of x0 and x1
range(x0, na.rm=TRUE)
[1] 0.0 157.1
range(x1, na.rm=TRUE)
[1] -10.00 908.97
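range() returns the minimum and maximum as a length-two vector, and diff() turns that into the width of the range. The sketch below uses a hypothetical vector (toy is made up) and also shows why na.rm = TRUE is needed when NAs are present:

```r
# Hypothetical toy vector, not part of the EPA data
toy <- c(NA, -10, 4, 7.6, 12, 908.97)

r <- range(toy, na.rm = TRUE)   # c(min, max) of the non-missing values
spread <- diff(r)               # max minus min
no_rm <- range(toy)             # without na.rm = TRUE both elements are NA

r
spread
no_rm
```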