Percent & Summary

library(gt)
library(dplyr)
library(tidyverse)
library(stats)   # for quantile(); stats is attached by default, so this line is optional

Data

PM25

One of the datasets we’ll be using, another version of the EPA PM2.5 data, is brought over from Case Study - EPA - EDA 3.

preview

Let’s look at the first 10 rows of named_pm0

tab_pm0 <- named_pm0[1:10,] |>
        gt() |> 
        tab_options(table.align = "center", table.width = pct(50)) |> 
        fmt_number(suffixing = FALSE) |> 
        opt_table_outline()
tab_pm0

State.Code	County.Code	Site.ID	Date	Sample.Value
1.00	27.00	1.00	19,990,103.00	NA
1.00	27.00	1.00	19,990,106.00	NA
1.00	27.00	1.00	19,990,109.00	NA
1.00	27.00	1.00	19,990,112.00	8.84
1.00	27.00	1.00	19,990,115.00	14.92
1.00	27.00	1.00	19,990,118.00	3.88
1.00	27.00	1.00	19,990,121.00	9.04
1.00	27.00	1.00	19,990,124.00	5.46
1.00	27.00	1.00	19,990,127.00	20.17
1.00	27.00	1.00	19,990,130.00	11.56

Let’s look at x0

gt_preview(x0, top_n = 10)

	data
1	NA
2	NA
3	NA
4	8.841
5	14.920
6	3.878
7	9.042
8	5.464
9	20.170
10	11.560
11..117420
117421	4.900

Percent of

PM25

NAs

is.na() returns TRUE for each missing value and FALSE otherwise. Because mean() treats TRUE as 1 and FALSE as 0, mean(is.na(x)) gives the proportion of values that are NA
Note: x0 & x1 are sensor readings for 1999 & 2012 (the Date column in the pm0 preview above shows 1999 dates)

What percentage of the x0 & x1 data are NAs? 11.26% and 5.61%

mean(is.na(x0))

[1] 0.1125608

mean(is.na(x1))

[1] 0.05607125
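To see why mean(is.na(...)) yields a proportion, here is a minimal sketch on a made-up vector. The name toy and its values are hypothetical and not part of the EPA data.

```r
# Hypothetical toy vector: 2 of the 8 readings are missing
toy <- c(NA, 8.84, 14.92, NA, 3.88, 9.04, 5.46, 20.17)

# is.na() returns a logical vector: TRUE where a value is missing
missing <- is.na(toy)

# sum() counts the TRUEs; mean() gives their share of all elements
na_count <- sum(missing)
na_share <- mean(missing)

na_count                   # 2
round(100 * na_share, 2)   # 25
```

Multiplying the proportion by 100 gives the percentage quoted in the text.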

Negative Values

Sensor readings should not be negative, so before we decide what to do with such readings, let’s find out how many there are
Note: x1 has over 1.3M rows of data
Let’s count them first
Then calculate the percentage

What percentage of the x0 & x1 sensor readings are negative? 0% and 2.15% (of the non-missing readings, since na.rm = TRUE drops the NAs before averaging)

negative_x0 <- x0<0
sum(negative_x0, na.rm = TRUE)

[1] 0

mean(negative_x0, na.rm = TRUE)

[1] 0

negative_x1 <- x1<0
sum(negative_x1, na.rm = TRUE )

[1] 26474

mean(negative_x1, na.rm=TRUE)

[1] 0.0215034
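One detail worth keeping in mind: with na.rm = TRUE, mean() divides by the number of non-missing readings, not by the total number of rows. The sketch below uses a made-up vector (toy is hypothetical) to show the two possible denominators side by side.

```r
# Hypothetical toy vector: 8 readings, 1 missing, 2 negative
toy <- c(NA, -1.5, 4.0, 7.6, -0.2, 12.0, 9.1, 3.3)

# Comparing NA with 0 yields NA, so the logical vector keeps the gap
negative_toy <- toy < 0

# Count of negative readings (NAs dropped)
neg_count <- sum(negative_toy, na.rm = TRUE)

# Share of the non-missing readings that are negative: 2 / 7
share_of_valid <- mean(negative_toy, na.rm = TRUE)

# Share of all rows, missing ones included: 2 / 8
share_of_all <- neg_count / length(toy)

c(valid = share_of_valid, all = share_of_all)
```

Which denominator is appropriate depends on the question being asked; the figures on this page use the non-missing readings.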

Summary

summary() gives us

the range (minimum and maximum)
the quartiles (1st and 3rd)
mean
median
count of NA’s

PM25

summary(x0)  # we could have used summary(named_pm0$Sample.Value)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    7.20   11.50   13.74   17.90  157.10   13217

summary(x1)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -10.00    4.00    7.63    9.14   12.00  908.97   73133

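Observations:

Both the mean (13.74 to 9.14) and the median (11.50 to 7.63) decreased
The number of NA’s rose from 13,217 to 73,133, although their share of the data fell from 11.26% to 5.61%
Negative values appeared in x1 (minimum of -10.00), which sensor readings should not have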
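When comparing two vectors it can help to put counts and shares next to each other, since a count can rise while the share falls. The following is only a sketch: describe_readings(), toy0, and toy1 are hypothetical names with made-up values standing in for x0 and x1.

```r
# Hypothetical helper: collect the checks from this page for one vector
describe_readings <- function(x) {
  c(
    n        = length(x),
    na_count = sum(is.na(x)),
    na_pct   = 100 * mean(is.na(x)),
    neg_pct  = 100 * mean(x < 0, na.rm = TRUE),
    mean     = mean(x, na.rm = TRUE),
    median   = median(x, na.rm = TRUE)
  )
}

# Made-up stand-ins for x0 and x1
toy0 <- c(NA, 8, 14, 5)
toy1 <- c(NA, NA, -2, 4, 6, 8, 10, 12, 3, 5)

# One column per vector, one row per statistic
comparison <- sapply(list(toy0 = toy0, toy1 = toy1), describe_readings)
round(comparison, 2)
```

In this toy case the NA count doubles from 1 to 2 while the NA percentage drops from 25 to 20, the same pattern seen between x0 and x1.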

Quantile

quantile() produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

Syntax: see Summarize - Arrange for more

Type:

Discontinuous sample quantile types 1, 2, and 3
Continuous sample quantile types 4 through 9
I’ll use type = 9, which gives approximately median-unbiased estimates when the data are normally distributed
Notice that the values match the summary() output above: each quartile falls at the same point

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, ...)

quantile(x1, probs = seq(0, 1, 0.25), na.rm = TRUE, names = TRUE, type = 9)

       0%       25%       50%       75%      100% 
-10.00000   4.00000   7.63333  12.00000 908.97000
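summary() uses type 7 by default, so the match with type = 9 above is a property of this large dataset rather than a general rule. On a small sample the two types can disagree, as this sketch with a made-up vector (toy is hypothetical) illustrates.

```r
# Made-up sample, small enough that the interpolation rules visibly differ
toy <- c(1, 2, 3, 4, 10)

q7 <- quantile(toy, probs = seq(0, 1, 0.25), type = 7)  # R's default type
q9 <- quantile(toy, probs = seq(0, 1, 0.25), type = 9)

# Side by side: one row per type, one column per probability
rbind(type7 = q7, type9 = q9)

# Minimum, median, and maximum agree; the 25% and 75% points do not
q9 - q7
```

With over a million readings the gap between neighbouring sorted values is tiny, which is why the choice of type makes little visible difference for x1.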

Thirds

Let’s say we want to see the distribution in thirds instead of quarters. We only need to change the step in probs.

quantile(x1, probs = seq(0, 1, 1/3), na.rm = TRUE, names = TRUE, type = 9)

       0% 33.33333% 66.66667%      100% 
   -10.00      5.20     10.10    908.97

The same approach works for fifths (quintiles):

quantile(x1, probs = seq(0, 1, 1/5), na.rm = TRUE, names = TRUE, type = 9)

    0%    20%    40%    60%    80%   100% 
-10.00   3.50   6.00   9.00  13.60 908.97
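The thirds and fifths calls differ only in the step passed to probs, so the pattern can be wrapped in a small function. This is a sketch, not part of the original analysis: n_tiles() and toy are hypothetical names.

```r
# Hypothetical helper: cut points that split x into n equal-probability groups
n_tiles <- function(x, n, type = 9) {
  # length.out = n + 1 yields n + 1 evenly spaced probabilities from 0 to 1
  probs <- seq(0, 1, length.out = n + 1)
  quantile(x, probs = probs, na.rm = TRUE, names = TRUE, type = type)
}

# Made-up data with one missing value
toy <- c(NA, 1:10)

thirds <- n_tiles(toy, 3)   # 4 cut points
fifths <- n_tiles(toy, 5)   # 6 cut points

thirds
fifths
```

Calling n_tiles(x1, 4) would reproduce the quartile call used earlier on this page.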

Range

range() returns the minimum and maximum of x0 and x1 (NAs removed)

range(x0, na.rm=TRUE)

[1]   0.0 157.1

range(x1, na.rm=TRUE)

[1] -10.00 908.97