Percent & Summary


library(gt)
library(dplyr)
library(tidyverse)   # also attaches dplyr, so the line above is optional
library(stats)       # attached by default; provides quantile()

Data


PM25

  • This page uses another version of the EPA PM2.5 data, brought over from Case Study - EPA - EDA 3.

preview

Let’s look at the first 10 rows of named_pm0

tab_pm0 <- named_pm0[1:10,] |>
        gt() |> 
        tab_options(table.align = "center", table.width = pct(50)) |> 
        fmt_number(suffixing = FALSE) |> 
        opt_table_outline()
tab_pm0
State.Code County.Code Site.ID Date Sample.Value
1.00 27.00 1.00 19,990,103.00 NA
1.00 27.00 1.00 19,990,106.00 NA
1.00 27.00 1.00 19,990,109.00 NA
1.00 27.00 1.00 19,990,112.00 8.84
1.00 27.00 1.00 19,990,115.00 14.92
1.00 27.00 1.00 19,990,118.00 3.88
1.00 27.00 1.00 19,990,121.00 9.04
1.00 27.00 1.00 19,990,124.00 5.46
1.00 27.00 1.00 19,990,127.00 20.17
1.00 27.00 1.00 19,990,130.00 11.56

Let’s look at x0

gt_preview(x0, top_n = 10)
data
1 NA
2 NA
3 NA
4 8.841
5 14.920
6 3.878
7 9.042
8 5.464
9 20.170
10 11.560
11..117420
117421 4.900

Percent of


PM25

NAs

  • is.na() returns TRUE for every missing value, and mean() treats TRUE as 1 and FALSE as 0, so mean(is.na(x)) is the proportion of NAs
  • Note: x0 & x1 are the sensor readings (Sample.Value) for 1999 & 2012; the Date column in the preview above starts at 19990103

What percentage of the x0 & x1 data are NAs? 11.26% and 5.61%

mean(is.na(x0))
[1] 0.1125608
mean(is.na(x1))
[1] 0.05607125
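
Why mean(is.na(x)) gives a proportion is easier to see on a small vector. This is a minimal sketch: `toy` is a made-up vector, not part of the EPA data.

```r
# Hypothetical toy vector, not taken from the EPA data
toy <- c(NA, 8.8, 14.9, NA, 3.9)

na_flags <- is.na(toy)      # logical vector: TRUE where a value is missing
na_count <- sum(na_flags)   # TRUE counts as 1, so this is the number of NAs
na_share <- mean(na_flags)  # proportion of TRUEs, i.e. the share of NAs

na_count                    # 2
round(100 * na_share, 2)    # 40
```

Multiplying by 100 and rounding, as in the last line, is how 0.1125608 becomes 11.26%.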

Negative Values

  • Sensor readings should not be negative, so before we decide what to do with such readings let’s find out how many there are
  • Note: x1 has over 1.3M rows of data (x0 has 117,421)
  • Let’s count them first
  • Then calculate the percentage of the non-missing readings (na.rm = TRUE drops the NAs)

What percentage of the x0 & x1 sensor readings are negative? 0% and 2.15%

negative_x0 <- x0<0
sum(negative_x0, na.rm = TRUE)
[1] 0
mean(negative_x0, na.rm = TRUE)
[1] 0
negative_x1 <- x1<0
sum(negative_x1, na.rm = TRUE )
[1] 26474
mean(negative_x1, na.rm=TRUE)
[1] 0.0215034
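
One detail worth keeping in mind: with na.rm = TRUE the percentage is taken over the non-missing readings, not over all rows. This is a small sketch of the difference, using a made-up vector `toy` that is not part of the EPA data.

```r
# Hypothetical toy vector, not taken from the EPA data
toy <- c(NA, -1.5, 4.0, 7.6, NA, -0.2, 12.0, 9.1)

is_negative <- toy < 0                         # comparing NA with 0 gives NA
n_negative  <- sum(is_negative, na.rm = TRUE)  # number of negative readings

# Denominator = the 6 non-missing readings (what the code above computes)
share_of_observed <- mean(is_negative, na.rm = TRUE)

# Denominator = all 8 rows, including the NAs
share_of_all_rows <- n_negative / length(toy)

c(observed = share_of_observed, all_rows = share_of_all_rows)
```

The 2.15% reported for x1 is the first kind of share: negative readings as a fraction of the readings that are actually present.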

Summary


summary() gives us

  • the minimum and maximum (the range)
  • the 1st and 3rd quartiles
  • mean
  • median
  • count of NA’s

PM25

summary(x0)  # we could have used summary(named_pm0$Sample.Value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    7.20   11.50   13.74   17.90  157.10   13217 
summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -10.00    4.00    7.63    9.14   12.00  908.97   73133 

Observations:

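  • Both mean and median decreased (mean 13.74 to 9.14, median 11.50 to 7.63)
  • The count of NA’s rose from 13,217 to 73,133, but x1 has far more rows, so the share of NA’s actually fell from 11.26% to 5.61%
  • Negative values appear in x1 (Min. is -10.00), and Max. jumped from 157.10 to 908.97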
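
The NA count and the NA share can move in opposite directions when the number of rows changes. This is a quick check that uses only figures already printed on this page; the variable names are just for illustration.

```r
# Figures copied from the summary() and mean(is.na()) output on this page
na_count_x0 <- 13217
na_share_x0 <- 0.1125608
na_count_x1 <- 73133
na_share_x1 <- 0.05607125

# Implied number of rows = NA count / NA share
rows_x0 <- na_count_x0 / na_share_x0
rows_x1 <- na_count_x1 / na_share_x1

round(rows_x0)      # 117421, the last row index in the x0 preview
rows_x1 / rows_x0   # x1 is roughly 11 times longer than x0
```

So x1 has about 5.5 times as many NAs as x0 but roughly 11 times as many rows, which is why its NA percentage is lower.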

Quantile


quantile() produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

Syntax: see Summarize - Arrange for more

Type:

  • Discontinuous sample quantile types 1, 2, and 3
  • Continuous sample quantile types 4 through 9
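  • I’ll use type = 9, whose estimates are approximately unbiased for the expected order statistics when x is normally distributed
  • Notice that the quartiles match summary() above at the displayed precision, even though summary() uses the default type = 7; with over a million readings the choice of type makes little difference

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, …)   # general form with defaults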
quantile(x1, probs = seq(0, 1, 0.25), na.rm = TRUE, names = TRUE, type = 9)
       0%       25%       50%       75%      100% 
-10.00000   4.00000   7.63333  12.00000 908.97000 
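
On a short vector the choice of type is visible. This sketch compares the default type 7 with type 9 on a made-up vector `toy` (not part of the EPA data); the extremes and the median agree here, while the 25% and 75% points differ.

```r
# Hypothetical toy vector, not taken from the EPA data
toy <- c(1, 2, 4, 7, 11, 16, 22, 29)

q7 <- quantile(toy, probs = seq(0, 1, 0.25), type = 7)  # R's default type
q9 <- quantile(toy, probs = seq(0, 1, 0.25), type = 9)  # the type used on this page

# Side-by-side comparison, one row per type
rbind(type7 = q7, type9 = q9)
```

With only 8 values the interpolation rule matters; with over a million readings, as in x1, the differences shrink below the printed precision.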

thirds

Let’s say we want to see the distribution in thirds, or in fifths, instead of quarters.

quantile(x1, probs = seq(0, 1, 1/3), na.rm = TRUE, names = TRUE, type = 9)
       0% 33.33333% 66.66667%      100% 
   -10.00      5.20     10.10    908.97 
quantile(x1, probs = seq(0, 1, 1/5), na.rm = TRUE, names = TRUE, type = 9)
    0%    20%    40%    60%    80%   100% 
-10.00   3.50   6.00   9.00  13.60 908.97 
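
The two calls above differ only in the step passed to seq(). This sketch wraps that pattern in a small helper; `n_tiles` is a hypothetical name introduced here for illustration, and `toy` is a made-up vector, not part of the EPA data.

```r
# Hypothetical helper: cut points that split x into n equal-probability parts
n_tiles <- function(x, n, type = 9) {
  probs <- seq(0, 1, length.out = n + 1)   # n parts need n + 1 cut points
  quantile(x, probs = probs, na.rm = TRUE, names = TRUE, type = type)
}

# Hypothetical toy vector, not taken from the EPA data
toy <- c(NA, 3, 5, 8, 13, 21, 34)

thirds <- n_tiles(toy, 3)   # 0%, 33.3%, 66.7%, 100%
fifths <- n_tiles(toy, 5)   # 0%, 20%, 40%, 60%, 80%, 100%
thirds
fifths
```

Using length.out = n + 1 instead of a step of 1/n avoids relying on floating-point steps landing exactly on 1.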

Range


range of x0 and x1

range(x0, na.rm=TRUE)
[1]   0.0 157.1
range(x1, na.rm=TRUE)
[1] -10.00 908.97
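
range() returns the minimum and maximum as a length-2 vector, so the width of the range is one diff() away. This is a minimal sketch on a made-up vector `toy` (not part of the EPA data) that also shows why na.rm = TRUE is needed.

```r
# Hypothetical toy vector, not taken from the EPA data
toy <- c(NA, 4.0, -10, 12.5, 908.97)

r_with_na <- range(toy)                # NA NA: a single NA poisons both ends
r         <- range(toy, na.rm = TRUE)  # -10.00 908.97
width     <- diff(r)                   # max minus min

r
width
```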