library(palmerpenguins)
library(dplyr)
library(reshape2) # for melt
Summarize - Arrange
Data
summarise
Summarize/summarise -> We can also calculate the stats from summary() and aggregate for multiple variables and calculations after we group the data in categories. I tend to use this more than the summary() and aggregate()
Here next I’ll group the data into member_casual (which tells us what type of user we have) then we can summarise the data for each category similar to above but this time we do separate it for each category in the dataset:
# GROUP BY TYPE OF RIDER THEN CALCULATE STATISTICS
%>%
all_trips19_20_v3 group_by(member_casual) %>%
summarise(max = max(ride_length),
min = min(ride_length),
median = median(ride_length),
mean = mean(ride_length))
member_casual max min median mean<chr> <dbl> <dbl> <dbl> <dbl>
1 Customer 10632022 2 1393 5373.
2 Subscriber 6096428 1 508 795.
split
Split could be used see conditionals - Looping F page for more. Here is a short example
#use split to create 54 df and a count of occurences for State in each
<- sapply(split(gooddata, gooddata$State),count) splitdata
count
Count could be used to group as well, look at this, from 3000 rows it counts how many time each State occured and provides a list of all states sorted in ascending order
#creates a list of states and counts
<- gooddata |>
testcount count(State, sort = TRUE)
quantile
See How To: Percent & Summary for examples
quantile() produces sample quantiles corresponding to the given probabilities. The smallest corresponds to a probability of 0 and the largest to a probability of 1
Syntax:
quantile(x, probs = seq(0, 1, 1/4), na.rm = FALSE, names = TRUE, type = 7, …)
- x numeric vector whose sample quantiles are wanted, or an object of a class for which a method has been defined (see also ‘details’). NA and NaN values are not allowed in numeric vectors unless na.rm is TRUE.
- probs numeric vector of probabilities with values in . (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.)
- na.rm logical; if true, any NA and NaN’s are removed from x before the quantiles are computed.
- names logical; if true, the result has a names attribute. Set to FALSE for speedup with many probs.
- type an integer between 1 and 9 selecting one of the nine quantile algorithms detailed below to be used.
- Discontinuous sample quantile types 1, 2, and 3
- Continuous sample quantile types 4 through 9
- … further arguments passed to or from other methods.
quantile(x1, probs = seq(0, 1, 1/4), na.rm = TRUE, names = TRUE, type = 9)
# if the data doesn't conatin NAs we can use
quantile(data, probs = seq(0, 1, 1/4))
n_distinct
n_distinct() counts the number of unique/distinct combinations in a set of one or more vectors. It’s a faster and more concise equivalent to nrow(unique(data.frame(…))). Usage: n_distinct(…, na.rm = FALSE)
<- c(1, 1, 2, 2, 2)
x n_distinct(x) # two distinct values in x
[1] 2
<- c(3, 3, NA, 3, 3)
y n_distinct(y) # two distinct values as well, because na.rm defaults to FALSE
[1] 2
n_distinct(y, na.rm = TRUE)
[1] 1
n_distinct(x,y) # I don't see how this is not 4!
[1] 3
n_distinct(x,y, na.rm = TRUE) # we pair 1, 2 with 3 and we get 1,3 - 2,3
[1] 2
Let’s use it with the comparecare dataset as above with datain being the original data in the function, and we want to find out how many distinct States/column[7] are in the data
<- n_distinct(datain$State) dis_list
nrow
counts the number of rows in a group/df/list…
group_by
Let’s say we want to know the breakdown for each category of rider and for each day of the week. So we have to group the data into 2 categories based on 2 variables, then calculate what we need.
# USE SUMMARISE AND GROUP_BY WITH MULTIPLE VARIABLES
%>%
trips_19_20 group_by(member_casual, week_day) %>%
summarise(
max = max(ride_length),
min = min(ride_length),
median = median(ride_length),
mean = mean(ride_length))
`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
# A tibble: 14 × 6
# Groups: member_casual [2] ...
member_casual week_day max min median mean <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Customer 1 8636205 3 1606. 5061.
2 Customer 2 2363255 8 1194 4752.
3 Customer 3 4170291 11 1180 4562.
4 Customer 4 4482201 6 1426. 4480.
...
add other columns
# Add count column
%>%
trips_19_20 group_by(member_casual, week_day) %>%
summarise(max = max(ride_length),
min = min(ride_length),
median = median(ride_length),
mean = mean(ride_length),
number_of_rides =n())
arrange(member_casual, week_day)
Table
The table() function in R can be used to quickly create frequency tables. Instead of checking for unique/distinct values we can create a table for a specific column to see how may items appear and at what frequency. For example, we just used a case() function on column member_casual to replace some values, just to make sure we ended up with ONLY two desired values and we didn’t miss anything we can use the colde below and we get the two values we want.
So table(df$column) gives us the breakdown of all values in that column
table(all_trips19_20$member_casual)
OUTPUT
Customer Subscriber71643 720313
useNA
ifany
If we want to still include NA values in the table we can useNA() like this
table(table_blah$zipcodes_col, useNA = "ifany")
Summary
Is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods
which depend on the class
of the first argument. What summary does could be done in individual commands as well as I’ll show below:
#______MAX, MIN, MEAN, MEDIAN FOR ride_length individually
= round(min(all_trips19_20_v3$ ride_length)/3600, 2)
min_trip_hours = round(max(all_trips19_20_v3$ ride_length)/60,2)
max_trip_minutes = round(max(all_trips19_20_v3$ ride_length)/3600, 2)
max_trip_hours median(all_trips19_20_v3$ ride_length)
mean(all_trips19_20_v3$ ride_length)
#_______________________DO IT ALL IN ONE STATEMENT
summary(all_trips19_20_v3$ride_length )
1st Qu. Median Mean 3rd Qu. Max.
Min. 1 331 539 1189 912 10632022
Aggregate
- Splits the data into subsets
- computes summary statistics for each
- returns the result in a convenient form
- Very similar to the combination of summarise() and groupby above but can only be applied for one calculation, while summarise above can be a combination of many (min, max, median, mean…)
The issue with aggregate is if you want to calculate, mean, median, max, min… and other calculations, you’ll need a command for each one. So look below (summarise) for the combination of group_by and summarise to solve that with just one command
In order to use the aggregate
function for mean in R, you will need to specify the numerical variable on the first argument, the categorical (as a list) on the second and the function to be applied (in this case mean
) on the third. An alternative is to specify a formula of the form: numerical ~ categorical
Two variables
- aggregate() will calculate what we want for a specific group from the dataset.
- here below we’ll calculate the mean of the ride_length for each category in member_casual
- Notice the column names correspond to the columns we match for, not for the mean
- So if we want to rename the column name from ride_length to mean we’d have to do that in a separate command
- Or we can just use the summarise() & groupby combination from above
- The variable on the right of the ~ is the grouped by variable
- The variable on the left side of the ~ is what we are applying the function to (mean)
- So here below: we are calculating the mean of the ride_length for every group in member_casual
#__________________AGGREGATE TWO VARIABLES
%>%
all_trips19_20_v3 aggregate(ride_length ~ member_casual, mean)
member_casual ride_length 1 Customer 5372.7839
2 Subscriber 794.9416
Three variables
- Now let’s breakdown the same statistic: the mean of ride_length for each group type of rider
- but now let’s go further and break it down for each day of the week so we add + day_of_week
#______AGGREGATE WITH THREE VARIABLES
%>%
all_trips19_20_v3 aggregate(ride_length ~ member_casual + day_of_week, mean)
member_casual day_of_week ride_length1 Customer Friday 6090.7373
2 Subscriber Friday 796.7338
3...
Now let’s calculate the mean of ride_length for each day of the week and further for each group type in member_casual
# ______WHAT IF WE REVERSE THE LAST TWO VARIABLES
%>%
all_trips19_20_v3 aggregate(ride_length ~ day_of_week + member_casual, mean)
day_of_week member_casual ride_length1 Friday Customer 6090.7373
2 Monday Customer 4752.0504
.....8 Friday Subscriber 3578.2457...
Arrange
sort
- Just as it sounds, default is ascending/increasing.
- To reverse we set decreasing=TRUE
<- data.frame("var1"=c(2,5,3,4,1), "var2"=c(4,2,6,NA,3))
x x
var1 var2
1 2 4
2 5 2
3 3 6
4 4 NA
5 1 3
$var1 x
[1] 2 5 3 4 1
sort(x$var1)
[1] 1 2 3 4 5
sort(x$var1, decreasing = TRUE)
[1] 5 4 3 2 1
order
Is another way to organize the objects in order. If we use multiple indexes, R will order with the first one, and if a tie exists it will order again based on the second index to break the tie
x
var1 var2
1 2 4
2 5 2
3 3 6
4 4 NA
5 1 3
order(x$var1),] x[
var1 var2
5 1 3
1 2 4
3 3 6
4 4 NA
2 5 2
arrange
To arrange the column bill_length_mm in ascending order. Asc is default.
arrange()
orders the rows of a data frame by the values of selected columns.
- Unlike other dplyr verbs,
arrange()
largely ignores grouping; you need to explicitly mention grouping variables (or use.by_group = TRUE
) in order to group by them, and functions of variables are evaluated once per data frame, not once per group. - So first group your data, then use arrange() with .by_group=TRUE to arrange by groups.
- arrange() does not group for you, data has to have been grouped already
arrange & group
|> arrange(bill_length_mm) #same as penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 32.1 15.5 188 3050
2 Adelie Dream 33.1 16.1 178 2900
3 Adelie Torgersen 33.5 19 190 3600
4 Adelie Dream 34 17.1 185 3400
5 Adelie Torgersen 34.1 18.1 193 3475
6 Adelie Torgersen 34.4 18.4 184 3325
7 Adelie Biscoe 34.5 18.1 187 2900
8 Adelie Torgersen 34.6 21.1 198 4400
9 Adelie Torgersen 34.6 17.2 189 3200
10 Adelie Biscoe 35 17.9 190 3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
|> arrange(bill_length_mm, .by_group = FALSE) penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 32.1 15.5 188 3050
2 Adelie Dream 33.1 16.1 178 2900
3 Adelie Torgersen 33.5 19 190 3600
4 Adelie Dream 34 17.1 185 3400
5 Adelie Torgersen 34.1 18.1 193 3475
6 Adelie Torgersen 34.4 18.4 184 3325
7 Adelie Biscoe 34.5 18.1 187 2900
8 Adelie Torgersen 34.6 21.1 198 4400
9 Adelie Torgersen 34.6 17.2 189 3200
10 Adelie Biscoe 35 17.9 190 3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
|> arrange (desc(bill_length_mm)) #to arrange in desc order penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Chinstrap Dream 58 17.8 181 3700
3 Gentoo Biscoe 55.9 17 228 5600
4 Chinstrap Dream 55.8 19.8 207 4000
5 Gentoo Biscoe 55.1 16 230 5850
6 Gentoo Biscoe 54.3 15.7 231 5650
7 Chinstrap Dream 54.2 20.8 201 4300
8 Chinstrap Dream 53.5 19.9 205 4500
9 Gentoo Biscoe 53.4 15.8 219 5500
10 Chinstrap Dream 52.8 20 205 4550
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
|> arrange(desc(bill_length_mm), .by_group = TRUE) #to group as well penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 59.6 17 230 6050
2 Chinstrap Dream 58 17.8 181 3700
3 Gentoo Biscoe 55.9 17 228 5600
4 Chinstrap Dream 55.8 19.8 207 4000
5 Gentoo Biscoe 55.1 16 230 5850
6 Gentoo Biscoe 54.3 15.7 231 5650
7 Chinstrap Dream 54.2 20.8 201 4300
8 Chinstrap Dream 53.5 19.9 205 4500
9 Gentoo Biscoe 53.4 15.8 219 5500
10 Chinstrap Dream 52.8 20 205 4550
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Notice that when you just run `arrange()` without saving your data to a new data frame, it does not alter the existing data frame. It just provides value as you work and lose it unless you assign it to a vector/list/df/…
multiple columns
# Arrange in ascending order of rate=col[3] first and name=col[2] second to break a tie if it exists #and assign the row number to the new rank column
<- gooddata |>
blah arrange((gooddata[3]), gooddata[2]) |>
mutate(rank = row_number())
across
dplyr package: across()
makes it easy to apply the same transformation to multiple columns, allowing you to use select()
semantics inside in “data-masking” functions like summarise()
and mutate()
- If we want to arrange the entire dataset
- in descending order
- using any column that starts with “Sepal”
- we use
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
- As you see below the entire dataset is arranged, sorted, ordered in descending order starting with
- the first column that starts with “Sepal”
- then it moves across to the second column that starts_with “Sepal” and arranges the data again
- you can see it clearly at row 9-12
- note what it does when the first column is 7.2
<- iris |> arrange(across(starts_with("Sepal"),desc))
desc_across head(desc_across,20)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.9 3.8 6.4 2.0 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.7 3.0 6.1 2.3 virginica
4 7.7 2.8 6.7 2.0 virginica
5 7.7 2.6 6.9 2.3 virginica
6 7.6 3.0 6.6 2.1 virginica
7 7.4 2.8 6.1 1.9 virginica
8 7.3 2.9 6.3 1.8 virginica
9 7.2 3.6 6.1 2.5 virginica
10 7.2 3.2 6.0 1.8 virginica
11 7.2 3.0 5.8 1.6 virginica
12 7.1 3.0 5.9 2.1 virginica
13 7.0 3.2 4.7 1.4 versicolor
14 6.9 3.2 5.7 2.3 virginica
15 6.9 3.1 4.9 1.5 versicolor
16 6.9 3.1 5.4 2.1 virginica
17 6.9 3.1 5.1 2.3 virginica
18 6.8 3.2 5.9 2.3 virginica
19 6.8 3.0 5.5 2.1 virginica
20 6.8 2.8 4.8 1.4 versicolor
Options
sum
You can use sum to count all the occurences of a variable at a condition, for example
sum((all_trips19_20$ride_length == 0))
1] 93 [
mean
mean avg
Remember to always use the dataset and $ followed by column. If you had used the arrange function prior to calculating the mean/avg/min/max you’d get the same value because arrange doesn’t alter the dataset, just arranges it
mean(penguins$bill_length_mm)
[1] NA
max() min()
To find out the maximum and minimum lead times without sorting the whole dataset using the `arrange()` function. You need to specify which dataset and which column using the $ symbol between their names.
max(penguins$bill_length_mm)
[1] NA
min(penguins$bill_length_mm)
[1] NA
Reshape
wide to long
When organizing or tidying your data using R, you might need to convert wide data to long data or long to wide. Recall that this is what data looks like in a wide format spreadsheet:
Wide data has observations across several columns. Each column contains data from a different condition of the variable. In this example the columns are different years.
Now check out the same data in a long format:
To review what you already learned about the difference, long data has all the observations in a single column, and the variable conditions are placed into separate rows.
reshape package is no longer used, instead use tidyr()
melt
melt is data.table’s wide-to-long reshaping tool. We provide an S3 method for melting data.table>s. It is written in C for speed and memory efficiency.
Syntax:
melt(data, id.vars, measure.vars, variable.name = "variable", value.name="value",
na.rm=FALSE, variable.factor=TRUE, value.factor=FALSE,
..., verbose=getOption("datable.verbose"))
So here we are going to use the mtcars data and we want to assign some variables as ID vars, and some will be Measure vars.
First of all:
- The first column is not named, the column with all the car names so it appears that these are rownames so let’s assign the rownames to a var
- We want to have 3 columns for carname, gear, cyl
- We also want to assign two variables to the rows of each of the columns we chose above
- So we melt it into id=c(column names)
- measure.vars =c(mpg, hp)
- So it converts the data into a long and skinny one listing the mpg values for every combination of carname, gear, and cyl
- When it’s done with the first variable mpg it’ll create the data for the second variable= hp
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
$carname <- rownames(mtcars)
mtcars<- melt(mtcars,
carMelt id=c("carname", "gear", "cyl"),
measure.vars=c("mpg", "hp"))
head(carMelt)
carname gear cyl variable value
1 Mazda RX4 4 6 mpg 21.0
2 Mazda RX4 Wag 4 6 mpg 21.0
3 Datsun 710 4 4 mpg 22.8
4 Hornet 4 Drive 3 6 mpg 21.4
5 Hornet Sportabout 3 8 mpg 18.7
6 Valiant 3 6 mpg 18.1
tail(carMelt)
carname gear cyl variable value
59 Porsche 914-2 5 4 hp 91
60 Lotus Europa 5 4 hp 113
61 Ford Pantera L 5 8 hp 264
62 Ferrari Dino 5 6 hp 175
63 Maserati Bora 5 8 hp 335
64 Volvo 142E 4 4 hp 109
- So it basically reshaped the data set so that it’s tall and skinny and so there’s one row for every mpg and one row for every hp.
- So once we’ve melted the data set we can recast it in a bunch of different ways
- We can basically reformat the data set into different sort of shapes
So to do that we’re gonna use the dcast function.
dcast
The dcast function will recast the data set into a particular shape, into a particular data frame. So here, what we’re going to do is, we’re going to pass it the melted data set.
- And suppose, we wanted to see the cylinders, and we wanted to see that broken down by the different variables.
- So as you see it will take the ID variable = cylinder and for each it will display the value variables: mpg & hp as specified above.
- So we are putting the ID cyl in the rows and the value vars in the columns.
- The results don’t really mean anything because if you read the warning given by dcast: “Aggregation function missing: defaulting to length”
- It tells us that there are 11 rows for 4 cyl type cars that have mpg values and hp values
- There are 7 rows for 6 cyl type cars
- There are 14 rows for 8 cyl type cars
<-dcast(carMelt, cyl ~ variable) cylData
Aggregation function missing: defaulting to length
cylData
cyl mpg hp
1 4 11 11
2 6 7 7
3 8 14 14
Here is another example of using dcast, but this time
- Let’s cast the variable: mpg & hp onto carnames
<- dcast(carMelt, carname ~ variable)
c_names c_names
carname mpg hp
1 AMC Javelin 15.2 150
2 Cadillac Fleetwood 10.4 205
3 Camaro Z28 13.3 245
4 Chrysler Imperial 14.7 230
5 Datsun 710 22.8 93
6 Dodge Challenger 15.5 150
7 Duster 360 14.3 245
8 Ferrari Dino 19.7 175
9 Fiat 128 32.4 66
10 Fiat X1-9 27.3 66
11 Ford Pantera L 15.8 264
12 Honda Civic 30.4 52
13 Hornet 4 Drive 21.4 110
14 Hornet Sportabout 18.7 175
15 Lincoln Continental 10.4 215
16 Lotus Europa 30.4 113
17 Maserati Bora 15.0 335
18 Mazda RX4 21.0 110
19 Mazda RX4 Wag 21.0 110
20 Merc 230 22.8 95
21 Merc 240D 24.4 62
22 Merc 280 19.2 123
23 Merc 280C 17.8 123
24 Merc 450SE 16.4 180
25 Merc 450SL 17.3 180
26 Merc 450SLC 15.2 180
27 Pontiac Firebird 19.2 175
28 Porsche 914-2 26.0 91
29 Toyota Corolla 33.9 65
30 Toyota Corona 21.5 97
31 Valiant 18.1 105
32 Volvo 142E 21.4 109
mean & dcast
- Since we have multiple cars in each cyl classifications
- To calculate the mean for it we just add the function in the statement
- dcast will give us the mean for all cars in each of the variables for each cyl class
<- dcast(carMelt, cyl ~ variable, mean)
cylData cylData
cyl mpg hp
1 4 26.66364 82.63636
2 6 19.74286 122.28571
3 8 15.10000 209.21429