Summarize - Arrange


Data


library(palmerpenguins)
library(dplyr)
library(reshape2)  # for melt

summarise

Summarize/summarise -> We can also calculate the stats from summary() and aggregate for multiple variables and calculations after we group the data in categories. I tend to use this more than the summary() and aggregate()

Here next I’ll group the data into member_casual (which tells us what type of user we have) then we can summarise the data for each category similar to above but this time we do separate it for each category in the dataset:

# GROUP BY TYPE OF RIDER THEN CALCULATE STATISTICS
all_trips19_20_v3 %>% 
        group_by(member_casual) %>%
        summarise(max    = max(ride_length),
                  min    = min(ride_length),
                  median = median(ride_length),
                  mean   = mean(ride_length))
member_casual      max   min   median   mean
 <chr>            <dbl>  <dbl>  <dbl>   <dbl>
1 Customer      10632022     2   1393    5373.
2 Subscriber     6096428     1    508     795.

split

Split could be used see conditionals - Looping F page for more. Here is a short example

#use split to create 54 df and a count of occurences for State in each
splitdata <- sapply(split(gooddata, gooddata$State),count)

count

Count could be used to group as well, look at this, from 3000 rows it counts how many time each State occured and provides a list of all states sorted in ascending order

#creates a list of states and counts
testcount <- gooddata |>
        count(State, sort = TRUE)

quantile

See How To: Percent & Summary for examples

quantile() produces sample quantiles corresponding to the given probabilities. The smallest corresponds to a probability of 0 and the largest to a probability of 1

Syntax:

quantile(x, probs = seq(0, 1, 1/4), na.rm = FALSE, names = TRUE, type = 7, …)
  • x numeric vector whose sample quantiles are wanted, or an object of a class for which a method has been defined (see also ‘details’). NA and NaN values are not allowed in numeric vectors unless na.rm is TRUE.
  • probs numeric vector of probabilities with values in . (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.)
  • na.rm logical; if true, any NA and NaN’s are removed from x before the quantiles are computed.
  • names logical; if true, the result has a names attribute. Set to FALSE for speedup with many probs.
  • type an integer between 1 and 9 selecting one of the nine quantile algorithms detailed below to be used.
    • Discontinuous sample quantile types 1, 2, and 3
    • Continuous sample quantile types 4 through 9
  • … further arguments passed to or from other methods.
quantile(x1, probs = seq(0, 1, 1/4), na.rm = TRUE, names = TRUE, type = 9)

# if the data doesn't conatin NAs we can use
quantile(data, probs = seq(0, 1, 1/4))

n_distinct

n_distinct() counts the number of unique/distinct combinations in a set of one or more vectors. It’s a faster and more concise equivalent to nrow(unique(data.frame(…))). Usage: n_distinct(…, na.rm = FALSE)

 x <- c(1, 1, 2, 2, 2)
n_distinct(x)     # two distinct values in x 
[1] 2
 y <- c(3, 3, NA, 3, 3)
n_distinct(y)     # two distinct values as well, because na.rm defaults to FALSE
[1] 2
n_distinct(y, na.rm = TRUE)
[1] 1
n_distinct(x,y)              # I don't see how this is not 4!
[1] 3
n_distinct(x,y, na.rm = TRUE)         # we pair 1, 2 with 3 and we get 1,3 - 2,3
[1] 2

Let’s use it with the comparecare dataset as above with datain being the original data in the function, and we want to find out how many distinct States/column[7] are in the data

dis_list <- n_distinct(datain$State)

nrow

counts the number of rows in a group/df/list…

group_by


Let’s say we want to know the breakdown for each category of rider and for each day of the week. So we have to group the data into 2 categories based on 2 variables, then calculate what we need.

# USE SUMMARISE AND GROUP_BY WITH MULTIPLE VARIABLES
trips_19_20 %>%
        group_by(member_casual, week_day) %>%
        summarise( 
                max    = max(ride_length),
                min    = min(ride_length),
                median = median(ride_length),
                mean   = mean(ride_length))

`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.

# A tibble: 14 × 6 
# Groups:   member_casual [2] ...
member_casual week_day      max   min median  mean 
<chr>            <dbl>    <dbl> <dbl>  <dbl> <dbl>
1 Customer             1  8636205     3  1606. 5061. 
2 Customer             2  2363255     8  1194  4752.  
3 Customer             3  4170291    11  1180  4562. 
4 Customer             4  4482201     6  1426. 4480.  
...

add other columns

# Add count column
trips_19_20 %>%
        group_by(member_casual, week_day) %>% 
        summarise(max = max(ride_length),
                  min = min(ride_length), 
                  median = median(ride_length),
                  mean = mean(ride_length),
                  number_of_rides =n()) 
arrange(member_casual, week_day)

Table


The table() function in R can be used to quickly create frequency tables. Instead of checking for unique/distinct values we can create a table for a specific column to see how may items appear and at what frequency. For example, we just used a case() function on column member_casual to replace some values, just to make sure we ended up with ONLY two desired values and we didn’t miss anything we can use the colde below and we get the two values we want.

So table(df$column) gives us the breakdown of all values in that column

table(all_trips19_20$member_casual)
OUTPUT   
Customer Subscriber
71643     720313 

useNA

ifany

If we want to still include NA values in the table we can useNA() like this

table(table_blah$zipcodes_col, useNA = "ifany")

Summary


Is a generic function used to produce result summaries of the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument. What summary does could be done in individual commands as well as I’ll show below:

#______MAX, MIN, MEAN, MEDIAN FOR ride_length individually
min_trip_hours = round(min(all_trips19_20_v3$ ride_length)/3600, 2)
max_trip_minutes = round(max(all_trips19_20_v3$ ride_length)/60,2)
max_trip_hours = round(max(all_trips19_20_v3$ ride_length)/3600, 2)
median(all_trips19_20_v3$ ride_length)
mean(all_trips19_20_v3$ ride_length)

#_______________________DO IT ALL IN ONE STATEMENT
summary(all_trips19_20_v3$ride_length )
Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
1      331      539     1189      912 10632022 

Aggregate


  • Splits the data into subsets
  • computes summary statistics for each
  • returns the result in a convenient form
  • Very similar to the combination of summarise() and groupby above but can only be applied for one calculation, while summarise above can be a combination of many (min, max, median, mean…)

The issue with aggregate is if you want to calculate, mean, median, max, min… and other calculations, you’ll need a command for each one. So look below (summarise) for the combination of group_by and summarise to solve that with just one command

In order to use the aggregate function for mean in R, you will need to specify the numerical variable on the first argument, the categorical (as a list) on the second and the function to be applied (in this case mean) on the third. An alternative is to specify a formula of the form: numerical ~ categorical

Two variables

  • aggregate() will calculate what we want for a specific group from the dataset.
  • here below we’ll calculate the mean of the ride_length for each category in member_casual
  • Notice the column names correspond to the columns we match for, not for the mean
  • So if we want to rename the column name from ride_length to mean we’d have to do that in a separate command
  • Or we can just use the summarise() & groupby combination from above
  • The variable on the right of the ~ is the grouped by variable
  • The variable on the left side of the ~ is what we are applying the function to (mean)
  • So here below: we are calculating the mean of the ride_length for every group in member_casual
#__________________AGGREGATE TWO VARIABLES
all_trips19_20_v3 %>% 
        aggregate(ride_length ~ member_casual, mean) 
   
   member_casual ride_length 
1      Customer   5372.7839 
2    Subscriber    794.9416

Three variables

  • Now let’s breakdown the same statistic: the mean of ride_length for each group type of rider
  • but now let’s go further and break it down for each day of the week so we add + day_of_week
#______AGGREGATE WITH THREE VARIABLES
all_trips19_20_v3 %>%
        aggregate(ride_length ~ member_casual + day_of_week, mean)

    member_casual day_of_week ride_length
1       Customer      Friday   6090.7373 
2     Subscriber      Friday    796.7338 
3...

Now let’s calculate the mean of ride_length for each day of the week and further for each group type in member_casual

# ______WHAT IF WE REVERSE THE LAST TWO VARIABLES 
all_trips19_20_v3 %>%
        aggregate(ride_length ~ day_of_week + member_casual, mean) 

    day_of_week member_casual ride_length
1       Friday      Customer   6090.7373
2       Monday      Customer   4752.0504
.....
8       Friday      Subscriber 3578.2457...

Arrange


sort

  • Just as it sounds, default is ascending/increasing.
  • To reverse we set decreasing=TRUE
x <- data.frame("var1"=c(2,5,3,4,1), "var2"=c(4,2,6,NA,3))
x
  var1 var2
1    2    4
2    5    2
3    3    6
4    4   NA
5    1    3
x$var1  
[1] 2 5 3 4 1
sort(x$var1) 
[1] 1 2 3 4 5
sort(x$var1, decreasing = TRUE)
[1] 5 4 3 2 1

order

Is another way to organize the objects in order. If we use multiple indexes, R will order with the first one, and if a tie exists it will order again based on the second index to break the tie

x 
  var1 var2
1    2    4
2    5    2
3    3    6
4    4   NA
5    1    3
x[order(x$var1),]
  var1 var2
5    1    3
1    2    4
3    3    6
4    4   NA
2    5    2

arrange

To arrange the column bill_length_mm in ascending order. Asc is default. arrange() orders the rows of a data frame by the values of selected columns.

  • Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.
  • So first group your data, then use arrange() with .by_group=TRUE to arrange by groups.
  • arrange() does not group for you, data has to have been grouped already

arrange & group

penguins |>  arrange(bill_length_mm)  #same as
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.1          18.1               193        3475
 6 Adelie  Torgersen           34.4          18.4               184        3325
 7 Adelie  Biscoe              34.5          18.1               187        2900
 8 Adelie  Torgersen           34.6          21.1               198        4400
 9 Adelie  Torgersen           34.6          17.2               189        3200
10 Adelie  Biscoe              35            17.9               190        3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |>  arrange(bill_length_mm, .by_group = FALSE)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.1          18.1               193        3475
 6 Adelie  Torgersen           34.4          18.4               184        3325
 7 Adelie  Biscoe              34.5          18.1               187        2900
 8 Adelie  Torgersen           34.6          21.1               198        4400
 9 Adelie  Torgersen           34.6          17.2               189        3200
10 Adelie  Biscoe              35            17.9               190        3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |> arrange (desc(bill_length_mm))    #to arrange in desc order
# A tibble: 344 × 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo    Biscoe           59.6          17                 230        6050
 2 Chinstrap Dream            58            17.8               181        3700
 3 Gentoo    Biscoe           55.9          17                 228        5600
 4 Chinstrap Dream            55.8          19.8               207        4000
 5 Gentoo    Biscoe           55.1          16                 230        5850
 6 Gentoo    Biscoe           54.3          15.7               231        5650
 7 Chinstrap Dream            54.2          20.8               201        4300
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Gentoo    Biscoe           53.4          15.8               219        5500
10 Chinstrap Dream            52.8          20                 205        4550
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
penguins |>  arrange(desc(bill_length_mm), .by_group = TRUE)   #to group as well
# A tibble: 344 × 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo    Biscoe           59.6          17                 230        6050
 2 Chinstrap Dream            58            17.8               181        3700
 3 Gentoo    Biscoe           55.9          17                 228        5600
 4 Chinstrap Dream            55.8          19.8               207        4000
 5 Gentoo    Biscoe           55.1          16                 230        5850
 6 Gentoo    Biscoe           54.3          15.7               231        5650
 7 Chinstrap Dream            54.2          20.8               201        4300
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Gentoo    Biscoe           53.4          15.8               219        5500
10 Chinstrap Dream            52.8          20                 205        4550
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Notice that when you just run `arrange()` without saving your data to a new data frame, it does not alter the existing data frame. It just provides value as you work and lose it unless you assign it to a vector/list/df/…

multiple columns

# Arrange in ascending order of rate=col[3] first and name=col[2] second to break a tie if it exists         #and assign the row number to the new rank column         
blah <- gooddata |> 
        arrange((gooddata[3]), gooddata[2]) |>
        mutate(rank = row_number())

across

dplyr package: across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in “data-masking” functions like summarise() and mutate()

  • If we want to arrange the entire dataset
  • in descending order
  • using any column that starts with “Sepal”
  • we use
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  • As you see below the entire dataset is arranged, sorted, ordered in descending order starting with
  • the first column that starts with “Sepal”
  • then it moves across to the second column that starts_with “Sepal” and arranges the data again
  • you can see it clearly at row 9-12
  • note what it does when the first column is 7.2
desc_across <- iris |> arrange(across(starts_with("Sepal"),desc))
head(desc_across,20)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           7.9         3.8          6.4         2.0  virginica
2           7.7         3.8          6.7         2.2  virginica
3           7.7         3.0          6.1         2.3  virginica
4           7.7         2.8          6.7         2.0  virginica
5           7.7         2.6          6.9         2.3  virginica
6           7.6         3.0          6.6         2.1  virginica
7           7.4         2.8          6.1         1.9  virginica
8           7.3         2.9          6.3         1.8  virginica
9           7.2         3.6          6.1         2.5  virginica
10          7.2         3.2          6.0         1.8  virginica
11          7.2         3.0          5.8         1.6  virginica
12          7.1         3.0          5.9         2.1  virginica
13          7.0         3.2          4.7         1.4 versicolor
14          6.9         3.2          5.7         2.3  virginica
15          6.9         3.1          4.9         1.5 versicolor
16          6.9         3.1          5.4         2.1  virginica
17          6.9         3.1          5.1         2.3  virginica
18          6.8         3.2          5.9         2.3  virginica
19          6.8         3.0          5.5         2.1  virginica
20          6.8         2.8          4.8         1.4 versicolor

Options


sum

You can use sum to count all the occurences of a variable at a condition, for example

sum((all_trips19_20$ride_length == 0))
[1] 93

mean

mean avg

Remember to always use the dataset and $ followed by column. If you had used the arrange function prior to calculating the mean/avg/min/max you’d get the same value because arrange doesn’t alter the dataset, just arranges it

mean(penguins$bill_length_mm)
[1] NA

max()  min()

To find out the maximum and minimum lead times without sorting the whole dataset using the `arrange()` function. You need to specify which dataset and which column using the $ symbol between their names.

max(penguins$bill_length_mm)
[1] NA
min(penguins$bill_length_mm)
[1] NA

Reshape


wide to long

When organizing or tidying your data using R, you might need to convert wide data to long data or long to wide. Recall that this is what data looks like in a wide format spreadsheet:

Wide data has observations across several columns. Each column contains data from a different condition of the variable. In this example the columns are different years.

Now check out the same data in a long format:

To review what you already learned about the difference, long data has all the observations in a single column, and the variable conditions are placed into separate rows.

reshape package is no longer used, instead use tidyr()

melt

melt is data.table’s wide-to-long reshaping tool. We provide an S3 method for melting data.table>s. It is written in C for speed and memory efficiency.

Syntax:

melt(data, id.vars, measure.vars, variable.name = "variable", value.name="value",
     ..., na.rm=FALSE, variable.factor=TRUE, value.factor=FALSE, 
     verbose=getOption("datable.verbose"))

So here we are going to use the mtcars data and we want to assign some variables as ID vars, and some will be Measure vars.

First of all:

  • The first column is not named, the column with all the car names so it appears that these are rownames so let’s assign the rownames to a var
  • We want to have 3 columns for carname, gear, cyl
  • We also want to assign two variables to the rows of each of the columns we chose above
  • So we melt it into id=c(column names)
  • measure.vars =c(mpg, hp)
  • So it converts the data into a long and skinny one listing the mpg values for every combination of carname, gear, and cyl
  • When it’s done with the first variable mpg it’ll create the data for the second variable= hp
head(mtcars) 
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$carname <- rownames(mtcars) 
carMelt <- melt(mtcars,
                id=c("carname", "gear", "cyl"),
                measure.vars=c("mpg", "hp"))
head(carMelt)
            carname gear cyl variable value
1         Mazda RX4    4   6      mpg  21.0
2     Mazda RX4 Wag    4   6      mpg  21.0
3        Datsun 710    4   4      mpg  22.8
4    Hornet 4 Drive    3   6      mpg  21.4
5 Hornet Sportabout    3   8      mpg  18.7
6           Valiant    3   6      mpg  18.1
tail(carMelt)
          carname gear cyl variable value
59  Porsche 914-2    5   4       hp    91
60   Lotus Europa    5   4       hp   113
61 Ford Pantera L    5   8       hp   264
62   Ferrari Dino    5   6       hp   175
63  Maserati Bora    5   8       hp   335
64     Volvo 142E    4   4       hp   109
  • So it basically reshaped the data set so that it’s tall and skinny and so there’s one row for every mpg and one row for every hp.
  • So once we’ve melted the data set we can recast it in a bunch of different ways
  • We can basically reformat the data set into different sort of shapes

So to do that we’re gonna use the dcast function. 

dcast

The dcast function will recast the data set into a particular shape, into a particular data frame. So here, what we’re going to do is, we’re going to pass it the melted data set. 

  • And suppose, we wanted to see the cylinders, and we wanted to see that broken down by the different variables.
  • So as you see it will take the ID variable = cylinder and for each it will display the value variables: mpg & hp as specified above.
  • So we are putting the ID cyl in the rows and the value vars in the columns.
  • The results don’t really mean anything because if you read the warning given by dcast: “Aggregation function missing: defaulting to length”
  • It tells us that there are 11 rows for 4 cyl type cars that have mpg values and hp values
  • There are 7 rows for 6 cyl type cars
  • There are 14 rows for 8 cyl type cars
cylData <-dcast(carMelt, cyl ~ variable)
Aggregation function missing: defaulting to length
cylData 
  cyl mpg hp
1   4  11 11
2   6   7  7
3   8  14 14

Here is another example of using dcast, but this time

  • Let’s cast the variable: mpg & hp onto carnames
c_names <- dcast(carMelt, carname ~ variable)
c_names
               carname  mpg  hp
1          AMC Javelin 15.2 150
2   Cadillac Fleetwood 10.4 205
3           Camaro Z28 13.3 245
4    Chrysler Imperial 14.7 230
5           Datsun 710 22.8  93
6     Dodge Challenger 15.5 150
7           Duster 360 14.3 245
8         Ferrari Dino 19.7 175
9             Fiat 128 32.4  66
10           Fiat X1-9 27.3  66
11      Ford Pantera L 15.8 264
12         Honda Civic 30.4  52
13      Hornet 4 Drive 21.4 110
14   Hornet Sportabout 18.7 175
15 Lincoln Continental 10.4 215
16        Lotus Europa 30.4 113
17       Maserati Bora 15.0 335
18           Mazda RX4 21.0 110
19       Mazda RX4 Wag 21.0 110
20            Merc 230 22.8  95
21           Merc 240D 24.4  62
22            Merc 280 19.2 123
23           Merc 280C 17.8 123
24          Merc 450SE 16.4 180
25          Merc 450SL 17.3 180
26         Merc 450SLC 15.2 180
27    Pontiac Firebird 19.2 175
28       Porsche 914-2 26.0  91
29      Toyota Corolla 33.9  65
30       Toyota Corona 21.5  97
31             Valiant 18.1 105
32          Volvo 142E 21.4 109

mean & dcast

  • Since we have multiple cars in each cyl classifications
  • To calculate the mean for it we just add the function in the statement
  • dcast will give us the mean for all cars in each of the variables for each cyl class
cylData <- dcast(carMelt, cyl ~ variable, mean)
cylData
  cyl      mpg        hp
1   4 26.66364  82.63636
2   6 19.74286 122.28571
3   8 15.10000 209.21429