Group & Calculate


Many times we’ll want to group the data by category so we can

  • analyze it
  • or so we can save it into smaller tidy data for later analysis or to share

Both melt() and dcast() are covered in Clean & Manipulate section of Summarize - Arrange

Data: Samsung Fitbit


library(gt)
gt(extracteddata[1:5,1:5]) |> opt_table_outline()
subject activity tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
2 standing 0.2571778 -0.02328523 -0.01465376
2 standing 0.2860267 -0.01316336 -0.11908252
2 standing 0.2754848 -0.02605042 -0.11815167
2 standing 0.2702982 -0.03261387 -0.11752018
2 standing 0.2748330 -0.02784779 -0.12952716
  • same data used in Samsung Fitbit - Tidying case study document
  • extracteddata is a data.frame: 10299 x 81
  • as you see in the table, the first column is subject, second is activity, then 81 additional columns for the remaining variables

Melt to Group


Case Study

  • We want to GROUP the data by 2 ids = columns( subject & activity )
  • The remaining columns are to be the variables

Note: Melt is data.table’s wide-to-long reshaping tool.

  • melt converts the data into a long and skinny one listing the 3:81 values for every combination of Ids specified: activity and subject
  • first, we specify id=c(column names) = c(“activity”,“subject”)
  • secondly, we set measure.vars = varnames = all the other columns excluding the first two which are “activity” & “subject” = 3:81

melt

create a vector of the 81 columns for the measure.vars, and set the first two columns to the Ids

varnames <- c(names(extracteddata[3:81]))
meltdata  <- melt(extracteddata, id=c("activity", "subject"), measure.vars = varnames )
dim(meltdata)
[1] 813621      4
gt(meltdata[1:5,]) |> opt_table_outline() 
activity subject variable value
standing 2 tBodyAcc-mean()-X 0.2571778
standing 2 tBodyAcc-mean()-X 0.2860267
standing 2 tBodyAcc-mean()-X 0.2754848
standing 2 tBodyAcc-mean()-X 0.2702982
standing 2 tBodyAcc-mean()-X 0.2748330
  • So now we have a table of 813621 rows
  • 4 columns with the 2 IDs: activity & subject
  • 2 columns for the variable name & variable value
  • the table will go through each variable for each activity, for each subject and give us the value

Dcast to Calculate Mean


The dcast function will recast the data set into a particular shape/data frame.

NOTE: using dcast will allow us to make calculations vertically, in other words it will subset the data vertically into groups and process the function (in this case mean)

Case Study

  • measure the mean/average value of each variable for each subject while performing each activity
  • remember there are 30 subjects and 6 activities
  • so we should have 180 rows and 83 columns of mean values
  • as you’ll see below dcast will scan through all 30 subjects starting with the first activity “laying” and calculate the function (mean) for each column

dcast & melt

The mean of each measurement/variable for each subject while performing each activity. Will use the melted data

persubfile <- reshape2::dcast(meltdata, activity + subject ~ variable, mean)
gt(persubfile[1:7,1:5]) |>opt_table_outline()
activity subject tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
laying 1 0.2215982 -0.04051395 -0.1132036
laying 2 0.2813734 -0.01815874 -0.1072456
laying 3 0.2755169 -0.01895568 -0.1013005
laying 4 0.2635592 -0.01500318 -0.1106882
laying 5 0.2783343 -0.01830421 -0.1079376
laying 6 0.2486565 -0.01025292 -0.1331196
laying 7 0.2501767 -0.02044115 -0.1013610

The mean for each measurement of each activity grouped by each subject

 peractivityfile <- reshape2::dcast(meltdata,  subject + activity ~ variable, mean)
 gt(peractivityfile[1:7, 1:5]) |> opt_table_outline()
subject activity tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 laying 0.2215982 -0.040513953 -0.1132036
1 sitting 0.2612376 -0.001308288 -0.1045442
1 standing 0.2789176 -0.016137590 -0.1106018
1 walking 0.2773308 -0.017383819 -0.1111481
1 walking_downstairs 0.2891883 -0.009918505 -0.1075662
1 walking_upstairs 0.2554617 -0.023953149 -0.0973020
2 laying 0.2813734 -0.018158740 -0.1072456

The mean for each measurement grouped by activity for all subjects

allsub_peractivity <- reshape2::dcast(meltdata,  activity ~ variable, mean)
gt(allsub_peractivity[1:6,1:5]) |> opt_table_outline()
activity tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X
laying 0.2686486 -0.01831773 -0.1074356 -0.9609324
sitting 0.2730596 -0.01268957 -0.1055170 -0.9834462
standing 0.2791535 -0.01615189 -0.1065869 -0.9844347
walking 0.2763369 -0.01790683 -0.1088817 -0.3146445
walking_downstairs 0.2881372 -0.01631193 -0.1057616 0.1007663
walking_upstairs 0.2622946 -0.02592329 -0.1205379 -0.2379897

Mean of Groups


In the section above we used dcast & melt to calculate the mean for each variable vertically per group and we had 81 variable. But if we want to calculate the mean of all the variables for each group then we need to calculate it on the melted data prior to using dcast (because dcast recasts the data from narrow and long to wide)

Remember what meltdata looked like (narrow and long)

gt(meltdata[1:5,]) |> opt_table_outline()
activity subject variable value
standing 2 tBodyAcc-mean()-X 0.2571778
standing 2 tBodyAcc-mean()-X 0.2860267
standing 2 tBodyAcc-mean()-X 0.2754848
standing 2 tBodyAcc-mean()-X 0.2702982
standing 2 tBodyAcc-mean()-X 0.2748330

Case Study

  • using meltdata which is 813621 X 4 calculate the mean of all variables for each activity

aggregate & melt 1 group

The mean for all variables together for each activity using melt

meltdata |> aggregate(value ~ activity, mean) |>
        gt() |> opt_table_outline()
activity value
laying -0.6010010
sitting -0.6028592
standing -0.6139060
walking -0.2344983
walking_downstairs -0.1476207
walking_upstairs -0.2904126

aggregate w/o melt 1 group

across

rowMeans

Can we do it without melting the data?

  • If you remember meltdata ( 8136621 X 4 ) is the byproduct of using melt() on extracteddata (10299 X 81)
  • Can we calculate the mean of each activity for all variables using aggregate without melting first?
  • First: we calculate the mean of each row across all the columns except for the first two
  • Create a new column called meanofrow to hold the mean of each row which is calculated by rowMeans()
  • Aggregate that column of meanofrow for each activity using the mean() function
  • We get the same result as above

The mean for all variables together for each activity without using melt

library(dplyr)
blah <- extracteddata |>
        mutate(meanofrow=rowMeans(across(-c(subject,activity))))

blah |> aggregate(meanofrow ~ activity, mean)|>
        gt() |> opt_table_outline()
activity meanofrow
laying -0.6010010
sitting -0.6028592
standing -0.6139060
walking -0.2344983
walking_downstairs -0.1476207
walking_upstairs -0.2904126

Case Study

aggregate & melt 2 groups

Calculate the mean of all variables per activity for each subject

meltdata |> aggregate(value ~ subject + activity, mean) |>  
        gt_preview(top_n = 6) |> opt_table_outline() 
subject activity value
1 1 laying -0.5629116
2 2 laying -0.5945549
3 3 laying -0.5984263
4 4 laying -0.6002759
5 5 laying -0.5960274
6 6 laying -0.5824931
7..179
180 30 walking_upstairs -0.3514358

Case Study

  • What if we just want the mean of all variables/measurements for each activity for only one user/subject #6?
  • We already calculated the mean of all variables for each activity above in aggregate & melt 1 group
  • Let’s look at it again
Mean of all vars/activity
activity value
laying -0.6010010
sitting -0.6028592
standing -0.6139060
walking -0.2344983
walking_downstairs -0.1476207
walking_upstairs -0.2904126

aggregate & filter

Mean of all vars/activity for Subject #6

subset(meltdata, subject == "6") |> aggregate(value ~ activity, mean) |>
        gt() |> 
        opt_table_outline() |>
        tab_header( title = "Mean of all vars/activity",
                    subtitle = "for Subject #6")
Mean of all vars/activity
for Subject #6
activity value
laying -0.582493068
sitting -0.591247924
standing -0.608446225
walking -0.189527910
walking_downstairs -0.001814892
walking_upstairs -0.186050139

Split


Case Study

  • using meltdata which is 813621 X 4 calculate the mean of all variables for each activity
  • Don’t use melt, aggregate

split

lapply

colMeans

  • Split data by activity
  • calculate the mean of each of the 81 columns using lapply & colMeans
  • now you’ll have a 6X81 object with all the means for each variable per activity
  • use lapply & mean to calculate the mean of each row
  • now, you have a list of 6 means for each of the 6 acitivities
  • convert the list to df so we can plot it in gt
s <- split(extracteddata,extracteddata$activity)
# s is a list of 6 dfs one for each activity 
result <- lapply(s, function(extracteddata) colMeans(extracteddata[ ,3:81]))
# result is a list of 6 dfs each one has 81 columns one for each mean of that variable 
#(a list of means)
act_mean <- lapply(result, mean)
# act_mean is a list of means for each activity - convert to df and create table
data.frame(act_mean)|>  gt() |> opt_table_outline()
laying sitting standing walking walking_downstairs walking_upstairs
-0.601001 -0.6028592 -0.613906 -0.2344983 -0.1476207 -0.2904126

Factor


act_lvls <- c("walking","laying","sitting","standing","walking_downstairs",
              "walking_upstairs")
ac <- factor(extracteddata, levels = act_lvls)
sort(ac)
factor()
6 Levels: walking laying sitting standing ... walking_upstairs