Group & Calculate

Many times we’ll want to group the data by category so we can

analyze it
or so we can save it into smaller tidy data for later analysis or to share

Both melt() and dcast() are covered in Clean & Manipulate section of Summarize - Arrange

Data: Samsung Fitbit

library(gt)
gt(extracteddata[1:5,1:5]) |> opt_table_outline()

subject	activity	tBodyAcc-mean()-X	tBodyAcc-mean()-Y	tBodyAcc-mean()-Z
2	standing	0.2571778	-0.02328523	-0.01465376
2	standing	0.2860267	-0.01316336	-0.11908252
2	standing	0.2754848	-0.02605042	-0.11815167
2	standing	0.2702982	-0.03261387	-0.11752018
2	standing	0.2748330	-0.02784779	-0.12952716

same data used in Samsung Fitbit - Tidying case study document
extracteddata is a data.frame: 10299 x 81
as you see in the table, the first column is subject, second is activity, then 81 additional columns for the remaining variables

Melt to Group

Case Study

We want to GROUP the data by 2 ids = columns( subject & activity )
The remaining columns are to be the variables

Note: Melt is data.table’s wide-to-long reshaping tool.

melt converts the data into a long and skinny one listing the 3:81 values for every combination of Ids specified: activity and subject
first, we specify id=c(column names) = c(“activity”,“subject”)
secondly, we set measure.vars = varnames = all the other columns excluding the first two which are “activity” & “subject” = 3:81

melt

create a vector of the 81 columns for the measure.vars, and set the first two columns to the Ids

varnames <- c(names(extracteddata[3:81]))
meltdata  <- melt(extracteddata, id=c("activity", "subject"), measure.vars = varnames )

dim(meltdata)

[1] 813621      4

gt(meltdata[1:5,]) |> opt_table_outline()

activity	subject	variable	value
standing	2	tBodyAcc-mean()-X	0.2571778
standing	2	tBodyAcc-mean()-X	0.2860267
standing	2	tBodyAcc-mean()-X	0.2754848
standing	2	tBodyAcc-mean()-X	0.2702982
standing	2	tBodyAcc-mean()-X	0.2748330

So now we have a table of 813621 rows
4 columns with the 2 IDs: activity & subject
2 columns for the variable name & variable value
the table will go through each variable for each activity, for each subject and give us the value

Dcast to Calculate Mean

The dcast function will recast the data set into a particular shape/data frame.

NOTE: using dcast will allow us to make calculations vertically, in other words it will subset the data vertically into groups and process the function (in this case mean)

Case Study

measure the mean/average value of each variable for each subject while performing each activity
remember there are 30 subjects and 6 activities
so we should have 180 rows and 83 columns of mean values
as you’ll see below dcast will scan through all 30 subjects starting with the first activity “laying” and calculate the function (mean) for each column

dcast & melt

The mean of each measurement/variable for each subject while performing each activity. Will use the melted data

persubfile <- reshape2::dcast(meltdata, activity + subject ~ variable, mean)
gt(persubfile[1:7,1:5]) |>opt_table_outline()

activity	subject	tBodyAcc-mean()-X	tBodyAcc-mean()-Y	tBodyAcc-mean()-Z
laying	1	0.2215982	-0.04051395	-0.1132036
laying	2	0.2813734	-0.01815874	-0.1072456
laying	3	0.2755169	-0.01895568	-0.1013005
laying	4	0.2635592	-0.01500318	-0.1106882
laying	5	0.2783343	-0.01830421	-0.1079376
laying	6	0.2486565	-0.01025292	-0.1331196
laying	7	0.2501767	-0.02044115	-0.1013610

The mean for each measurement of each activity grouped by each subject

 peractivityfile <- reshape2::dcast(meltdata,  subject + activity ~ variable, mean)
 gt(peractivityfile[1:7, 1:5]) |> opt_table_outline()

subject	activity	tBodyAcc-mean()-X	tBodyAcc-mean()-Y	tBodyAcc-mean()-Z
1	laying	0.2215982	-0.040513953	-0.1132036
1	sitting	0.2612376	-0.001308288	-0.1045442
1	standing	0.2789176	-0.016137590	-0.1106018
1	walking	0.2773308	-0.017383819	-0.1111481
1	walking_downstairs	0.2891883	-0.009918505	-0.1075662
1	walking_upstairs	0.2554617	-0.023953149	-0.0973020
2	laying	0.2813734	-0.018158740	-0.1072456

The mean for each measurement grouped by activity for all subjects

allsub_peractivity <- reshape2::dcast(meltdata,  activity ~ variable, mean)
gt(allsub_peractivity[1:6,1:5]) |> opt_table_outline()

activity	tBodyAcc-mean()-X	tBodyAcc-mean()-Y	tBodyAcc-mean()-Z	tBodyAcc-std()-X
laying	0.2686486	-0.01831773	-0.1074356	-0.9609324
sitting	0.2730596	-0.01268957	-0.1055170	-0.9834462
standing	0.2791535	-0.01615189	-0.1065869	-0.9844347
walking	0.2763369	-0.01790683	-0.1088817	-0.3146445
walking_downstairs	0.2881372	-0.01631193	-0.1057616	0.1007663
walking_upstairs	0.2622946	-0.02592329	-0.1205379	-0.2379897

Mean of Groups

In the section above we used dcast & melt to calculate the mean for each variable vertically per group and we had 81 variable. But if we want to calculate the mean of all the variables for each group then we need to calculate it on the melted data prior to using dcast (because dcast recasts the data from narrow and long to wide)

Remember what meltdata looked like (narrow and long)

gt(meltdata[1:5,]) |> opt_table_outline()

activity	subject	variable	value
standing	2	tBodyAcc-mean()-X	0.2571778
standing	2	tBodyAcc-mean()-X	0.2860267
standing	2	tBodyAcc-mean()-X	0.2754848
standing	2	tBodyAcc-mean()-X	0.2702982
standing	2	tBodyAcc-mean()-X	0.2748330

Case Study

using meltdata which is 813621 X 4 calculate the mean of all variables for each activity

aggregate & melt 1 group

The mean for all variables together for each activity using melt

meltdata |> aggregate(value ~ activity, mean) |>
        gt() |> opt_table_outline()

activity	value
laying	-0.6010010
sitting	-0.6028592
standing	-0.6139060
walking	-0.2344983
walking_downstairs	-0.1476207
walking_upstairs	-0.2904126

aggregate w/o melt 1 group

across

rowMeans

Can we do it without melting the data?

If you remember meltdata ( 8136621 X 4 ) is the byproduct of using melt() on extracteddata (10299 X 81)
Can we calculate the mean of each activity for all variables using aggregate without melting first?
First: we calculate the mean of each row across all the columns except for the first two
Create a new column called meanofrow to hold the mean of each row which is calculated by rowMeans()
Aggregate that column of meanofrow for each activity using the mean() function
We get the same result as above

The mean for all variables together for each activity without using melt

library(dplyr)
blah <- extracteddata |>
        mutate(meanofrow=rowMeans(across(-c(subject,activity))))

blah |> aggregate(meanofrow ~ activity, mean)|>
        gt() |> opt_table_outline()

activity	meanofrow
laying	-0.6010010
sitting	-0.6028592
standing	-0.6139060
walking	-0.2344983
walking_downstairs	-0.1476207
walking_upstairs	-0.2904126

Case Study

aggregate & melt 2 groups

Calculate the mean of all variables per activity for each subject

meltdata |> aggregate(value ~ subject + activity, mean) |>  
        gt_preview(top_n = 6) |> opt_table_outline()

	subject	activity	value
1	1	laying	-0.5629116
2	2	laying	-0.5945549
3	3	laying	-0.5984263
4	4	laying	-0.6002759
5	5	laying	-0.5960274
6	6	laying	-0.5824931
7..179
180	30	walking_upstairs	-0.3514358

Case Study

What if we just want the mean of all variables/measurements for each activity for only one user/subject #6?
We already calculated the mean of all variables for each activity above in aggregate & melt 1 group
Let’s look at it again

Mean of all vars/activity
activity	value
laying	-0.6010010
sitting	-0.6028592
standing	-0.6139060
walking	-0.2344983
walking_downstairs	-0.1476207
walking_upstairs	-0.2904126

aggregate & filter

Mean of all vars/activity for Subject #6

subset(meltdata, subject == "6") |> aggregate(value ~ activity, mean) |>
        gt() |> 
        opt_table_outline() |>
        tab_header( title = "Mean of all vars/activity",
                    subtitle = "for Subject #6")

Mean of all vars/activity
for Subject #6
activity	value
laying	-0.582493068
sitting	-0.591247924
standing	-0.608446225
walking	-0.189527910
walking_downstairs	-0.001814892
walking_upstairs	-0.186050139

Split

Case Study

using meltdata which is 813621 X 4 calculate the mean of all variables for each activity
Don’t use melt, aggregate

split

lapply

colMeans

Split data by activity
calculate the mean of each of the 81 columns using lapply & colMeans
now you’ll have a 6X81 object with all the means for each variable per activity
use lapply & mean to calculate the mean of each row
now, you have a list of 6 means for each of the 6 acitivities
convert the list to df so we can plot it in gt

s <- split(extracteddata,extracteddata$activity)
# s is a list of 6 dfs one for each activity 
result <- lapply(s, function(extracteddata) colMeans(extracteddata[ ,3:81]))
# result is a list of 6 dfs each one has 81 columns one for each mean of that variable 
#(a list of means)
act_mean <- lapply(result, mean)
# act_mean is a list of means for each activity - convert to df and create table
data.frame(act_mean)|>  gt() |> opt_table_outline()

laying	sitting	standing	walking	walking_downstairs	walking_upstairs
-0.601001	-0.6028592	-0.613906	-0.2344983	-0.1476207	-0.2904126

Factor

act_lvls <- c("walking","laying","sitting","standing","walking_downstairs",
              "walking_upstairs")
ac <- factor(extracteddata, levels = act_lvls)
sort(ac)

factor()
6 Levels: walking laying sitting standing ... walking_upstairs