Fitbit - Reproducible

Data

Packages

library(tidyverse)
library(dplyr)
library(lubridate)

Unzip folder

unzip("D:/Education/R/Data/JH_C5_week2/repdata_data_activity.zip",
      exdir= "D:/Education/R/Data/JH_C5_week2/reproducable_walking_data")

Read file

activity <- read.csv("D:/Education/R/Data/JH_C5_week2/reproducable_walking_data/activity.csv")

Convert date

First we’ll look at the type of data in each column. The date column is stored as character (“chr”), so we need to convert it
Convert date from “chr” to a date type

library(lubridate)
str(activity)

'data.frame':   17568 obs. of  3 variables:
 $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
 $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

activity$date <- ymd(activity$date)

Create new columns

wday

Creating a new column showing the day of the week would help us in the next step

weekend | weekday

Create a new column separating weekdays from weekend days. With week_start = 1, wday() numbers Monday as 1 and Sunday as 7, so values above 5 are weekend days
This will be helpful if we ever need to compare activity levels on weekdays vs weekends, or if we wish to explore a relationship between day of week and activity level

library(dplyr)
activity <- activity |> 
        mutate(day_of_week = wday(date, week_start = 1))
# create a weekend variable named weekend
activity <- activity |> 
        mutate(weekend = case_when(day_of_week >5 ~ "weekend",
                                        TRUE ~ "weekday"))
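As a quick sanity check of the weekday numbering, here is a minimal base R sketch on a few toy dates. The dates and the use of format(…, "%u") are my own illustration, not part of the original analysis; "%u" yields the ISO weekday number (1 = Monday … 7 = Sunday), which matches the numbering used by wday(date, week_start = 1).

```r
# Toy dates (hypothetical): Friday 2012-10-05 through Monday 2012-10-08
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# ISO weekday number: 1 = Monday ... 7 = Sunday
day_of_week <- as.integer(format(dates, "%u"))

# Values above 5 are Saturday (6) and Sunday (7)
weekend <- ifelse(day_of_week > 5, "weekend", "weekday")

data.frame(dates, day_of_week, weekend)
```

Friday and Monday should come out as "weekday", Saturday and Sunday as "weekend".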

CS 1 - Steps per day

The question is a bit vague: do we mean the total number of steps for each individual day, or the total number of steps across the entire dataset?

I’ll do both, because I’m curious regardless of the intent of the question.

Let’s group the data by date, calculate the total number of steps per day and assign the result to mean_perday
Despite its name, mean_perday holds totals: it has two columns, date and sum, where sum is the total steps taken that day. Since sum() is called without na.rm = TRUE, any day with missing values gets a total of NA

mean_perday <- activity |> 
        group_by(date) |>
        summarise(sum = sum(steps))

Total steps per day

Now that we have a column of daily totals, we can add them together to get the total number of steps for the entire dataset and assign it to total_steps, which comes to 570608

total_steps <- sum(mean_perday$sum, na.rm = TRUE)
total_steps

[1] 570608

Mean & median

To calculate the mean and median of the total number of steps taken per day we use summary(), which gives us both
We get a mean of 10766 and a median of 10765. The 8 days whose total is NA are left out of both figures

summary(mean_perday$sum)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     41    8841   10765   10766   13294   21194       8

Histogram

Plot a histogram for the distribution of total number of steps per day
The magenta line is the mean of the distribution

with(mean_perday, hist(sum, col="green", breaks = 15,
                       main="Total Number of Steps per Day"))
abline(v= mean(mean_perday$sum, na.rm=TRUE), col="magenta",lwd=4)

tapply - method 2

#_____________________USING TAPPLY
total.steps <- tapply(activity$steps, activity$date, FUN=sum, na.rm=TRUE)
mean(total.steps,na.rm=TRUE)   #9354.23

[1] 9354.23

median(total.steps,na.rm = TRUE) #10395

[1] 10395

hist(total.steps, col="green", breaks = 15, main="Total Number of Steps per Day")
abline(v= mean(total.steps,na.rm=TRUE), col="magenta",lwd=4)
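The tapply version reports a lower mean (9354.23) than summary() did (10766). A likely explanation is the na.rm = TRUE passed to sum(): a day whose values are all NA then totals 0 instead of NA, so it is counted as a real day. Indeed 570608 / 61 is about 9354.23, while 570608 / 53 is about 10766.19. The toy data below is hypothetical and only sketches the effect.

```r
# Toy data (hypothetical): day "a" is entirely missing, day "b" is complete
toy <- data.frame(date  = c("a", "a", "b", "b"),
                  steps = c(NA, NA, 10, 20))

# Without na.rm an all-NA day stays NA and can be dropped from the mean later
totals_na <- tapply(toy$steps, toy$date, sum)

# With na.rm = TRUE the all-NA day becomes 0 and counts as a real day
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean(totals_na, na.rm = TRUE)  # averages over day "b" only
mean(totals_zero)              # averages over both days, including the 0
```

Which figure is the right one depends on whether a day with no recordings should count as zero steps or as unknown.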

CS 2 - Daily activity

The activity dataset from above shows that each day is divided into 288 five-minute intervals. To look for patterns in daily activity, it makes sense to compare the intervals to one another.
To do that we’ll calculate the average steps taken for each of the 288 intervals across the entire dataset, using aggregate()
We’ll save the results in activity_perinterval

activity_perinterval <- activity |> 
        aggregate(steps~interval, mean)
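Note that no na.rm argument was needed above. With the formula interface, aggregate() uses na.action = na.omit by default, so rows with a missing steps value are dropped before the mean is taken. A minimal sketch on hypothetical toy data:

```r
# Toy data (hypothetical): one missing value in interval 0
toy <- data.frame(interval = c(0, 0, 5, 5),
                  steps    = c(NA, 4, 2, 6))

# The NA row is omitted by default, so interval 0 averages only the value 4
per_interval <- aggregate(steps ~ interval, data = toy, FUN = mean)
per_interval
```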

Time series plot

Here we’ll plot a time series line graph of the Average Daily Steps per interval to see if any pattern is evident

library(ggplot2)
ggplot(activity_perinterval, aes(interval,steps))+
        geom_line(color="blue") +
        labs(title = "Average Daily Steps per Interval",
             x="Interval", y="Avg Daily Steps")+
        scale_x_continuous(breaks = seq(0,2400, by=240))+
        theme_classic()

Most active interval

From the plot it is evident that one range of intervals, and one interval in particular, is far more active than the rest. To find the exact interval:

We can use one line of code as shown below, but let me explain it first:
Let’s start from inside the []: which.max() scans through the steps column and returns the row index of the maximum average steps
Placed inside [ , ], that index selects the matching row of activity_perinterval
The result shows that row 104 is interval 835, which has the maximum average of about 206 steps per 5-minute interval

The intervals are setup this way:

The first interval is 0, the start of the hour, and the counter increases by 5 every 5 minutes

Once we reach 55, the next step is the start of the next hour, so the interval counter becomes 100, NOT 60.

From there the counter advances by 100 every hour: 100, 200, 300, …

So interval 835 is 8:35 AM

activity_perinterval[which.max(activity_perinterval$steps),]

    interval    steps
104      835 206.1698
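The interval encoding described above can be decoded with integer division and remainder. The helper below, interval_to_clock(), is a hypothetical name of my own and not part of the original analysis; it assumes the interval is always hours * 100 + minutes.

```r
# Hypothetical helper: turn an interval label such as 835 into "08:35"
interval_to_clock <- function(interval) {
  hours   <- interval %/% 100  # leading digits are the hour
  minutes <- interval %% 100   # last two digits are the minutes
  sprintf("%02d:%02d", as.integer(hours), as.integer(minutes))
}

interval_to_clock(c(0, 55, 100, 835, 2355))
```

Applied to the most active interval, 835, this gives "08:35".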

CS 3 - Coalesce NAs

Count of NAs

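First let’s count how many NAs are in the dataset. Total NAs = 2304
Then we find out what share of all the cells in the data frame are NA, which turns out to be 2.6%. Note that this is a share of cells across all five columns (17568 × 5 = 87840), not of rows: as a share of the 17568 rows, 2304 missing step counts is about 13.1%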

sum(is.na(activity))

[1] 2304

mean(is.na(activity))

[1] 0.02622951

Table option

nas <- is.na(activity)
table(nas)

nas
FALSE  TRUE 
85536  2304
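Since is.na() on a data frame returns one logical value per cell, its mean is a per-cell share. A minimal sketch on hypothetical toy data shows how that differs from the per-row share for a single column:

```r
# Toy data (hypothetical): 4 rows, 3 columns, 2 missing step counts
toy <- data.frame(steps    = c(NA, 3, NA, 7),
                  date     = rep("2012-10-01", 4),
                  interval = c(0, 5, 10, 15))

mean(is.na(toy))        # share of all 12 cells that are NA
mean(is.na(toy$steps))  # share of the 4 rows with a missing step count
colSums(is.na(toy))     # NA count per column, to see where they sit
```

On the real data, mean(is.na(activity$steps)) would give the per-row share directly.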

Replace NAs

Mean of intervals

The plan is to replace each missing value (NA) with the average of the interval it belongs to.
To do that, we need the mean of every interval across all the days in the dataset
We can get it by using group_by() and mutate() together to create a new column, interval_mean, holding the mean of each row’s interval
The result is saved in activity_mean

activity_mean <- activity |> 
        group_by(interval) |> 
        mutate(interval_mean =  mean(steps, na.rm = TRUE))

Coalesce

We go through every row in the df and replace every steps value that is NA with the mean for that interval, taken from the new column interval_mean
We save the new step values in the column nona_steps

activity_mean$nona_steps <- coalesce(activity_mean$steps, activity_mean$interval_mean)
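For readers without dplyr, the same imputation can be sketched in base R. This is an alternative illustration on hypothetical toy data, not the method used above: ave() stands in for group_by() plus mutate(), and ifelse() stands in for coalesce().

```r
# Toy data (hypothetical): two intervals observed on three days
toy <- data.frame(interval = c(0, 5, 0, 5, 0, 5),
                  steps    = c(NA, 4, 2, NA, 6, 8))

# ave() returns the per-interval mean aligned to every row
toy$interval_mean <- ave(toy$steps, toy$interval,
                         FUN = function(x) mean(x, na.rm = TRUE))

# Keep the observed value where present, otherwise use the interval mean
toy$nona_steps <- ifelse(is.na(toy$steps), toy$interval_mean, toy$steps)
toy
```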

Total steps per day

Just as we did earlier in CS 1, we calculate the total steps per day, now that all the NAs have been replaced with the average for their respective interval.
The code is the same as before, except that it sums nona_steps and stores the result in the new df nona_mean_perday
Total steps for all days: 656737.5, compared to 570608 before replacing the NAs

nona_mean_perday <- activity_mean |> 
        group_by(date) |>
        summarise(sum = sum(nona_steps))
nona_total_steps <- sum(nona_mean_perday$sum)
nona_total_steps

[1] 656737.5

Mean & median

To calculate the mean and median of the total number of steps taken per day we again use summary()
We get a mean of 10766 and a median of 10766, compared to a mean of 10766 and a median of 10765 before
We shouldn’t expect any major change, since we replaced the NAs with the mean of each interval

summary(nona_mean_perday$sum)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     41    9819   10766   10766   12811   21194
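The unchanged mean can be checked with plain arithmetic on the numbers reported above. The 2304 NAs are exactly 8 × 288, matching the 8 NA daily totals, so 8 whole days were missing. Each of those days was filled with the interval means, whose sum equals the observed mean daily total. The sketch below only reuses figures already printed in this report; the variable names are my own.

```r
total_observed <- 570608                      # total steps before imputation
days_total     <- 17568 / 288                 # rows / intervals per day
days_missing   <- 2304 / 288                  # NAs / intervals per day
days_observed  <- days_total - days_missing

# Mean daily total over the days that have data
mean_observed <- total_observed / days_observed

# Every missing day receives that same total after imputation
total_imputed <- total_observed + days_missing * mean_observed

total_imputed               # matches nona_total_steps
total_imputed / days_total  # mean after imputation equals mean_observed
```

With 8 of the 61 days now sharing the same total, it is not surprising that the median lands on that value too.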

Histogram

Plot a histogram for the distribution of total number of steps per day
The magenta line is the mean of the distribution

with(nona_mean_perday, hist(sum, col="green", 
                        breaks = 15, main="Total Number of Steps per Day"))
abline(v= mean(nona_mean_perday$sum, na.rm=TRUE), col="magenta",lwd=4)

Observation

It might not be glaring, but the distribution is now more concentrated around the mean. The tallest bar has a considerably higher frequency count than before, which is again due to replacing all the NAs with interval means.

CS 4 - Weekdays vs Weekends

When we got started, I created a column named weekend which separates the data into weekdays and weekend days
That makes this comparison fairly simple: all we have to do is aggregate the steps across 2 groups, interval and weekend
Just make sure to use the dataset that includes all the replaced NAs, activity_mean
Also make sure to use the column that reflects the replaced NAs, which is nona_steps
Load the lattice library, since I’ll use lattice for this last plot

library(lattice)
interval_weekday <- activity_mean |> 
        aggregate(nona_steps~interval + weekend, mean)

Lattice

xyplot(nona_steps~interval | weekend, interval_weekday, type="l",
       col="blue", xlab = "Interval", ylab = "Average Steps per Interval",
       main="Weekend to Weekday Comparison", layout=c(1,2))

facet_grid

Another option is to use ggplot:

facet_grid(weekend~.) stacks the panels one over the other (2 rows, 1 column)
facet_grid(~weekend) places the panels next to each other

ggplot(interval_weekday, aes(interval,nona_steps))+
        geom_line(color="blue") +
        labs(title = "Average Daily Steps per Interval", 
             x="Interval", y="Avg Daily Steps")+
        scale_x_continuous(breaks = seq(0,2400, by=240))+
        facet_grid(weekend~.)+  
        theme_classic()

Observation

The visual shows a clear distinction between weekday and weekend activity:

On weekends the activity level is spread more evenly throughout the day, with a lower maximum.
On weekdays the activity is concentrated at a certain time of day, with a significant spike.
The spike around 8:35 am (the most active interval found earlier) might suggest working users who are active before the workday starts, but then again we have no idea what time zone this data reflects.
Depending on the business case, there are several possibilities to pursue from this limited data.