library(gt)
library(dplyr)
Convert Column
data
Conversion Key
- There are times when observations are obscure and make no logical sense but are easier to collect that way, so we go ahead and collect them in the most efficient way and when it’s time to process we can always convert the data to something more understandable
- Often it is more cost effective to follow this process than spend the resources designing mechanisms to alter the collection methods
- Let’s look at extracteddata
1:3] |>
extracteddata[ ,gt_preview(top_n = 5)
subject | activity | tBodyAcc-mean()-X | |
---|---|---|---|
1 | 2 | 5 | 0.2571778 |
2 | 2 | 5 | 0.2860267 |
3 | 2 | 5 | 0.2754848 |
4 | 2 | 5 | 0.2702982 |
5 | 2 | 5 | 0.2748330 |
6..10298 | |||
10299 | 30 | 2 | 0.3515035 |
#Let's look at the way the activities are displayed in the column before conversion
unique(extracteddata$activity)
[1] 5 4 6 1 3 2
column to be edited
- extracteddata is brought over from how_to_extract page
- Looking at the first 3 columns of extracteddata
- The second column named activity contains numbers from 1-6 which correspond to the keys for each activity
- It would make more sense to replace the integers in column 2 with their respective description “walking, standing….
conversion keys
gt_preview(labelfile)
V1 | V2 | |
---|---|---|
1 | 1 | WALKING |
2 | 2 | WALKING_UPSTAIRS |
3 | 3 | WALKING_DOWNSTAIRS |
4 | 4 | SITTING |
5 | 5 | STANDING |
6 | 6 | LAYING |
- so far we’ve identified activity as the column to edit
- we’ll use the keys found in labelfile to swap out the values in activity
- note that the key values are all capitalized, so we’ll convert to lower case as well
case_match
- case_match is identical to CASE WHEN in SQL and goes back decades as older languages had similar functions
- I’ll break this down into smaller steps:
- we know extracteddata$activity is a number 1:6
- we know labelfile(1,2) = WALKING
- we know labelfile(2,2) = WALKING_UPSTAIRS….
- extracteddata$activity contains values 1:6 so we can match the value in extracteddata$activity (which is 1:6) with the row number of labelfile and use the value column 2 of labelfile to assign to the old value of extracteddata$acitivity
- before we assign the value we convert it to lower case
- here is the code broken down by steps
- we have already seen labelfile and extracteddata$activity above, let’s look at what it would look like if we use extracteddata$activity as the row number of labelfile
- as you see now that new_labelfile has the same number of rows as extracteddata, and the new_labelfile$activity column is not longer numbers from 1:6 but the assigned values from the key df labelfile
<- labelfile[extracteddata$activity, 2]
new_labelfile gt_preview(new_labelfile, top_n = 5)
data | |
---|---|
1 | STANDING |
2 | STANDING |
3 | STANDING |
4 | STANDING |
5 | STANDING |
6..10298 | |
10299 | WALKING_UPSTAIRS |
- all we do now is use case_match to assign the values in new_labelfile column 2 to extracteddata$activity
- after we convert to lower case
- the code below shows the process
$activity <- case_match(
extracteddata$activity,
extracteddata$activity ~ tolower(new_labelfile)) extracteddata
one line of code
#---------- Convert labelfile to lower case and assign to their numeric values above
$activity <- case_match(
extracteddata$activity,
extracteddata$activity ~ tolower(labelfile[extracteddata$activity,2])) extracteddata
confirm results
- Let’s look at the activity labels after the case_match
- And the first 3 columns of extracteddata
unique(extracteddata$activity)
[1] "standing" "sitting" "laying"
[4] "walking" "walking_downstairs" "walking_upstairs"
1:3] |>
extracteddata[ ,gt_preview(top_n = 5)
subject | activity | tBodyAcc-mean()-X | |
---|---|---|---|
1 | 2 | standing | 0.2571778 |
2 | 2 | standing | 0.2860267 |
3 | 2 | standing | 0.2754848 |
4 | 2 | standing | 0.2702982 |
5 | 2 | standing | 0.2748330 |
6..10298 | |||
10299 | 30 | walking_upstairs | 0.3515035 |