The code for bringing the data over is the same as in the other documents in this section, so it is not repeated here.
Search for Conditional Columns
Case Study
Very often we’ll deal with large datasets where we need to search through the variables/columns and extract some of them, so we can run our analysis on a smaller, more relevant subset of the data.
mergedfile has 563 columns, and we don’t know how many of those columns actually reference the words we are searching for,
so we use grep or grepl to isolate the column names that meet the criterion of measuring the mean or std.
grep
Loop through the entire list of column names to IDENTIFY the columns that involve the mean() or the std()
Reminder: grep takes as input a pattern (a regular expression) to search for:
"subject|activity|-[mM]ean()|-[sS]td()"
It searches through the specified character vector:
names(mergedfile)
By default it returns a vector of the positions of all the elements (here, column names) that match any of the alternatives separated by | in the pattern. In our case that is 81 columns:
length(grepped)
[1] 81
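To make the behaviour of grep concrete, here is a minimal sketch on a toy vector of column names. The names in col_names are made up for illustration and only imitate the style of names(mergedfile). The sketch also shows one detail of the pattern: in a regular expression, () is an empty group rather than a literal pair of parentheses, so -[mM]ean() also matches names such as -meanFreq(). Whether those columns belong in the analysis is a judgment call; the stricter pattern below is just one possible alternative.

```r
# Hypothetical column names, standing in for names(mergedfile)
col_names <- c("subject", "activity",
               "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
               "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X",
               "tBodyAcc-max()-X")

# Same pattern as in the text
pattern <- "subject|activity|-[mM]ean()|-[sS]td()"

# grep returns the integer positions of the matching names
grepped <- grep(pattern, col_names)
grepped          # 1 2 3 4 6  (position 6 is the meanFreq() column)
length(grepped)  # 5

# value = TRUE returns the matching names instead of their positions
grep(pattern, col_names, value = TRUE)

# Assumption: if only literal "mean()" and "std()" are wanted,
# the parentheses can be escaped so they are matched literally
strict_pattern <- "subject|activity|-[mM]ean\\(\\)|-[sS]td\\(\\)"
strict <- grep(strict_pattern, col_names)
strict           # 1 2 3 4
```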
grepl
Similar to grep, but it returns a logical vector with one TRUE or FALSE for each column name it was given (not for each row of the dataset).
If you look closely at the output, the TRUE positions match the positions returned by grep (as they should).
As opposed to grep, which returns 81 positions, grepled has the same length as the number of columns in the original dataset: 563, with 482 FALSE and 81 TRUE.
Loop through the entire list of column names to see which columns meet the desired criteria and return a vector of T/F for each of the 563 columns
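A small sketch of the same idea on toy data may help. The names in col_names are hypothetical, chosen only to resemble the real column names; the variable name grepled follows the text.

```r
# Hypothetical column names, standing in for names(mergedfile)
col_names <- c("subject", "activity",
               "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
               "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X",
               "tBodyAcc-max()-X")
pattern <- "subject|activity|-[mM]ean()|-[sS]td()"

# grepl returns one TRUE/FALSE per element of col_names
grepled <- grepl(pattern, col_names)
grepled

length(grepled)  # 7, same as length(col_names)
sum(grepled)     # 5, the number of TRUE values (matches)

# The positions of the TRUE values are exactly what grep returns
same_positions <- identical(which(grepled), grep(pattern, col_names))
same_positions   # TRUE
```

On the real data the same checks would be length(grepled) for the 563 columns and sum(grepled) for the 81 matches.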
In the section above we identified the columns that measure the mean | std and meet our criteria
We’ll extract/subset those columns out to create the desired dataset
Save the subset in extracteddata
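As a rough sketch of that extraction step, the positions returned by grep can be used directly as column indices. The data frame mergedfile_toy and its values are invented for illustration; only the names extracteddata and grepped come from the text.

```r
# Toy stand-in for mergedfile; check.names = FALSE keeps the
# parentheses and dashes in the column names unchanged
mergedfile_toy <- data.frame(
  subject             = c(1, 1, 2),
  activity            = c("WALKING", "SITTING", "WALKING"),
  `tBodyAcc-mean()-X` = c(0.28, 0.27, 0.29),
  `tBodyAcc-std()-X`  = c(-0.99, -0.98, -0.97),
  `tBodyAcc-max()-X`  = c(-0.93, -0.94, -0.92),
  check.names = FALSE
)

pattern <- "subject|activity|-[mM]ean()|-[sS]td()"

# Positions of the matching columns
grepped <- grep(pattern, names(mergedfile_toy))

# Keep all rows, and only the matching columns
extracteddata <- mergedfile_toy[, grepped]

dim(extracteddata)    # 3 rows, 4 columns
names(extracteddata)  # the max() column is gone
```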
filter & grepl
As you see above, grepl gives us a vector of T or F for all the columns in the dataset.
what if we filter the dataset to only the TRUE columns?
would that be in effect the same as using grep in the previous paragraph?
Let’s see
What do you know: grepl_subset is 10299 x 81, the same number of columns we got from grep alone, even though we subset directly using grepl.
Filter the dataset using df[x,y] and the grepl list to show the matched columns only
grepl_subset <- mergedfile[, grepl("subject|activity|-[mM]ean()|-[sS]td()", names(mergedfile))]
head(grepl_subset[, c(1:5)], 5) # for simplicity can use [ , 1:5]
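To check the claim that both routes give the same result, the two subsets can be compared with identical(). This is only a sketch on a made-up data frame (mergedfile_toy and its values are hypothetical), not the real mergedfile.

```r
# Toy stand-in for mergedfile; values are invented for illustration
mergedfile_toy <- data.frame(
  subject             = c(1, 1, 2),
  activity            = c("WALKING", "SITTING", "WALKING"),
  `tBodyAcc-mean()-X` = c(0.28, 0.27, 0.29),
  `tBodyAcc-std()-X`  = c(-0.99, -0.98, -0.97),
  `tBodyAcc-max()-X`  = c(-0.93, -0.94, -0.92),
  check.names = FALSE
)

pattern <- "subject|activity|-[mM]ean()|-[sS]td()"

# Route 1: integer positions from grep
grep_subset <- mergedfile_toy[, grep(pattern, names(mergedfile_toy))]

# Route 2: logical vector from grepl
grepl_subset <- mergedfile_toy[, grepl(pattern, names(mergedfile_toy))]

# Both routes select the same columns in the same order
both_equal <- identical(grep_subset, grepl_subset)
both_equal           # TRUE
dim(grepl_subset)    # 3 rows, 4 columns
```

Either form works for column subsetting with df[x, y]; the logical version is often convenient when the condition needs to be combined with other TRUE/FALSE tests using & or |.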