Extract Column


The code for bringing the data over is the same used in the other documents in this section, there is no need to clutter the page.

Search for Conditional Columns


Case Study

  • Very often we’ll deal with large datasets where you need to search through variables/columns and extract them so you can run your analysis on a smaller/more relevant subset of the data
  • mergedfile has 563 columns and we don’t know how many of those columns actually reference the desired words we are searching for
  • so we use grep or grepl to isolate those column names that meat the criteria of measuring the mean or std

grep

Loop through the entire list of column names to IDENTIFY the columns that involve the mean() or the std()

grepped <- grep("subject|activity|-[mM]ean()|-[sS]td()",names(mergedfile))
grepped
 [1]   1   2   3   4   5   6   7   8  43  44  45  46  47  48  83  84  85  86  87
[20]  88 123 124 125 126 127 128 163 164 165 166 167 168 203 204 216 217 229 230
[39] 242 243 255 256 268 269 270 271 272 273 296 297 298 347 348 349 350 351 352
[58] 375 376 377 426 427 428 429 430 431 454 455 456 505 506 515 518 519 528 531
[77] 532 541 544 545 554
  • reminder: grep will take as input a string to search for

    "subject|activity|-[mM]ean()|-[sS]td()"
  • it will look through the specified variable

    names(mergedfile)
  • will return a vector of all the instances (columns) where any part (in this case, since it’s a string with | conditions) of the string is present 81 columns

length(grepped)
[1] 81

grepl

  • similar to grep but it’ll return a list of TRUE | FALSE for each row in the dataset
  • if you look closely at the output TRUE rows match with those of grep (as they should)
  • as opposed to grep which returns 81 columns, grepled is that same length as the original dataset 563 with 482 FALSE columns and 81 TRUE columns

Loop through the entire list of column names to see which columns meet the desired crieteria and return a list of T/F for each of the 563 columns

grepled <- grepl("subject|activity|-[mM]ean()|-[sS]td()",names(mergedfile))
dim(grepled)
NULL
table(grepled)
grepled
FALSE  TRUE 
  482    81 

Extract Matched Columns


Case Study

  • In the section above we identified the columns that measure the mean | std and meet our criteria
  • We’ll extract/subset those columns out to create the desired dataset
  • Save the subset in extracteddata

filter & grepl

  • as you see above grepl gives us a list of T or F for all the rows in the dataset
  • what if we filter the dataset to only the TRUE columns?
  • would that be in effect the same as using grep in the previous paragraph?
  • Let’s see
  • What do you know, we get grepl_subset all 10299 X 81 to match the number of columns from grep alone, and yet we subset it directly using grepl

Filter the dataset using df[x,y] and the grepl list to show the matched columns only

grepl_subset <- 
        mergedfile[,grepl("subject|activity|-[mM]ean()|-[sS]td()",names(mergedfile))]
head(grepl_subset[, c(1:5)],5)   # for simplicity can use [ , 1:5] 
  subject activity tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1       2        5         0.2571778       -0.02328523       -0.01465376
2       2        5         0.2860267       -0.01316336       -0.11908252
3       2        5         0.2754848       -0.02605042       -0.11815167
4       2        5         0.2702982       -0.03261387       -0.11752018
5       2        5         0.2748330       -0.02784779       -0.12952716

select & grep

  • We’ll start by looking at specific columns that mention a certain variable in their name(s)
  • At this time we are only interested in certain columns that involve the mean() or std() and have either in their colnames
  • I’ll use grep() to identify the columns that meet the criteria of either subject OR activity OR -Mean() OR -std()
  • | is the symbol for OR
  • remember [sS] means either s or capped S
  • we’ll let grep search through the names() of the data: mergedfile and return a list of all rows where the string is found as you see aboveCol
  • grep information can be found at Clean - Edit page
  • use select to extract the desired columns and assign it to extracteddata

Select the columns that meet the criteria using select & grep and save in extracteddata

library(dplyr)

extracteddata <- mergedfile |> 
                select(grep("subject|activity|-[mM]ean()|-[sS]td()",names(mergedfile)))
dim(extracteddata)
[1] 10299    81