Assign Colnames


At times the colnames are either

  • not included in the dataset but were in another file/table: features.txt and are therefore imported over
  • made up and need to be inserted into a df locally
  • are one of the rows in the dataset

External - Fitbit


Case Study

  • mergedfile is a df of test and train data merged together from two separate files
  • both files prior to being merged had headers were not descriptive of the variables
  • mergedfile does therefore have original colnames that need to be reassigned new ones provided in
  • features.txt contains 561 rows of colnames that correspond to 561 of the 563 columns of mergedfile

Data

Data is brought in from How to Merge

More on this example can be found in Case Study: Samsung Fitbit - Tidying.

library(gt)
library(dplyr)
library(tidyverse)

Import File

The code below will perform the following:

  • read features.txt directly into features a table (since we know it’s only 561 rows it won’t affect our RAM capacity)
features <- read.table("D:/~/features.txt")
  • we know the first two columns of mergedfile will be assigned subject & activity accordingly
  • the remaining 561 columns will be assigned from the second column of features
gt(features[1:5, ]) |> opt_table_outline()
V1 V2
1 tBodyAcc-mean()-X
2 tBodyAcc-mean()-Y
3 tBodyAcc-mean()-Z
4 tBodyAcc-std()-X
5 tBodyAcc-std()-Y

Assign Colnames

  • we create a vector cnames of the column names
  • assign the vector to the column names of the mergedfile
  • what’s important is the number of objects in cnames should match the number of columns in mergedfile
cnames <- c("subject","activity",features)
colnames(mergedfile) <- cnames
  • confirm the code by looking at the first 3 columns of mergedfile
dim(mergedfile)
[1] 10299   563
gt(mergedfile[1:5,1:3]) |> opt_table_outline()
subject activity 1:561
2 5 0.2571778
2 5 0.2860267
2 5 0.2754848
2 5 0.2702982
2 5 0.2748330

External PM2.5


Case Study

  • We were given an index of the columns we will use in wcol
  • We were given a complete list of column names in cnames
  • So we need to extract the names from cnames corresponding to the wcol indices and assign to the column names of pm0

Data

We’ll be using another version of the EPA PM2.5 data is brought over from Case Study - EPA - EDA 3. This version has been modified and saved locally in several csv files.

  • pm0 contains data from 1999
  • pm1 contains data from2012
  • cnames.txt is a variable containing list of characters for column names
  • wcol_df is a df holds the indices of the 5 columns we
  • site0 and site1 are for sensors in the NY state area for 1999 and 2012 respectively described as County.Code and Site.ID concatenated together with “.” as seperator

preview

  • Here is what the data looks like
  • As you can see the headers are not descriptive and
  • We only have 5 columns
library(gt)
gt(pm0[1:5,]) |> opt_table_outline()
V1 V2 V3 V4 V5
1 27 1 19990103 NA
1 27 1 19990106 NA
1 27 1 19990109 NA
1 27 1 19990112 8.841
1 27 1 19990115 14.920
  • Here is the complete list of column names we’ll choose from
  • It’s a list of 1: chr
  • It appears that we need to clean it up before use
gt_preview(cnames)
data
1 # RD|Action Code|State Code|County Code|Site ID|Parameter|POC|Sample Duration|Unit|Method|Date|Start Time|Sample Value|Null Data Code|Sampling Frequency|Monitor Protocol (MP) ID|Qualifier - 1|Qualifier - 2|Qualifier - 3|Qualifier - 4|Qualifier - 5|Qualifier - 6|Qualifier - 7|Qualifier - 8|Qualifier - 9|Qualifier - 10|Alternate Method Detectable Limit|Uncertainty
  • Here is the index list of the column names we’ll use to assign to our data
  • We have the 5 indexes to match our 5 columns
  • We have to extract these 5 colnames from cnames and assign them to our data
gt(wcol_df)|> opt_table_outline()
wcol
3
4
5
11
13

strsplit

  • cnames is a list of colnames seperated by | so

let’s split the vector by | and get the column names only. With fixed=TRUE we are asking to match | exactly.

cnames_stripped <- strsplit(cnames, "|", fixed=TRUE)
gt_preview(cnames_stripped[[1]][1:5])
data
1 # RD
2 Action Code
3 State Code
4 County Code
5 Site ID

Manual Assignment

convert df col to vector

  • as it stands we have wcol_df a dataframe

make a vector out of the first column

wcol_vector <- wcol_df[[1]]
gt(wcol_df)|> opt_table_outline()
wcol
3
4
5
11
13

subset stripped list

  • we already have a total stripped list from above cnames_stripped
  • we can subset/filter out the ones we need by simply

subset the total list using the vector of indices wcol_vector

# could use one line of code  cnames_stripped[[1]][wcol_df[[1]]]
cnames_stripped[[1]][wcol_vector]
[1] "State Code"   "County Code"  "Site ID"      "Date"         "Sample Value"

assign to data

# let's first create a copy of pm0
assigned_pm0 <- pm0
names(assigned_pm0) <- cnames_stripped[[1]][wcol_df[[1]]]
gt(assigned_pm0[1:5,])|> opt_table_outline()
State Code County Code Site ID Date Sample Value
1 27 1 19990103 NA
1 27 1 19990106 NA
1 27 1 19990109 NA
1 27 1 19990112 8.841
1 27 1 19990115 14.920

Make.names

Syntax

make.names(names, unique = FALSE, allow_ = TRUE)
names character vector to be coerced to syntactically valid names. This is coerced to character if necessary.
unique logical; if TRUE, the resulting elements are unique. This may be desired for, e.g., column names.
allow_ logical. For compatibility with R prior to 1.9.0.
  • Another option is to use make.names which makes syntactically valid names out of character vectors
  • Most of the work has been done in the Manual Assignment section
  • All we do is insert make.names right before cnames_stripped
# let's first create a copy of pm0
named_pm0 <- pm0
names(named_pm0) <- make.names(cnames_stripped[[1]][wcol_df[[1]]])
gt(named_pm0[1:5,])|> opt_table_outline()
State.Code County.Code Site.ID Date Sample.Value
1 27 1 19990103 NA
1 27 1 19990106 NA
1 27 1 19990109 NA
1 27 1 19990112 8.841
1 27 1 19990115 14.920

Do you notice the difference between assigned_pm0 and named_pm0?

The latter was created with make.names which creates Systactically Valid Names (meaning: A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.)

Create Vector


  • As part of the code chunk above you notice a vector cnames that we created which included an external table features
  • So if we didn’t have an external file we just create the vector as we did here and proceed with assigning the vector to the header
  • As noted above, length of cnames should match the column count in df
cnames <- c("subject", "activity", "santa")
colnames(df) <- cnames

Within Dataset