Assign Colnames

At times the colnames are either

not included in the dataset but were in another file/table: features.txt and are therefore imported over
made up and need to be inserted into a df locally
are one of the rows in the dataset

External - Fitbit

Case Study

mergedfile is a df of test and train data merged together from two separate files
both files prior to being merged had headers were not descriptive of the variables
mergedfile does therefore have original colnames that need to be reassigned new ones provided in
features.txt contains 561 rows of colnames that correspond to 561 of the 563 columns of mergedfile

Data

Data is brought in from How to Merge

More on this example can be found in Case Study: Samsung Fitbit - Tidying.

library(gt)
library(dplyr)
library(tidyverse)

Import File

The code below will perform the following:

read features.txt directly into features a table (since we know it’s only 561 rows it won’t affect our RAM capacity)

features <- read.table("D:/~/features.txt")

we know the first two columns of mergedfile will be assigned subject & activity accordingly
the remaining 561 columns will be assigned from the second column of features

gt(features[1:5, ]) |> opt_table_outline()

V1	V2
1	tBodyAcc-mean()-X
2	tBodyAcc-mean()-Y
3	tBodyAcc-mean()-Z
4	tBodyAcc-std()-X
5	tBodyAcc-std()-Y

Assign Colnames

we create a vector cnames of the column names
assign the vector to the column names of the mergedfile
what’s important is the number of objects in cnames should match the number of columns in mergedfile

cnames <- c("subject","activity",features)
colnames(mergedfile) <- cnames

confirm the code by looking at the first 3 columns of mergedfile

dim(mergedfile)

[1] 10299   563

gt(mergedfile[1:5,1:3]) |> opt_table_outline()

subject	activity	1:561
2	5	0.2571778
2	5	0.2860267
2	5	0.2754848
2	5	0.2702982
2	5	0.2748330

External PM2.5

Case Study

We were given an index of the columns we will use in wcol
We were given a complete list of column names in cnames
So we need to extract the names from cnames corresponding to the wcol indices and assign to the column names of pm0

Data

We’ll be using another version of the EPA PM2.5 data is brought over from Case Study - EPA - EDA 3. This version has been modified and saved locally in several csv files.

pm0 contains data from 1999
pm1 contains data from2012
cnames.txt is a variable containing list of characters for column names
wcol_df is a df holds the indices of the 5 columns we
site0 and site1 are for sensors in the NY state area for 1999 and 2012 respectively described as County.Code and Site.ID concatenated together with “.” as seperator

preview

Here is what the data looks like
As you can see the headers are not descriptive and
We only have 5 columns

library(gt)
gt(pm0[1:5,]) |> opt_table_outline()

V1	V2	V3	V4	V5
1	27	1	19990103	NA
1	27	1	19990106	NA
1	27	1	19990109	NA
1	27	1	19990112	8.841
1	27	1	19990115	14.920

Here is the complete list of column names we’ll choose from
It’s a list of 1: chr
It appears that we need to clean it up before use

gt_preview(cnames)

	data
1	# RD\|Action Code\|State Code\|County Code\|Site ID\|Parameter\|POC\|Sample Duration\|Unit\|Method\|Date\|Start Time\|Sample Value\|Null Data Code\|Sampling Frequency\|Monitor Protocol (MP) ID\|Qualifier - 1\|Qualifier - 2\|Qualifier - 3\|Qualifier - 4\|Qualifier - 5\|Qualifier - 6\|Qualifier - 7\|Qualifier - 8\|Qualifier - 9\|Qualifier - 10\|Alternate Method Detectable Limit\|Uncertainty

Here is the index list of the column names we’ll use to assign to our data
We have the 5 indexes to match our 5 columns
We have to extract these 5 colnames from cnames and assign them to our data

gt(wcol_df)|> opt_table_outline()

wcol
3
4
5
11
13

strsplit

cnames is a list of colnames seperated by | so

let’s split the vector by | and get the column names only. With fixed=TRUE we are asking to match | exactly.

cnames_stripped <- strsplit(cnames, "|", fixed=TRUE)

gt_preview(cnames_stripped[[1]][1:5])

	data
1	# RD
2	Action Code
3	State Code
4	County Code
5	Site ID

Manual Assignment

convert df col to vector

as it stands we have wcol_df a dataframe

make a vector out of the first column

wcol_vector <- wcol_df[[1]]
gt(wcol_df)|> opt_table_outline()

wcol
3
4
5
11
13

subset stripped list

we already have a total stripped list from above cnames_stripped
we can subset/filter out the ones we need by simply

subset the total list using the vector of indices wcol_vector

# could use one line of code  cnames_stripped[[1]][wcol_df[[1]]]
cnames_stripped[[1]][wcol_vector]

[1] "State Code"   "County Code"  "Site ID"      "Date"         "Sample Value"

assign to data

# let's first create a copy of pm0
assigned_pm0 <- pm0
names(assigned_pm0) <- cnames_stripped[[1]][wcol_df[[1]]]
gt(assigned_pm0[1:5,])|> opt_table_outline()

State Code	County Code	Site ID	Date	Sample Value
1	27	1	19990103	NA
1	27	1	19990106	NA
1	27	1	19990109	NA
1	27	1	19990112	8.841
1	27	1	19990115	14.920

Make.names

Syntax

make.names(names, unique = FALSE, allow_ = TRUE)

`names`	character vector to be coerced to syntactically valid names. This is coerced to character if necessary.
`unique`	logical; if `TRUE`, the resulting elements are unique. This may be desired for, e.g., column names.
`allow_`	logical. For compatibility with R prior to 1.9.0.

Another option is to use make.names which makes syntactically valid names out of character vectors
Most of the work has been done in the Manual Assignment section
All we do is insert make.names right before cnames_stripped

# let's first create a copy of pm0
named_pm0 <- pm0
names(named_pm0) <- make.names(cnames_stripped[[1]][wcol_df[[1]]])
gt(named_pm0[1:5,])|> opt_table_outline()

State.Code	County.Code	Site.ID	Date	Sample.Value
1	27	1	19990103	NA
1	27	1	19990106	NA
1	27	1	19990109	NA
1	27	1	19990112	8.841
1	27	1	19990115	14.920

Do you notice the difference between assigned_pm0 and named_pm0?

The latter was created with make.names which creates Systactically Valid Names (meaning: A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.)

Create Vector

As part of the code chunk above you notice a vector cnames that we created which included an external table features
So if we didn’t have an external file we just create the vector as we did here and proceed with assigning the vector to the header
As noted above, length of cnames should match the column count in df

cnames <- c("subject", "activity", "santa")
colnames(df) <- cnames

Within Dataset

Header

At times the header in a csv file is your row that contains the column names. So always read the documentation that accompany the data you are importing to understand the schema and the metadata of what you are working with.
Here we know the column names occupy the first row so we instruct our read() function of that fact with header=TRUE

storm_data <- read.csv("D:/Education/R/Data/JH_C5_week2/
                       repdata_data_StormData.csv.bz2", header = TRUE)