library(gt)
library(dplyr)
library(tidyverse)
Assign Colnames
At times the colnames are either
- not included in the dataset but were in another file/table: features.txt and are therefore imported over
- made up and need to be inserted into a df locally
- are one of the rows in the dataset
External - Fitbit
Case Study
- mergedfile is a df of test and train data merged together from two separate files
- both files prior to being merged had headers were not descriptive of the variables
- mergedfile does therefore have original colnames that need to be reassigned new ones provided in
- features.txt contains 561 rows of colnames that correspond to 561 of the 563 columns of mergedfile
Data
Data is brought in from How to Merge
More on this example can be found in Case Study: Samsung Fitbit - Tidying.
Import File
The code below will perform the following:
- read features.txt directly into features a table (since we know it’s only 561 rows it won’t affect our RAM capacity)
<- read.table("D:/~/features.txt") features
- we know the first two columns of mergedfile will be assigned subject & activity accordingly
- the remaining 561 columns will be assigned from the second column of features
gt(features[1:5, ]) |> opt_table_outline()
V1 | V2 |
---|---|
1 | tBodyAcc-mean()-X |
2 | tBodyAcc-mean()-Y |
3 | tBodyAcc-mean()-Z |
4 | tBodyAcc-std()-X |
5 | tBodyAcc-std()-Y |
Assign Colnames
- we create a vector cnames of the column names
- assign the vector to the column names of the mergedfile
- what’s important is the number of objects in cnames should match the number of columns in mergedfile
<- c("subject","activity",features)
cnames colnames(mergedfile) <- cnames
- confirm the code by looking at the first 3 columns of mergedfile
dim(mergedfile)
[1] 10299 563
gt(mergedfile[1:5,1:3]) |> opt_table_outline()
subject | activity | 1:561 |
---|---|---|
2 | 5 | 0.2571778 |
2 | 5 | 0.2860267 |
2 | 5 | 0.2754848 |
2 | 5 | 0.2702982 |
2 | 5 | 0.2748330 |
External PM2.5
Case Study
- We were given an index of the columns we will use in wcol
- We were given a complete list of column names in cnames
- So we need to extract the names from cnames corresponding to the wcol indices and assign to the column names of pm0
Data
We’ll be using another version of the EPA PM2.5 data is brought over from Case Study - EPA - EDA 3. This version has been modified and saved locally in several csv files.
- pm0 contains data from 1999
- pm1 contains data from2012
- cnames.txt is a variable containing list of characters for column names
- wcol_df is a df holds the indices of the 5 columns we
- site0 and site1 are for sensors in the NY state area for 1999 and 2012 respectively described as County.Code and Site.ID concatenated together with “.” as seperator
preview
- Here is what the data looks like
- As you can see the headers are not descriptive and
- We only have 5 columns
library(gt)
gt(pm0[1:5,]) |> opt_table_outline()
V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|
1 | 27 | 1 | 19990103 | NA |
1 | 27 | 1 | 19990106 | NA |
1 | 27 | 1 | 19990109 | NA |
1 | 27 | 1 | 19990112 | 8.841 |
1 | 27 | 1 | 19990115 | 14.920 |
- Here is the complete list of column names we’ll choose from
- It’s a list of 1: chr
- It appears that we need to clean it up before use
gt_preview(cnames)
data | |
---|---|
1 | # RD|Action Code|State Code|County Code|Site ID|Parameter|POC|Sample Duration|Unit|Method|Date|Start Time|Sample Value|Null Data Code|Sampling Frequency|Monitor Protocol (MP) ID|Qualifier - 1|Qualifier - 2|Qualifier - 3|Qualifier - 4|Qualifier - 5|Qualifier - 6|Qualifier - 7|Qualifier - 8|Qualifier - 9|Qualifier - 10|Alternate Method Detectable Limit|Uncertainty |
- Here is the index list of the column names we’ll use to assign to our data
- We have the 5 indexes to match our 5 columns
- We have to extract these 5 colnames from cnames and assign them to our data
gt(wcol_df)|> opt_table_outline()
wcol |
---|
3 |
4 |
5 |
11 |
13 |
strsplit
- cnames is a list of colnames seperated by | so
let’s split the vector by | and get the column names only. With
fixed=TRUE
we are asking to match | exactly.
<- strsplit(cnames, "|", fixed=TRUE) cnames_stripped
gt_preview(cnames_stripped[[1]][1:5])
data | |
---|---|
1 | # RD |
2 | Action Code |
3 | State Code |
4 | County Code |
5 | Site ID |
Manual Assignment
convert df col to vector
- as it stands we have wcol_df a dataframe
make a vector out of the first column
<- wcol_df[[1]]
wcol_vector gt(wcol_df)|> opt_table_outline()
wcol |
---|
3 |
4 |
5 |
11 |
13 |
subset stripped list
- we already have a total stripped list from above cnames_stripped
- we can subset/filter out the ones we need by simply
subset the total list using the vector of indices wcol_vector
# could use one line of code cnames_stripped[[1]][wcol_df[[1]]]
1]][wcol_vector] cnames_stripped[[
[1] "State Code" "County Code" "Site ID" "Date" "Sample Value"
assign to data
# let's first create a copy of pm0
<- pm0
assigned_pm0 names(assigned_pm0) <- cnames_stripped[[1]][wcol_df[[1]]]
gt(assigned_pm0[1:5,])|> opt_table_outline()
State Code | County Code | Site ID | Date | Sample Value |
---|---|---|---|---|
1 | 27 | 1 | 19990103 | NA |
1 | 27 | 1 | 19990106 | NA |
1 | 27 | 1 | 19990109 | NA |
1 | 27 | 1 | 19990112 | 8.841 |
1 | 27 | 1 | 19990115 | 14.920 |
Make.names
Syntax
make.names(names, unique = FALSE, allow_ = TRUE)
names |
character vector to be coerced to syntactically valid names. This is coerced to character if necessary. |
unique |
logical; if TRUE , the resulting elements are unique. This may be desired for, e.g., column names. |
allow_ |
logical. For compatibility with R prior to 1.9.0. |
- Another option is to use make.names which makes syntactically valid names out of character vectors
- Most of the work has been done in the Manual Assignment section
- All we do is insert make.names right before cnames_stripped
# let's first create a copy of pm0
<- pm0
named_pm0 names(named_pm0) <- make.names(cnames_stripped[[1]][wcol_df[[1]]])
gt(named_pm0[1:5,])|> opt_table_outline()
State.Code | County.Code | Site.ID | Date | Sample.Value |
---|---|---|---|---|
1 | 27 | 1 | 19990103 | NA |
1 | 27 | 1 | 19990106 | NA |
1 | 27 | 1 | 19990109 | NA |
1 | 27 | 1 | 19990112 | 8.841 |
1 | 27 | 1 | 19990115 | 14.920 |
Do you notice the difference between assigned_pm0 and named_pm0?
The latter was created with make.names which creates Systactically Valid Names (meaning: A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.)
Create Vector
- As part of the code chunk above you notice a vector cnames that we created which included an external table features
- So if we didn’t have an external file we just create the vector as we did here and proceed with assigning the vector to the header
- As noted above, length of cnames should match the column count in df
<- c("subject", "activity", "santa")
cnames colnames(df) <- cnames
Within Dataset
Header
- At times the header in a csv file is your row that contains the column names. So always read the documentation that accompany the data you are importing to understand the schema and the metadata of what you are working with.
- Here we know the column names occupy the first row so we instruct our read() function of that fact with header=TRUE
<- read.csv("D:/Education/R/Data/JH_C5_week2/
storm_data repdata_data_StormData.csv.bz2", header = TRUE)