library(tidyverse)
library(janitor) # for clean_names
#this will be used in reshape section
#install.packages("reshape")
library(reshape2) # for melt
library(data.table) #for dcast
Clean - Edit
We’ll be using the data that comes built-in R
Data
packages
data
To see the list of default data use
data()
To load a specific dataset
data("mtcars")
clean_names
- An R function that makes sure your column names are unique and consistent to avoid any errors during analysis.
- Requires “janitor” package
- Syntax: clean_name(df, …)
- This package follows the principles of the “tidyverse” and in particular works well with the
%>%
pipe function. Janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced users can already do everything covered here, but they can do it faster with janitor and save their thinking for more fun tasks.
<- data.frame(caseID = 1, DOB = 2, Other = 3)
x clean_names(x)
# or pipe in the input data.frame:
|> clean_names()
x
# if you prefer camelCase variable names:
|> clean_names(.,"lower_camel") x
readxl
You can run it directly after you read the data
library(readxl)
read_excel("messy_excel_file.xlsx") |> clean_names()
# Restore column names to Title Case, e.g., for plotting
|> clean_names(case = "title") mtcars
abbreviations
You can tell clean_names to leave certain abbreviations untouched
|> clean_names(case = "upper_camel", abbreviations = c("ID", "DOB")) x
round
- We can round integers in multiple ways one of them is round()
- If we don’t want any decimals set digits = 0
= round(median(ride_length), digits = 0) median
Replace
change values
- What if we want to change values in a vector that meet a certain condition to another value?
Set all elements of this vector that are greater than 10 to be equal 4
<- c(17, 14, 4, 5, 13, 12, 10)
x > 10] <- 4
x[x
#___________________OR
>=11] <- 4 x[x
rename
- Just as it sounds, we can rename a col name with the new name on the left of the = and old name on the right
- Syntax: rename(data, new name= old name)
- If pipe is being used you can omit data as you see below
|>
trimmed_df select(hotel, is_canceled, lead_time) |>
rename(hotel_type = hotel)
head(trimmed_df)
rename multiple
To rename multiple columns at once. Left side of = is the new name and just chain them together
<- rename( trips19,
trips19 ride_id = trip_id,
started_at = start_time,
ended_at = end_time,
rideable_type = bikeid,
start_station_name = from_station_name,
start_station_id = from_station_id,
end_station_name = to_station_name,
end_station_id = to_station_id,
member_casual = usertype)
sub
Let’s say the column names have “reviwer_id”, “time_left” and we want to go through the names, find the pattern and replace it with “” (no space).
sub() fixes the first instance it encounters and stops
- Let’s look at the colnames first
- substitute “” for “_”
names(df)
sub("_", "", names(df))
gsub
Same as sub() but it will replace all the instances instead of stopping after the first.
gsub("_","", names(df))
grep
Very good detailed examples regarding grep vs grepl can be found at Extract Column in the How To section
- Let’s say I’m gonna look at the intersection variable.
- So one thing that I might want to be able to do is find all of the intersections that include the Alameda as one of the roads.
- grep will take as input a search string that you want to look for “Alameda” and
- it will look through this variable and find all of the instances in that vector where the Alameda appears.
grep("Alameda", df$intersection)
1] 4 5 36 [
- You can also use value=TRUE as one of the parameters sent to grep.
- Instead of telling you which of the elements it found that Alameda appears in, it will actually return the values where Alameda appears.
- So you can see these are the three intersections where the Alameda appears in the df.
grep("Alameda", df$intersection, value=TRUE)
1] "The Alameda & 33rd st" "E 33rd & The Alameda" "Harford...." [
- The other thing that you can do is you can look for values that don’t appear.
- So say, for example, JeffStreet does not appear in the intersection data, and so
- it will return integer (0) when that value does not appear.
grep("Jeffers", df$intersection)
integer(0)
length(grep("Jeffers", df$intersection))
1] 0 [
grepl
- Will look for Alameda
- It will look for it in this intersection variable
- And it will return a LIST that’s true whenever Alameda appears and false whenever Alameda doesn’t appear.
And so in this case you can see that in that three of the times the Alameda appears
- so if I make a table of the true and false value
- it’s true three of the times.
table(grepl("Alameda", df$intersection)
FALSE TRUE
77 3
The other thing that you can do is you can subset.
- For example, suppose I want to find all the cases where Alameda appears in the intersection
- And if Alameda doesn’t appear, then
- I want to subset to only that subset of the data
- I can do that using this grepl command. So you can use that to subset your data based on certain searches that you want to be able to find.
subset
- Below I subset df for rows that don’t contain “Alameda”
<- df[!grepl("Alameda", df$intersection),] subgroup
substr
- You can pull out a substring from a string.
- Let’s say I want to extract the first 6 characters of the string, I can use
substr("Jeffery is Santa's Helper", 1,6)
1] "Jeffer" [
Select
- We already covered select() in filter.
- Now, let’s say you are primarily interested in the following variables: ‘hotel’, ‘is_canceled’, and ‘lead_time’.
- Create a new data frame with just those columns, calling it `trimmed_df` by adding the variable names to this code chunk:
<- bookings_df |>
trimmed_df select(hotel,is_canceled,lead_time )
Separate
- This is a useful function, for all the times you deal with data that contain combined columns/variables and you want to separate them into multiple columns
- Start with separate, and then the data frame we want to work with and the column we’d like to separate.
- Then we’ll add what we’d like to split the name column into.
- To separate the value in name_col into separate values and into 2 separate columns
- We’ll just name these new columns, first name and last name.
- And finally, we’ll tell R to separate the name column at the first blank space.
separate(employee_df,name_col,into=c("first_name","last_name"), sep = " ")
#we could use
separate(employee_df,name_col,c('first_name','last_name'), sep = ' ')
Unite
- dplyr package
- Syntax: Unite(data, col, …,sep=““)
- Another common task is to combine data from different columns, the opposite of separate()
- First argument is data, unless you are using a pipe
- Second argument is the new name of the united column
- Followed by a vector of names for the columns to be united
- Followed by a sperator
- In this example, you can combine the arrival month and year into one column using the unite() function. NOTE the differene in unite as opposed to separate:
unite(employee_df,'name_col',first_name,last_name, sep = ' ')
<- bookings_df |>
example_df select(arrival_date_year, arrival_date_month) |>
unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"), sep = " ")
head(example_df)
Mutate
- From dplyr as well
- Similar to unite(), as it adds columns into one new one
- Syntax: mutate(data, new_name = old1 + old2 + old3…)
- Just as before if you are using it in a pipe setup then omit the data argument
You can also use the`mutate()` function to make changes to your columns. Let’s say you wanted to create a new column that summed up all the adults, children, and babies on a reservation for the total number of people. The mutate() function lets you change the data frame, not preview it.
<- bookings_df |>
example_df mutate(guests = adults + children + babies )
head(example_df)
conditional & mutate
Of course you can combine mutate with conditional and calculations
|>
penguins mutate(body_mass_kg = body_mass_g/1000,flipper_length_m = flipper_length_mm/1000)
Expressions
Regular expressions can be thought of as a combination of literals and metacharacters. If we compare that to natural language, think of literal text as the words and metacharacters as its grammar. Regular expressions have metacharacters that allow us to search through strings to identify specific patterns of interest.
Literal, consist of words that match exactly. So in the previous paragraph “expressions” would be a word we can search for literally and we find it as it appears several times. A match occurs ONLY if the exact sequence occurs.
Metacharacters are used if we want to match any occurrence of “express”, or if a word end in express, or start with it.
Start of Line
^
If we need to find a line that starts with “a character” for example, if we want any words/or character that start a line we use ^. So if we want a line that starts with i we use
^i
- Metacharacter that represent the start of a line.
- ^i is any line that starts with (i).
- So if (i) appears anywhere else we don’t pick it up
End of Line
$
morning$
- Represents the end of the line (morning$) is the word morning appearing at the end of the line
Anywhere
[Bb][Uu][Ss][Hh]
[Bb][Uu][Ss][Hh]
- List of characters that appear at any point in the match
- (B) represents the capital B and (b) represents the lower case b
- So anytime BUSH/buSH/Bush/… appear it’s a mtach
Char to Start Line
^[Ii]
^[Ii]
- Beginning of a line and either a I or an i
Range to Start a line
^[a-z] or [a-zA-Z]
^[0-9][a-zA-Z]
- To match a range of letters lowercase, cap or mixed [a-zA-Z
- So ^[0-9][a-zA-Z] will look for the beginning of the line that starts with any number followed by lowercase or uppercase letter
Negation
[^?.]$
[^?.]$
- If we use ^ inside the brackets it means the negative, or NOT
- so [^?.]$ means: $ is end of the line, so we are looking for anything that ends the line and
- [^ ] means NOT a (? or .)
- so all together comes out as: we are looking for a line that does NOT end with a ? or .
Any Char
.
.
- (.) means that it could be any character.
- So: 9.11 means anything that starts with 9 and ends with 11 and
- could have any character in between. 9-11 9/11 9.11 9:11 0.12233499123
Or
|
flood | fire
- Combines two different expressions. flood | fire will match any line that contains either flood or fire or both in it
- We can string it out as turkey|on|bird|day|…..
Expressions Start of Line
regular expressions
^[Gg]ood | [Bb]ad
- We could structure regular expressions to search for. ^[Gg]ood | [Bb]ad
- so here it will match any combination of Good or Bad regardless of the capitalization of the first letter of each word
- What’s important here is that Ggood has to be at the beginning of the line, but Bbad could be anywhere (it has no ^)
Options to Start of Line
subexpressions
^([Gg]ood | [Bb]ad)
- Here you notice the parenthesis right after the ^ which means that either word HAVE to start the line
Optional
? optional
[Gg]eorge ([Ww]\.)? [Bb]ush
- the question mark ? outside and trailing the [Ww] parenthesis means it’s optional.
- So, regardless whether that condition is met or not the other matches will be accepted
- We escaped the char (.) by adding the (\)
Any char
*
h(.*)s
- (*) means repeat any number of times,
- so we are searching for anything between parenthesis and can be repeated any number of times
- in the code above: we are looking for something that starts with h - any characters in between - ends with s
- note: it will always extract the longest string
- These will match: (he means business ..)
- example below
^s(.*)s
- We have a start of line with s
- followed by a string of anything and
- ending in s
- Example: string me up I’m going to starbucks
One time Any Char
Less greedy *
- This is an extension the * from above
- We can turn off the fact that it continues till the end by using ? like this
^s(.*?)s$
- Start the line with s,
- follow by unlimited number of any characters .*
- but with ? which means it’ll stop at the first word that ends with an s
At Least One
+ at least one items
[0-9]+ (.*)[0-9]+
- []+ the + following the[ ] means a start of at least one number
- followed by at least one of either (.*) any character followed by
- [0-9]+ at least one number
Number of Matches
[Bb]ush ( +[^ ]+ +){1,5} debate
- {1,5} specify the minimum or maximum number of matches
- so we want to see a match between 1 and 5 times.
- Line starts with either Bbush
- ends with debate
- and in between ( +[^ ] means: Bush followed by a space ( +…
- ^ inside a bracket means NOT/Negation then [^ >> Not a space followed by
- a space and it needs to happen
- between {1-5} times and
- needs to end with debate
- so we need to find Bush debate
- {m,n} means at least m but not more than n
- {m} means exactly m matches
- {m,} means at least m matches
Exact Match Replication
+([a-zA-Z]+) +\1 +
+([a-zA-Z]+) +\1 +
- We want a space : < >+
- +followed by at least one character [a-zA-Z]
- followed by one space ..) < >+\
- followed by the same exact match/replication we saw in the parenthesis \1 +
- Note: I used < > to indicate a space but in reality it is just a space not < >
Examples would be:
- time for bed night night santa! - so as you see we have night followed by a space and a duplication of night.
- my head is so so itchy
- Or we are all all alone
- Blah blah blah