Clean - Edit

We’ll be using the data that comes built-in R

Data

packages

library(tidyverse)
library(janitor)  # for clean_names
#this will be used in reshape section
#install.packages("reshape")
library(reshape2)  # for melt
library(data.table)  #for dcast

data

To see the list of default data use

data()

To load a specific dataset

data("mtcars")

clean_names

An R function that makes sure your column names are unique and consistent to avoid any errors during analysis.
Requires “janitor” package
Syntax: clean_name(df, …)
This package follows the principles of the “tidyverse” and in particular works well with the %>% pipe function. Janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced users can already do everything covered here, but they can do it faster with janitor and save their thinking for more fun tasks.

x <- data.frame(caseID = 1, DOB = 2, Other = 3)
clean_names(x) 

# or pipe in the input data.frame:
x |>  clean_names()

# if you prefer camelCase variable names:
x |>  clean_names(.,"lower_camel")

readxl

You can run it directly after you read the data

library(readxl)
read_excel("messy_excel_file.xlsx") |>  clean_names() 

# Restore column names to Title Case, e.g., for plotting
mtcars |>  clean_names(case = "title")

abbreviations

You can tell clean_names to leave certain abbreviations untouched

x |> clean_names(case = "upper_camel", abbreviations = c("ID", "DOB"))

round

We can round integers in multiple ways one of them is round()
If we don’t want any decimals set digits = 0

median = round(median(ride_length), digits = 0)

Replace

change values

What if we want to change values in a vector that meet a certain condition to another value?

Set all elements of this vector that are greater than 10 to be equal 4

x <- c(17, 14, 4, 5, 13, 12, 10)
x[x > 10] <- 4 

#___________________OR 
x[x>=11] <- 4

rename

Just as it sounds, we can rename a col name with the new name on the left of the = and old name on the right
Syntax: rename(data, new name= old name)
If pipe is being used you can omit data as you see below

trimmed_df |> 
        select(hotel, is_canceled, lead_time) |> 
        rename(hotel_type = hotel) 
head(trimmed_df)

rename multiple

To rename multiple columns at once. Left side of = is the new name and just chain them together

trips19 <- rename( trips19,
                   ride_id = trip_id,
                   started_at = start_time,
                   ended_at = end_time,
                   rideable_type = bikeid,
                   start_station_name = from_station_name,
                   start_station_id = from_station_id, 
                   end_station_name = to_station_name,
                   end_station_id = to_station_id,
                   member_casual = usertype)

sub

Let’s say the column names have “reviwer_id”, “time_left” and we want to go through the names, find the pattern and replace it with “” (no space).

sub() fixes the first instance it encounters and stops

Let’s look at the colnames first
substitute “” for “_”

names(df)
sub("_", "", names(df))

gsub

Same as sub() but it will replace all the instances instead of stopping after the first.

gsub("_","", names(df))

grep

Very good detailed examples regarding grep vs grepl can be found at Extract Column in the How To section

Let’s say I’m gonna look at the intersection variable.
So one thing that I might want to be able to do is find all of the intersections that include the Alameda as one of the roads.
grep will take as input a search string that you want to look for “Alameda” and
it will look through this variable and find all of the instances in that vector where the Alameda appears.

grep("Alameda", df$intersection)
[1]  4  5  36

You can also use value=TRUE as one of the parameters sent to grep.
Instead of telling you which of the elements it found that Alameda appears in, it will actually return the values where Alameda appears.
So you can see these are the three intersections where the Alameda appears in the df.

grep("Alameda", df$intersection, value=TRUE)
[1] "The Alameda & 33rd st"  "E 33rd & The Alameda"   "Harford...."

The other thing that you can do is you can look for values that don’t appear.
So say, for example, JeffStreet does not appear in the intersection data, and so
it will return integer (0) when that value does not appear.

grep("Jeffers", df$intersection)
integer(0)

length(grep("Jeffers", df$intersection))
[1] 0

grepl

Will look for Alameda
It will look for it in this intersection variable
And it will return a LIST that’s true whenever Alameda appears and false whenever Alameda doesn’t appear.

And so in this case you can see that in that three of the times the Alameda appears

so if I make a table of the true and false value
it’s true three of the times.

table(grepl("Alameda", df$intersection)
      FALSE   TRUE
      77       3

The other thing that you can do is you can subset.

For example, suppose I want to find all the cases where Alameda appears in the intersection
And if Alameda doesn’t appear, then
I want to subset to only that subset of the data
I can do that using this grepl command. So you can use that to subset your data based on certain searches that you want to be able to find.

subset

Below I subset df for rows that don’t contain “Alameda”

subgroup <- df[!grepl("Alameda", df$intersection),]

substr

You can pull out a substring from a string.
Let’s say I want to extract the first 6 characters of the string, I can use

substr("Jeffery is Santa's Helper", 1,6)
[1] "Jeffer"

Select

We already covered select() in filter.
Now, let’s say you are primarily interested in the following variables: ‘hotel’, ‘is_canceled’, and ‘lead_time’.
Create a new data frame with just those columns, calling it `trimmed_df` by adding the variable names to this code chunk:

trimmed_df <- bookings_df |> 
        select(hotel,is_canceled,lead_time )

Separate

This is a useful function, for all the times you deal with data that contain combined columns/variables and you want to separate them into multiple columns
Start with separate, and then the data frame we want to work with and the column we’d like to separate.
Then we’ll add what we’d like to split the name column into.
To separate the value in name_col into separate values and into 2 separate columns
We’ll just name these new columns, first name and last name.
And finally, we’ll tell R to separate the name column at the first blank space.

separate(employee_df,name_col,into=c("first_name","last_name"), sep = " ")
#we could use
separate(employee_df,name_col,c('first_name','last_name'), sep = ' ')

Unite

dplyr package
Syntax: Unite(data, col, …,sep=““)
Another common task is to combine data from different columns, the opposite of separate()
First argument is data, unless you are using a pipe
Second argument is the new name of the united column
Followed by a vector of names for the columns to be united
Followed by a sperator
In this example, you can combine the arrival month and year into one column using the unite() function. NOTE the differene in unite as opposed to separate:

unite(employee_df,'name_col',first_name,last_name, sep = ' ')

example_df <- bookings_df |> 
        select(arrival_date_year, arrival_date_month) |> 
        unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"), sep = " ")
head(example_df)

Mutate

From dplyr as well
Similar to unite(), as it adds columns into one new one
Syntax: mutate(data, new_name = old1 + old2 + old3…)
Just as before if you are using it in a pipe setup then omit the data argument

You can also use the`mutate()` function to make changes to your columns. Let’s say you wanted to create a new column that summed up all the adults, children, and babies on a reservation for the total number of people. The mutate() function lets you change the data frame, not preview it.

example_df <- bookings_df |> 
        mutate(guests = adults + children + babies )
head(example_df)

conditional & mutate

Of course you can combine mutate with conditional and calculations

penguins |> 
        mutate(body_mass_kg = body_mass_g/1000,flipper_length_m = flipper_length_mm/1000)

Expressions

Regular expressions can be thought of as a combination of literals and metacharacters. If we compare that to natural language, think of literal text as the words and metacharacters as its grammar. Regular expressions have metacharacters that allow us to search through strings to identify specific patterns of interest.

Literal, consist of words that match exactly. So in the previous paragraph “expressions” would be a word we can search for literally and we find it as it appears several times. A match occurs ONLY if the exact sequence occurs.

Metacharacters are used if we want to match any occurrence of “express”, or if a word end in express, or start with it.

Start of Line

^

If we need to find a line that starts with “a character” for example, if we want any words/or character that start a line we use ^. So if we want a line that starts with i we use

^i

Metacharacter that represent the start of a line.
^i is any line that starts with (i).
So if (i) appears anywhere else we don’t pick it up

End of Line

$

morning$

Represents the end of the line (morning$) is the word morning appearing at the end of the line

Anywhere

[Bb][Uu][Ss][Hh]

[Bb][Uu][Ss][Hh]

List of characters that appear at any point in the match
(B) represents the capital B and (b) represents the lower case b
So anytime BUSH/buSH/Bush/… appear it’s a mtach

Char to Start Line

^[Ii]

^[Ii]

Beginning of a line and either a I or an i

Range to Start a line

^[a-z] or [a-zA-Z]

^[0-9][a-zA-Z]

To match a range of letters lowercase, cap or mixed [a-zA-Z
So ^[0-9][a-zA-Z] will look for the beginning of the line that starts with any number followed by lowercase or uppercase letter

Negation

[^?.]$

[^?.]$

If we use ^ inside the brackets it means the negative, or NOT
so [^?.]$ means: $ is end of the line, so we are looking for anything that ends the line and
[^ ] means NOT a (? or .)
so all together comes out as: we are looking for a line that does NOT end with a ? or .

Any Char

.

(.) means that it could be any character.
So: 9.11 means anything that starts with 9 and ends with 11 and
could have any character in between. 9-11 9/11 9.11 9:11 0.12233499123

Or

|

flood | fire

Combines two different expressions. flood | fire will match any line that contains either flood or fire or both in it
We can string it out as turkey|on|bird|day|…..

Expressions Start of Line

regular expressions

^[Gg]ood | [Bb]ad

We could structure regular expressions to search for. ^[Gg]ood | [Bb]ad
so here it will match any combination of Good or Bad regardless of the capitalization of the first letter of each word
What’s important here is that Ggood has to be at the beginning of the line, but Bbad could be anywhere (it has no ^)

Options to Start of Line

subexpressions

^([Gg]ood | [Bb]ad)

Here you notice the parenthesis right after the ^ which means that either word HAVE to start the line

Optional

? optional

[Gg]eorge ([Ww]\.)? [Bb]ush

the question mark ? outside and trailing the [Ww] parenthesis means it’s optional.
So, regardless whether that condition is met or not the other matches will be accepted
We escaped the char (.) by adding the (\)

Any char

*

h(.*)s

(*) means repeat any number of times,
so we are searching for anything between parenthesis and can be repeated any number of times
in the code above: we are looking for something that starts with h - any characters in between - ends with s
note: it will always extract the longest string
These will match: (he means business ..)
example below

^s(.*)s

We have a start of line with s
followed by a string of anything and
ending in s
Example: string me up I’m going to starbucks

One time Any Char

Less greedy *

This is an extension the * from above
We can turn off the fact that it continues till the end by using ? like this

^s(.*?)s$

Start the line with s,
follow by unlimited number of any characters .*
but with ? which means it’ll stop at the first word that ends with an s

At Least One

+ at least one items

[0-9]+ (.*)[0-9]+

[]+ the + following the[ ] means a start of at least one number
followed by at least one of either (.*) any character followed by
[0-9]+ at least one number

Number of Matches

[Bb]ush ( +[^ ]+ +){1,5} debate

{1,5} specify the minimum or maximum number of matches
so we want to see a match between 1 and 5 times.
Line starts with either Bbush
ends with debate
and in between ( +[^ ] means: Bush followed by a space ( +…
^ inside a bracket means NOT/Negation then [^ >> Not a space followed by
- - a space and it needs to happen
between {1-5} times and
needs to end with debate
so we need to find Bush debate
- {m,n} means at least m but not more than n
- {m} means exactly m matches
- {m,} means at least m matches

Exact Match Replication

+([a-zA-Z]+) +\1 +

 +([a-zA-Z]+) +\1 +

We want a space : < >+
+followed by at least one character [a-zA-Z]
followed by one space ..) < >+\
followed by the same exact match/replication we saw in the parenthesis \1 +
Note: I used < > to indicate a space but in reality it is just a space not < >

Examples would be:

time for bed night night santa! - so as you see we have night followed by a space and a duplication of night.
my head is so so itchy
Or we are all all alone
Blah blah blah