x <- 5
if (x > 3) {w <- 48} else {w <- x}
w
[1] 48
A conditional statement is a declaration that if a certain condition holds, then a certain event must take place. For example, “If the temperature is above freezing, then I will go outside for a walk.” If the condition is true (the temperature is above freezing), then the consequence follows (I will go for a walk). Conditional statements in R code have a similar logic.
Let’s discuss how to create conditional statements in R using three related statements: if, else, and else if.
if (x > 0) {print("x is a positive number") }
if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}
In some cases, you might want to customize your conditional statement even further by adding the else if statement.
The else if statement comes in between the if statement and the else statement. This is the code structure:
if (condition1) {
  expr1
} else if (condition2) {
  expr2
} else {
  expr3
}
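As a minimal runnable sketch of that structure (the variable names `x` and `label` are illustrative, not taken from the text), note that at the top level of a script each `else` has to sit on the same line as the closing `}` of the previous branch, otherwise R treats the `if` as finished and reports an unexpected `else`:

```r
# Classify a number with if / else if / else.
# `x` and `label` are illustrative names chosen for this sketch.
x <- 0

if (x > 0) {
  label <- "x is a positive number"
} else if (x < 0) {
  label <- "x is a negative number"
} else {
  label <- "x is zero"
}

print(label)
```

Only the first branch whose condition is TRUE runs; the final else catches everything that is left.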
The else branch is not necessary: if the condition is false, R simply skips the block and moves on, possibly to another if (<condition>), and so on.
if (<condition>) {
  # do something
} else {
  # do something else
}
if (<condition1>) {
  # do something
} else if (<condition2>) {
  # do something different
} else {
  # do something different
}
Another way of thinking about it is this:
if (x > 3) {
  y <- 10
} else {
  y <- 0
}
Or, because if returns a value, we can assign its result directly:
y <- if (x > 3) {
  10
} else {
  0
}
A for loop takes a loop index (i) and loops over a sequence such as 1:10; when the loop is done, execution continues down the code. For example, let’s create a vector xx here.
Notice the different ways we can use for ... in ... to loop through it:
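The simplest form can be sketched as follows. The contents of `xx` are an assumption, inferred from the four letters that appear in the printed output:

```r
# Assumed input: the four letters that show up in the printed output.
xx <- c("a", "b", "c", "d")

# Loop over the positions 1 to 4 and print the element at each position.
for (i in 1:4) {
  print(xx[i])
}
```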
[1] "a"
[1] "b"
[1] "c"
[1] "d"
# Same as above: because the body is one short statement, we can omit the {}
for (i in 1:4) print(xx[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] NA
[1] NA
[1] NA
[1] NA
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] 1
[1] 2
[1] 3
[1] 4
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
The seq_along() function in R is used to generate a sequence of integers along the length of its argument.
seq_along() takes a vector as input and creates an integer sequence that is equal in length to that vector, so the length is derived from the vector itself. This is very useful if we have a long dataset and we have no idea how long it is: we let seq_along() figure out the length.
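A short sketch of the idea, reusing the assumed vector `xx` of four letters. The second half shows why seq_along() is safer than 1:length(x) when the vector might be empty:

```r
xx <- c("a", "b", "c", "d")   # assumed example vector

# seq_along() derives the index sequence from the vector itself.
idx <- seq_along(xx)
for (i in idx) {
  print(xx[i])
}

# With an empty vector, seq_along() gives an empty sequence, so a loop
# over it never runs, while 1:length() counts down from 1 to 0.
empty <- character(0)
safe_idx  <- seq_along(empty)
risky_idx <- 1:length(empty)
print(safe_idx)
print(risky_idx)
```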
Here is another example of seq_along()
The index used in the for loop doesn’t need to be an integer like i that was used above. We can use a word, like here:
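For instance, the loop variable can take each element of the vector directly. The names `letter` and `collected` are made up for this sketch:

```r
xx <- c("a", "b", "c", "d")   # assumed example vector

collected <- character(0)
# `letter` holds the element itself, not its position.
for (letter in xx) {
  print(letter)
  collected <- c(collected, letter)
}
```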
Of course we can nest for loops, but try to stay at 2-3 levels at most, because nested loops get pretty hard to read for someone else. Nested loops are good for reading a matrix. Let’s say we have a matrix with 2 rows and 3 columns and we want to loop through it and read every element.
This works like seq_along(), except that the length of the sequence comes from the number of rows in the outer loop (and from the number of columns in the inner loop).
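One way this could look, assuming the matrix is filled with the numbers 1 to 6 (the name `m` and its contents are illustrative):

```r
# A 2 x 3 matrix; R fills it column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)

visited <- integer(0)
# Outer loop walks the rows, inner loop walks the columns.
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    print(m[i, j])
    visited <- c(visited, m[i, j])
  }
}
```

seq_len(n) builds the sequence 1, 2, ..., n from a number, which is why it pairs naturally with nrow() and ncol().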
Here you’ll see that the first loop never executes, because the counter (count = 2) is already outside the condition before we start. If we reset count to 5, the loop runs:
count <- 2
while (count >= 3 && count <= 10) {
  print(paste("yasha ate", count, "cookies", sep = " "))
  count <- count + 1
}
count <- 5
while (count >= 3 && count <= 10) {
  print(paste("yasha ate", count, "cookies", sep = " "))
  count <- count + 1
}
[1] "yasha ate 5 cookies"
[1] "yasha ate 6 cookies"
[1] "yasha ate 7 cookies"
[1] "yasha ate 8 cookies"
[1] "yasha ate 9 cookies"
[1] "yasha ate 10 cookies"
rbinom(x,y,z):
repeat initiates an infinite loop. These loops are not commonly used in statistical applications, but they do have their uses. The only way to stop one is to call break.
x0 <- 1
tol <- 1e-8
repeat {
  # computeEstimate() is a placeholder; it is not defined in this document
  x1 <- computeEstimate()
  if (abs(x1 - x0) < tol) {
    break
  } else {
    x0 <- x1
  }
}
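To make the pattern runnable, here is a sketch that swaps the placeholder for a concrete update rule. `newton_step()` is a hypothetical stand-in (one Newton step towards the square root of 2), and the iteration cap is an added safety guard, since a repeat loop that never converges would otherwise run forever:

```r
# Hypothetical stand-in for computeEstimate(): one Newton step for sqrt(2).
newton_step <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
iterations <- 0

repeat {
  x1 <- newton_step(x0)
  iterations <- iterations + 1
  # Stop once two successive estimates agree to within the tolerance.
  if (abs(x1 - x0) < tol) {
    break
  }
  # Safety guard (an addition for this sketch): give up after 100 rounds.
  if (iterations >= 100) {
    break
  }
  x0 <- x1
}

print(x1)
```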
next is used in any type of looping construct when you want to skip an iteration.
for (i in 1:100) {
  # skip the first 20 iterations
  if (i <= 20) {
    next
  }
  print(i)
}
return signals that a function should stop/exit and return a value. We will look at it in Functions.
In this document we’ll cover Functions and Loops. Both tend to occur together most of the time. We’ll start with functions that loop by nature, so they need no explicit loop to go through all the rows of a data frame.
Loop functions come in handy when using the command line. But because of their simplicity we tend to use them when programming as well. Here we’ll cover the following:
anti_join() is a dplyr function that returns all rows from x without a match in y
semi_join() is a dplyr function that returns all rows from x with a match in y
This function allows you to vectorise multiple switch() statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used. case_match() is an R equivalent of the SQL “simple” CASE WHEN statement.
Syntax:
See a good example in How To: Convert Column
More information is found here on case_match & case_when. Let’s say we just combined two datasets with bind_rows() or rbind(), and we want to change the values in one column so that they are consistent across the merged dataset.
One df had the column with values of member or casual, and a few years later the company decided to use Subscriber & Customer. So when we merged the two df together, we ended up with 4 values in a column that should only have two.
To replace the old values with the renamed ones, we can use case_when() like this to replace member with Subscriber and casual with Customer:
# Replace the values in column member_casual:
# member -> Subscriber, casual -> Customer
library(dplyr)  # provides mutate() and case_when()
all_trips19_20 <- trim_trips19_20 |>
  mutate(member_casual = case_when(
    member_casual == "member" ~ "Subscriber",
    member_casual == "casual" ~ "Customer",
    member_casual == "Subscriber" ~ "Subscriber",
    TRUE ~ "Customer"  # everything else, including existing "Customer"
  ))
lapply(list, function, …) loops over a list, applies the function to each element, and always returns a list. Each of the apply functions will split up some data into smaller pieces, apply a function to each piece, then combine the results.
We can use lapply to read an entire list of files.
Let’s use lapply() to find the mean of each element of a list with two elements, a and b
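A sketch of a call that produces output of this shape. The seed is an assumption added for reproducibility, and because b is drawn at random its values will not match the printed numbers:

```r
set.seed(1)  # assumed seed, only so that reruns give the same draws

# A list with an integer vector and ten standard normal draws.
x <- list(a = 1:5, b = rnorm(10))
print(x)

# lapply() applies mean() to each element and returns a list.
means <- lapply(x, mean)
print(means)
```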
$a
[1] 1 2 3 4 5
$b
[1] -0.09554446 0.09151091 0.08724459 -0.16105597 0.03342112 -0.37277600
[7] 0.96397316 -0.57498766 -3.05720241 1.34645172
$a
[1] 3
$b
[1] -0.1738965
Let’s create a matrix
Let’s pull the second column from both matrices using lapply and an anonymous function (one that has no name and only exists for this call, since we are making it up on the spot).
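A minimal sketch, assuming the list holds one 2 x 2 and one 3 x 2 matrix (the matrices and the names `x`, `m`, and `second_cols` are illustrative):

```r
# Two matrices of different sizes stored in one list.
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

# The anonymous function receives each matrix as `m`
# and returns its second column.
second_cols <- lapply(x, function(m) m[, 2])
print(second_cols)
```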
Let’s use the default dataset iris and find
See tapply() below for another answer to solve the same question.
sapply() returns a simplified result, for example a vector instead of a list where that is possible. Look at the split example; now let’s use sapply instead of lapply:
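The difference is easiest to see side by side. The list below is a made-up example:

```r
x <- list(a = 1:5, b = c(10, 20, 30))

# lapply() always returns a list, one element per input element.
as_list <- lapply(x, mean)

# sapply() simplifies: every result has length 1,
# so we get a named numeric vector back.
as_vec <- sapply(x, mean)

print(as_list)
print(as_vec)
```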
Let’s say we want to go through the entire list and split all the Names. We can use split inside sapply within a function that we can call that splits all the column names and returns the first row in the list [1]
As you see above some rows have NA so not all columns can be calculated. Let’s remove NAs
apply() is used to evaluate a function over the margins of an array, most often to apply a function to the rows or columns of a matrix (which is a 2-dimensional array). For example, you can take the average of an array of matrices.
Syntax: apply(X, MARGIN, FUN, …)
# MARGIN = 2 keeps the columns, so we get the mean of each column
x <- matrix(rnorm(200), 20, 10)
apply(x, 2, mean)
[1] 0.05414183 -0.04547356 0.07998758 0.14985420 0.18250894 0.01788141
[7] -0.11344563 0.02771391 -0.06979407 0.29995799
# MARGIN = 1 keeps the rows, so we get the mean of each of the 20 rows
apply(x, 1, mean)
[1] 0.40544838 0.41852062 0.88402390 0.43815163 0.15892825 0.47504460
[7] -0.11451130 0.44380040 0.09284149 0.20627488 -0.55106774 0.06142513
[13] -0.20506548 0.29201586 -0.41450291 -0.17892319 -0.05640784 -0.52402650
[19] -0.37012584 -0.29517914
What if we want a vector of the means of the variables ‘Sepal.Length’, ‘Sepal.Width’, ‘Petal.Length’, and ‘Petal.Width’ from the iris dataset?
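One possible answer, sketched with the built-in iris data. The first four columns are the numeric measurements, and colMeans() is shown as an equivalent shortcut:

```r
# Apply mean() to each of the four measurement columns.
iris_means <- sapply(iris[, 1:4], mean)
print(iris_means)

# colMeans() gives the same named vector in one call.
iris_means_alt <- colMeans(iris[, 1:4])
```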
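tapply() is a combination of split() and sapply() for vectors only. It is used to apply a function over subsets of a vector: you split your data into groups based on the value of some variable, then apply a function to the members of each group.
Syntax: tapply(X, INDEX, FUN, …, simplify), where X is a vector, INDEX is a factor (or a list of factors) of the same length as X, FUN is the function to apply, … holds further arguments passed to FUN, and simplify controls whether the result is simplified.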
Refer to Factors – R for Data Science (2e) for more information.
Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
We’ll start by motivating why factors are needed for data analysis and how you can create them with factor().
Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse. It provides tools for dealing with categorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.
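gl() is used to define groups: gl(n, k) generates a factor with n levels, each repeated k times.
To define three groups with a factor variable we use n = 3 and k = 10, so that the factor matches the length of x, which is 30.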
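As a sketch under assumptions (the way `x` is built and the seed are illustrative, chosen only so that x has 30 values), gl() and tapply() fit together like this:

```r
set.seed(42)  # assumed seed for reproducibility

# 30 values: three blocks of 10 drawn from different distributions.
x <- c(rnorm(10), runif(10), rnorm(10, mean = 1))

# 3 levels, each repeated 10 times: 1 1 ... 1 2 2 ... 2 3 3 ... 3
f <- gl(3, 10)
print(table(f))

# Mean of x within each of the three groups.
group_means <- tapply(x, f, mean)
print(group_means)
```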
To create a factor you must start by creating a list of the valid levels:
x1 <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Sort outcome
Silent NAs
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Omit levels
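The three behaviours named in these headings can be sketched as follows. The misspelled value "Jam" in `x2` is an assumption, picked to reproduce the <NA> seen in the output above:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month_levels)

# Sorting follows the level order, not the alphabet.
sorted <- sort(y1)
print(sorted)

# A value that is not among the levels is silently converted to NA.
x2 <- c("Dec", "Apr", "Jam", "Mar")   # "Jam" is an assumed typo
y2 <- factor(x2, levels = month_levels)
print(y2)

# If the levels are omitted, they are taken from the data in alphabetical order.
y3 <- factor(x1)
print(levels(y3))
```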
We’ll use the flags dataset here again. The “landmass” variable takes on integer values between 1 and 6. Let’s use the first line of code to see how many flags fall into each group
table(flags$landmass)
1 2 3 4 5 6
31 17 35 52 39 20
table(flags$animate)
# This tells us how many flags contain an animate object
  0   1
155  39
What if we want to calculate the mean of the animate variable separately for each of the six landmass groups? Because animate is coded as 0 or 1, that mean tells us what proportion of the flags contain an animate image WITHIN each landmass group.
So we will place the animate group first in the function followed by the landmass group like this:
tapply(flags$animate, flags$landmass, mean)
1 2 3 4 5 6
0.4193548 0.1764706 0.1142857 0.1346154 0.1538462 0.3000000
# Landmass 1 (N. America) contains the highest proportion of flags with an animate image
What if we want to find the population (in millions) for countries with and without the color red on their flags? In the output below, note that 0 indicates flags WITHOUT red in it and 1 for flags WITH red in it.
tapply(flags$population, flags$red, summary)
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 3.00 27.63 9.00 684.00
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 4.0 22.1 15.0 1008.0
What if we want to find the population (in millions) for all landmasses?
tapply(flags$population, flags$landmass, summary)
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 12.29 4.50 231.00
$`2`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 6.00 15.71 15.00 119.00
$`3`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 8.00 13.86 16.00 61.00
$`4`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 5.000 8.788 9.750 56.000
$`5`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 2.00 10.00 69.18 39.00 1008.00
$`6`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 11.30 1.25 157.00
Using the mtcars dataset, we want to calculate the average miles per gallon (mpg) by number of cylinders in the car (cyl). Here are multiple ways this can be done:
tapply(mtcars$mpg, mtcars$cyl, mean)
sapply(split(mtcars$mpg, mtcars$cyl), mean)
Watch how the same code above can be written using with(). with() evaluates an expression inside a data frame (it also accepts lists and environments), so the columns can be referenced by name.
with(mtcars, tapply(mpg, cyl, mean))
Using the mtcars dataset again, what’s the difference in hp between 4 and 8 cylinders?
tapply(mtcars$hp, mtcars$cyl, mean)
4 6 8
82.63636 122.28571 209.21429
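The difference itself can be read off the table or computed directly; a small sketch, where `hp_by_cyl` and `hp_diff` are names chosen for illustration:

```r
# Mean horsepower for each cylinder count (names are "4", "6", "8").
hp_by_cyl <- tapply(mtcars$hp, mtcars$cyl, mean)

# Subtract the 4-cylinder mean from the 8-cylinder mean.
hp_diff <- unname(hp_by_cyl["8"] - hp_by_cyl["4"])
print(hp_diff)
```

With the means shown above this comes to roughly 126.58 hp.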
Let’s use the default dataset iris and find the mean Sepal.Length for each Species
tapply(iris$Sepal.Length, iris$Species, mean)
setosa versicolor virginica
5.006 5.936 6.588
mapply() is a kind of multivariate apply: it applies a function in parallel over a set of arguments. lapply() and the others iterate over a single R object, whereas mapply() iterates over multiple R objects in parallel.
Syntax: mapply(FUN, …, MoreArgs, SIMPLIFY)
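The output below is consistent with a call along these lines, though the original code is not shown, so treat this as a sketch:

```r
# Pairs up the two vectors element by element:
# rep(1, 3), rep(2, 2), rep(3, 1)
reps <- mapply(rep, 1:3, 3:1)
print(reps)
```

Because the three results have different lengths, mapply() cannot simplify them and returns a list.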
[[1]]
[1] 1 1 1
[[2]]
[1] 2 2
[[3]]
[1] 3
vapply() allows you to specify explicitly the format of the results.
Now suppose we tell vapply() to expect each value returned by unique() to be a numeric vector of length 1. Look at the code below and let’s see what the output is:
vapply(flags, unique, numeric(1))
Error in vapply(flags, unique, numeric(1)) : values must be length 1,
but FUN(X[[1]]) result is length 194
Now let’s try it again with something else, we already displayed the class() of all elements in flags with sapply() let’s do it again
sapply(flags, class)
       name    landmass        zone        area  population    language    religion        bars     stripes     colours         red
"character"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"
      green        blue        gold       white       black      orange     mainhue     circles     crosses    saltires    quarters
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character"   "integer"   "integer"   "integer"   "integer"
   sunstars    crescent    triangle        icon     animate        text     topleft    botright
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character" "character"
The output shows a mix of classes across the columns of the dataset. Because class() returns a single string for each of these columns, we can tell vapply() to expect a character vector of length 1 from every call. Let’s see how that looks in the code:
vapply(flags, class, character(1))
       name    landmass        zone        area  population    language    religion        bars     stripes     colours         red
"character"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"
      green        blue        gold       white       black      orange     mainhue     circles     crosses    saltires    quarters
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character"   "integer"   "integer"   "integer"   "integer"
   sunstars    crescent    triangle        icon     animate        text     topleft    botright
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character" "character"
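Since the flags dataset may not be at hand, the same two behaviours can be tried on the built-in iris data. This is a sketch, and the tryCatch() wrapper is only there so the script keeps running after the expected error:

```r
# Every column's class is a single string, so the template fits.
col_classes <- vapply(iris, class, character(1))
print(col_classes)

# unique() returns many values per column, so a length-1 numeric
# template does not fit and vapply() stops with an error.
result <- tryCatch(
  vapply(iris, unique, numeric(1)),
  error = function(e) "error"
)
print(result)
```

Where sapply() would quietly return whatever shape it gets, vapply() fails loudly when a result does not match the declared template.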
Syntax: split(x, f, drop)
split() divides x into the groups defined by the factor f, and is usually used with lapply().
$`1`
[1] 0.30690859 -0.07689829 -0.32973287 1.87096565 -0.12805030 0.94880053
[7] 0.09686608 -0.04668742 -1.16425495 -0.29149315
$`2`
[1] 0.47301698 0.12071165 0.09234822 0.47730813 0.50042387 0.47376869
[7] 0.75344793 0.10455063 0.02543205 0.06343151
$`3`
[1] 0.78452350 1.20306303 0.54648064 0.03639915 1.80512956 0.68396469
[7] 2.03985863 0.53653137 1.92598369 -0.24756685
Or something like this:
As we stated above, split will break the data into groups and then you can perform some kind of calculation over the groups using lapply
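The printout that follows is consistent with a sequence like the one sketched here on the built-in airquality data. The names `s` and `monthly_means` are illustrative:

```r
# First rows of the dataset.
print(head(airquality))

# One data frame per month (5 to 9).
s <- split(airquality, airquality$Month)

# Column means within each month; columns with missing values give NA.
monthly_means <- lapply(s, colMeans)
print(monthly_means)
```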
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
$`5`
Ozone Solar.R Wind Temp Month Day
NA NA 11.62258 65.54839 5.00000 16.00000
$`6`
Ozone Solar.R Wind Temp Month Day
NA 190.16667 10.26667 79.10000 6.00000 15.50000
$`7`
Ozone Solar.R Wind Temp Month Day
NA 216.483871 8.941935 83.903226 7.000000 16.000000
$`8`
Ozone Solar.R Wind Temp Month Day
NA NA 8.793548 83.967742 8.000000 16.000000
$`9`
Ozone Solar.R Wind Temp Month Day
NA 167.4333 10.1800 76.9000 9.0000 15.5000
What if we want just the mean of the Wind column?
To get several columns instead, just swap out the column name for a vector of names.
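Both variants might look like this. It is a sketch: the choice of sapply() and of na.rm = TRUE to drop the missing values is an assumption about what the original code did:

```r
s <- split(airquality, airquality$Month)

# Mean of the Wind column only, one value per month.
wind_means <- sapply(s, function(d) mean(d$Wind))
print(wind_means)

# Several columns at once; na.rm = TRUE ignores the missing values.
multi <- sapply(s, function(d) {
  colMeans(d[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})
print(multi)
```

With three columns and five months, sapply() simplifies the result into a 3 x 5 matrix.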
You’re probably wondering what count is doing in this section. Well, we can figure out the answer to the above question in another way, without having to split the dataset.
This code will create a list of all States with a count of how many times each occurred in the rows, sorted by State in ascending order (the default).
These are self-explanatory shortcuts; each one is equivalent to the apply() call shown next to it:
rowSums = apply(x, 1, sum)
rowMeans = apply(x, 1, mean)
colSums = apply(x, 2, sum)
colMeans = apply(x, 2, mean)
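The equivalences can be checked directly. The matrix here is random, with an assumed seed:

```r
set.seed(10)  # assumed seed for reproducibility
x <- matrix(rnorm(200), 20, 10)

# Each shortcut should agree with its apply() counterpart.
same_row_sums  <- all.equal(rowSums(x),  apply(x, 1, sum))
same_row_means <- all.equal(rowMeans(x), apply(x, 1, mean))
same_col_sums  <- all.equal(colSums(x),  apply(x, 2, sum))
same_col_means <- all.equal(colMeans(x), apply(x, 2, mean))

print(c(same_row_sums, same_row_means, same_col_sums, same_col_means))
```

The dedicated functions are optimized, so they are usually the better choice on large matrices.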