Conditionals - Looping F


A conditional statement declares that if a certain condition holds, then a certain event takes place. For example: “If the temperature is above freezing, then I will go outside for a walk.” If the condition is true (the temperature is above freezing), the event occurs (I go for a walk). Conditional statements in R code follow the same logic.

Conditional Statements


Let’s discuss how to create conditional statements in R using three related statements:

if

if (x > 0) {print("x is a positive number") }

else

# At the top level, else must follow the closing brace of the if block on the same line
if (x > 0) { print("x is a positive number")
} else { print("x is either a negative number or zero") }

else if

In some cases, you might want to customize your conditional statement even further by adding the else if statement.

The else if statement comes in between the if statement and the else statement. This is the code structure:

if (condition1) { expr1
} else if (condition2) { expr2
} else { expr3 }
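
As a minimal sketch of this structure, the hypothetical helper classify_sign() below (not part of the original notes) labels a single number. The name and the labels are made up for illustration.

```r
# Hypothetical helper: classify one number with if / else if / else
classify_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

classify_sign(5)
classify_sign(-2)
classify_sign(0)
```

Because if returns the value of the branch that ran, the function needs no explicit assignment or return().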

The else branch is optional. If the condition is false and there is no else, R simply moves on to the next statement, which can be another if, and so on.

if (<condition>) { do something
} else { do something else }
if (<condition1>) { do something
} else if (<condition2>) { do something different
} else { do something different again }

The same logic can be written in two ways. Assign inside each branch:

if (x > 3) { y <- 10 } else { y <- 0 }

Or, because if returns a value, assign the result of the whole expression:

y <- if (x > 3) { 10 } else { 0 }

Here is a worked example. Because x is 5, the condition x > 3 is true and w is set to 48:

x <- 5
if (x>3)  {w <-48} else {w<- x} 
w
[1] 48

For Loop


A for loop has a loop index (here i) that takes each value of a sequence in turn, such as 1:10. When the sequence is exhausted, execution continues with the code after the loop. For example, let’s create a vector xx and loop over it.

index in …

Notice the different ways we can use for (… in …) to loop through the vector:

xx <- c("a","b","c","d")
for (i in 1:4) { print(xx[i]) }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
# Same result: with a single short statement in the body, the {} can be omitted
for (i in 1:4) print(xx[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
for (i in xx){print(i)}
[1] "a"
[1] "b"
[1] "c"
[1] "d"
for (i in xx){print(xx[i])}  # i is "a", "b", ...; xx has no names, so xx[i] is NA
[1] NA
[1] NA
[1] NA
[1] NA
for (i in xx){print(xx)} 
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
for (i in 1:4){print(i)} 
[1] 1
[1] 2
[1] 3
[1] 4
for (i in 1:4){print(xx)}  #as you see this basically prints xx four times
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"
[1] "a" "b" "c" "d"

seq_along

The seq_along() function in R is used to generate a sequence of integers along the length of its argument.

seq_along() takes a vector as input and creates an integer sequence of the same length as that vector; the length is derived from the vector itself. This is very useful when we have a long dataset and don’t know its length in advance: we let seq_along() figure it out.

for (i in seq_along(xx)){ print(xx[i]) }
[1] "a"
[1] "b"
[1] "c"
[1] "d"

Here is another example of seq_along()

seq_along(LETTERS[1:5])
[1] 1 2 3 4 5
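
One practical reason to prefer seq_along() over 1:length(x) shows up with an empty vector. The sketch below uses a hypothetical helper, count_iterations(), that only counts how often a loop body would run.

```r
empty <- character(0)

1:length(empty)    # 1:0 counts down from 1 to 0, so a loop would run twice
seq_along(empty)   # integer(0), so a loop would not run at all

# Hypothetical helper: count how many times a loop body runs for a given index
count_iterations <- function(index) {
  n <- 0
  for (i in index) n <- n + 1
  n
}

count_iterations(1:length(empty))
count_iterations(seq_along(empty))
```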

index as string

The loop variable doesn’t have to be called i, and it doesn’t have to hold integers. Any valid name works, and it takes on each element of the sequence in turn (here, the strings in xx):

for (goose in xx){print(goose)}
[1] "a"
[1] "b"
[1] "c"
[1] "d"

Nested

Of course we can nest for loops, but try to stay at two or three levels at most, because deeper nesting gets hard for someone else to read. Nested loops are good for walking through a matrix. Let’s say we have a matrix with 2 rows and 3 columns and we want to loop through it and read each element.

seq_len

seq_len() is similar to seq_along(), except that it takes a number n and returns the integer sequence 1:n. In the code below:

  • the outer loop takes the number of rows, nrow(x), which happens to be 2, so i loops over 1:2
  • the nested loop (the second loop within it) takes the number of columns, ncol(x), which is 3, so j loops over 1:3
  • for each row i we print every element x[i, j], so the matrix is printed row by row
x <- matrix(1:6,2,3)
for (i in seq_len(nrow(x))) { 
                for (j in seq_len(ncol(x))) {print(x[i,j])} 
                } 
[1] 1
[1] 3
[1] 5
[1] 2
[1] 4
[1] 6

While Loop


  • While loops begin by testing a condition, if it’s true they execute the loop.
  • Once the condition is no longer true they stop.
  • These loops can run forever, so always make sure your condition will eventually become false and stop the loop.

paste

count <- 0 
while (count <5)
        {print(paste("yasha ate",count,"cookies", sep = " ")) 
        count <- count+1 }
[1] "yasha ate 0 cookies"
[1] "yasha ate 1 cookies"
[1] "yasha ate 2 cookies"
[1] "yasha ate 3 cookies"
[1] "yasha ate 4 cookies"

multiple conditions

Here the first loop never executes, because count starts at 2, which is outside the condition. If we reset count to 5, the loop runs until count exceeds 10:

count <- 2
while (count >=3 && count <= 10)
        {print(paste("yasha ate",count,"cookies", sep =" ")) 
        count <- count+1 } 
        
        
count <- 5
while (count >=3 && count <= 10)
        {print(paste("yasha ate",count,"cookies", sep =" "))
        count <- count+1 }
[1] "yasha ate 5 cookies"
[1] "yasha ate 6 cookies"
[1] "yasha ate 7 cookies"
[1] "yasha ate 8 cookies"
[1] "yasha ate 9 cookies"
[1] "yasha ate 10 cookies"

while & if

  • While z stays within the two bounds (the while condition is true), we print it
  • Then we flip a fair coin (a random draw with rbinom)
  • If the flip is 1 we add 1 to z
  • Otherwise we subtract 1 from z
  • So the value fluctuates up and down until it leaves the range. Let’s see:

binomial distribution

rbinom(x,y,z), with the arguments named n, size and prob in R’s documentation:

  • x is the number of observations (random draws) to generate
  • y is the number of trials per observation
  • z is the probability of success on each trial, so for a single coin flip we use rbinom(1,1,0.5)
  • Be careful here: since each step is random, we really don’t know how long it will take for the loop to end.
z <- 5
while (z >=3 && z<=10)
        {print(z) 
        coin <- rbinom(1,1,0.5)
        if(coin == 1) {z <- z+1} 
                else {z <- z-1}
        }
[1] 5
[1] 4
[1] 5
[1] 6
[1] 5
[1] 4
[1] 5
[1] 6
[1] 5
[1] 6
[1] 7
[1] 6
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
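
Since the length of this random walk is unknown, one cautious variant is to fix the random seed and add an upper limit on the number of flips. This is only a sketch: the seed value and the max_steps cap are assumptions added here, not part of the original loop.

```r
set.seed(123)      # assumed seed, only so that reruns give the same path
z <- 5
steps <- 0
max_steps <- 1000  # assumed safety cap

while (z >= 3 && z <= 10 && steps < max_steps) {
  coin <- rbinom(1, 1, 0.5)
  if (coin == 1) {
    z <- z + 1
  } else {
    z <- z - 1
  }
  steps <- steps + 1
}

steps  # number of flips before z left the range (or the cap was hit)
z      # 2 or 11 if the walk left the range
```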

Repeat


repeat initiates an infinite loop. These are not commonly used in statistical applications, but they do have their uses. The usual way to exit one is to call break.

break

  • I initialize x0 to be 1
  • Set a tolerance of 10 to the minus 8 (1e-8)
  • The loop calculates an estimate of x and calls it x1; if the absolute value of (x1 - x0) is less than the tolerance, I break
  • If it is not less than the tolerance, I copy x1 into x0
  • Repeat the process
  • So we keep cycling through the algorithm until the estimates converge. Since we don’t know how long convergence will take, or whether it will happen at all, the loop might run forever. It is safer to use a for loop with a hard limit on the number of iterations.
  • computeEstimate() is a placeholder rather than a function from a package, so the code below will not run as written.
x0 <- 1
tol <- 1e-8
repeat {x1 <- computeEstimate() 
        if(abs(x1-x0) < tol) {break}
        else {x0 <- x1}
        }
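
To make the pattern runnable, the sketch below swaps in a made-up compute_estimate() that performs one Newton step toward the square root of 2, and adds an iteration cap as a safety net. Both the function and max_iter are assumptions for illustration only.

```r
# Hypothetical estimator: one Newton step for sqrt(2), starting from x
compute_estimate <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
max_iter <- 100   # assumed cap so the loop cannot run forever
iter <- 0

repeat {
  x1 <- compute_estimate(x0)
  iter <- iter + 1
  if (abs(x1 - x0) < tol || iter >= max_iter) {
    break
  }
  x0 <- x1
}

x1    # close to sqrt(2)
iter  # number of iterations used
```

Unlike the original skeleton, the estimate here depends on the previous value x0, which is what lets successive estimates converge.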

next

next is used in any type of looping construct when you want to skip an iteration.

  • Here we have a for loop that will run for 100 iterations.
  • But what if I want to skip the first 20 iterations?
  • I can use a simple if statement: while its condition is true, next jumps straight to the following iteration; once the condition is no longer true, the loop continues uninterrupted.
  • This will print the numbers from 21 to 100
for (i in 1:100){
        if(i<=20) {next} 
        print(i) }
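
A smaller version of the same idea, collecting the values instead of printing them, makes the effect easy to check. The vector name kept is an assumption for this sketch.

```r
kept <- integer(0)
for (i in 1:10) {
  if (i <= 3) next    # skip the first three iterations
  kept <- c(kept, i)  # only reached when i is greater than 3
}
kept
```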

return

Signals that a function should stop/exit and return a value. We’ll look at it in Functions.

Looping Functions


In this document we’ll cover Functions and Loops. Both tend to occur together most of the time. We’ll start with functions that loop by nature and don’t need a loop so to speak to make them loop through all the rows of a data frame.

Loop functions come in handy when using the command line. But because of their simplicity we tend to use them when programming as well. Here we’ll cover the following:

Anti_join()


anti_join() is a dplyr function that returns all rows from x without a match in y.

anti_join(join_station_id, station_coord)

Semi_join()


semi_join() is a dplyr function that returns all rows from x with a match in y. Only the columns of x are kept.

semi_join(join_station_id, station_coord)
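
The two filtering joins can be compared side by side on a tiny example. The data frames below are hypothetical stand-ins (they are not the join_station_id and station_coord tables used above), and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data: four stations, only two of which have coordinates
stations <- data.frame(station_id = c(1, 2, 3, 4), rides = c(10, 20, 30, 40))
coords   <- data.frame(station_id = c(2, 4), lat = c(41.9, 42.0))

missing_coords <- anti_join(stations, coords, by = "station_id")
with_coords    <- semi_join(stations, coords, by = "station_id")

missing_coords$station_id  # stations without a match in coords
with_coords$station_id     # stations with a match in coords
names(with_coords)         # only the columns of x; nothing is added from y
```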

Case_match


This function allows you to vectorise multiple switch() statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used.

case_match() is an R equivalent of the SQL “simple” CASE WHEN statement.

Syntax:

case_match(.x, ..., .default = NULL, .ptype = NULL)
  • .x: A vector to match against.
  • ...: A sequence of two-sided formulas: old_values ~ new_value. The right hand side (RHS) determines the output value for all values of .x that match the left hand side (LHS). The LHS must evaluate to the same type of vector as .x. It can be any length, allowing you to map multiple .x values to the same RHS value. If a value is repeated in the LHS, i.e. a value in .x matches to multiple cases, the first match is used. The RHS inputs will be coerced to their common type. Each RHS input will be recycled to the size of .x.
  • .default: The value used when values in .x aren’t matched by any of the LHS inputs. If NULL, the default, a missing value will be used. .default is recycled to the size of .x.

case_match

See a good example in How To: Convert Column

  • Below, otherdf has a key column whose values we are going to match against
df$col2 <- case_match(
        df$col2,
        df$col2 ~ otherdf$key_column
        )

Case_when


case_when

More information on case_match & case_when is found here. Let’s say we just combined two datasets with bind_rows() or rbind(), and we want to change the values in one column so they are consistent across the merged dataset.

One df had a column with the values member or casual, and a few years later the company decided to use Subscriber & Customer instead. So when we merged the two dfs, we ended up with 4 values in a column that should only have two.

To replace the old values with the renamed ones, we can use case_when() like this to replace member with Subscriber and casual with Customer:

#_________________So let's replace values in col(member_casual) that are:
#member & casual with Subscriber & Customer 
all_trips19_20 <- trim_trips19_20 |> 
        mutate(member_casual = case_when(
                member_casual == "member" ~ "Subscriber", 
                member_casual == "casual" ~ "Customer",
                member_casual == "Subscriber" ~ "Subscriber",
                TRUE ~ "Customer"    ))
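
Note that the final TRUE ~ "Customer" line sends every remaining value, including typos and NA, to Customer. A more conservative sketch, on a made-up vector and assuming dplyr is installed, keeps unmatched values unchanged so they can be inspected afterwards.

```r
library(dplyr)

# Hypothetical merged column with old and new labels mixed together
member_casual <- c("member", "casual", "Subscriber", "Customer")

recoded <- case_when(
  member_casual == "member" ~ "Subscriber",
  member_casual == "casual" ~ "Customer",
  TRUE ~ member_casual        # keep every other value unchanged
)

recoded
table(recoded)  # quick check that only two labels remain
```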

lapply


The lapply(list, function, …) does a series of operations. Each of the apply functions will split up some data into smaller pieces, apply a function to each piece, then combine the results.

  • It loops over a list, iterating over each element
  • It applies a function provided by the user to each element of that list
  • Returns a LIST, hence the l in lapply
  • If you provide anything other than a list, it will coerce it to a list and still output a list

read files

We can use lapply to read an entire list of files.

  • Let’s say we have a long list of files to input for analysis
  • Instead of reading each file on its own, or creating a function to loop through the entire list
  • We just create the list of files: wantedFiles
  • Use lapply() to read the entire list
  • Remember lapply ALWAYS gives the output in a list, so the output below is a list of dfs one corresponding to each file read from the list
dataIn <- lapply(wantedFiles, read.csv)
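
Here is a self-contained sketch of that workflow. It writes two small CSV files to a temporary folder first, so the file names and contents are assumptions made up for the example.

```r
# Create a temporary folder with two small, made-up CSV files
tmp_dir <- tempfile("csv_demo")
dir.create(tmp_dir)
write.csv(data.frame(id = 1:2), file.path(tmp_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:5), file.path(tmp_dir, "b.csv"), row.names = FALSE)

# Build the list of files, then read them all in one call
wantedFiles <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
dataIn <- lapply(wantedFiles, read.csv)

# Naming the list by file makes the individual data frames easier to find
names(dataIn) <- basename(wantedFiles)
sapply(dataIn, nrow)

unlink(tmp_dir, recursive = TRUE)  # clean up the temporary folder
```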

mean

Let’s use lapply() to find the mean of each element of a list with two elements, a and b

x <- list(a = 1:5, b = rnorm(10))
x
$a
[1] 1 2 3 4 5

$b
 [1] -0.09554446  0.09151091  0.08724459 -0.16105597  0.03342112 -0.37277600
 [7]  0.96397316 -0.57498766 -3.05720241  1.34645172
lapply(x, mean)
$a
[1] 3

$b
[1] -0.1738965

Let’s create a list of two matrices

x <- list(a = matrix(1:6, 2, 3), b = matrix(1:9, 3, 3))
x
$a
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

$b
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Anonymous function

extract column

Let’s pull the second column from both matrices using lapply and an anonymous function: a function without a name that we define right inside the call, so it exists only for the duration of that call.

lapply(x, function(elf) { elf[,2]})
$a
[1] 3 4

$b
[1] 4 5 6

mean of each subgroup

Let’s use the built-in dataset iris and find

  • Mean of Sepal.Length
  • Mean of the Sepal.Length of the virginica species

See tapply() below for another way to answer the same question.

mean(iris$Sepal.Length)
# y is the iris data split into one data frame per species
y <- split(iris, iris$Species)
lapply(y, function(elf){mean(elf[,"Sepal.Length"])})

sapply


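sapply() returns a simplified result: a vector or matrix instead of a list, where possible. Look at the split example, where s is the airquality data split by Month; now let’s use sapply instead of lapply: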

sapply(s, function(santa){ colMeans(santa[ ,c("Ozone","Wind","Temp")])})  

Let’s say we have a list, splitNames, in which each element is a vector of name parts, and we want the first part of each. We can write a small function that returns the first element [1] of its input and pass it to sapply:

first <- function(santa){santa[1]}
sapply(splitNames, first)
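
The notes do not show how splitNames was built, so the sketch below assumes it comes from strsplit() on two made-up column names. It also contrasts the simplified sapply() result with the list that lapply() returns.

```r
# Assumed input: column names split on the dot
col_names  <- c("Sepal.Length", "Petal.Width")
splitNames <- strsplit(col_names, ".", fixed = TRUE)

first <- function(santa) { santa[1] }

as_vector <- sapply(splitNames, first)  # simplified to a character vector
as_list   <- lapply(splitNames, first)  # same values, but kept as a list

as_vector
class(as_list)
```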

na.rm = TRUE

Some columns contain NA, so their means come out as NA. Let’s remove the NAs from the calculation:

sapply(s, function(santa){ colMeans(santa[ ,c("Ozone","Wind","Temp")], na.rm = TRUE)}) 

apply


apply() is used to evaluate a function over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is a 2-dimensional array). For example, you can take the average of an array of matrices.

Syntax: apply(X, MARGIN, FUN, …)

  • X is an array
  • MARGIN is an integer vector indicating which margins should be retained. Since we are working with an array, we have to tell apply() whether we want row or column statistics, and MARGIN names the dimension we keep: 1 for rows, 2 for columns. To calculate the mean of each column we specify 2: the columns are preserved and the rows are collapsed, so we get one mean per column. Conversely, to take the mean of each row we specify 1: each row is preserved and the columns are collapsed.
  • FUN is the function to perform
  • … args to be passed to FUN
# we preserve 2=columns that mean we get the mean of each column
x <- matrix(rnorm(200), 20, 10)
apply(x, 2, mean) 
 [1]  0.05414183 -0.04547356  0.07998758  0.14985420  0.18250894  0.01788141
 [7] -0.11344563  0.02771391 -0.06979407  0.29995799
# now we are preserving the rows so we are taking the mean of each row 
apply(x, 1, mean) 
 [1]  0.40544838  0.41852062  0.88402390  0.43815163  0.15892825  0.47504460
 [7] -0.11451130  0.44380040  0.09284149  0.20627488 -0.55106774  0.06142513
[13] -0.20506548  0.29201586 -0.41450291 -0.17892319 -0.05640784 -0.52402650
[19] -0.37012584 -0.29517914
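
The … part of the syntax is easiest to see with a function that needs an extra argument. In this sketch the seed is an assumption added for reproducibility, and probs is handed through apply() to quantile().

```r
set.seed(1)   # assumed seed for reproducibility
x <- matrix(rnorm(200), 20, 10)

# Extra arguments after FUN are passed on to FUN (here: probs goes to quantile)
q <- apply(x, 1, quantile, probs = c(0.25, 0.75))

dim(q)        # 2 rows (one per quantile) by 20 columns (one per row of x)
rownames(q)
```

When FUN returns more than one value, apply() stacks the results as columns, which is why the rows of x end up along the second dimension of q.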

What if we want a vector of the means of the variables ‘Sepal.Length’, ‘Sepal.Width’, ‘Petal.Length’, and ‘Petal.Width’ from the iris dataset?

apply(iris[, 1:4], 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333 

tapply


tapply() is a combination of split() and sapply() for vectors only. It is used to apply a function over subsets of a vector: you split your data into groups based on the value of some variable, then apply a function to the members of each group.

Syntax: tapply(X, INDEX, FUN, …, simplify) where

  • X is a vector
  • INDEX is a factor or a list of factors
  • FUN is the function to be applied
  • … other arguments to be passed to the function
  • simplify: should we simplify the results

For example, to calculate the total insect count for each spray type, we use tapply with sum, grouping by the spray column:

tapply(InsectSprays$count, InsectSprays$spray, sum)
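
To see the split() plus sapply() connection directly, the sketch below computes the per-spray totals both ways on the built-in InsectSprays data, whose grouping column is named spray, and checks that they agree.

```r
by_tapply <- tapply(InsectSprays$count, InsectSprays$spray, sum)
by_split  <- sapply(split(InsectSprays$count, InsectSprays$spray), sum)

by_tapply
names(by_split)

# Same numbers either way; tapply returns a 1-d array, sapply a named vector
all.equal(as.vector(by_tapply), unname(by_split))
```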

factor


Refer to Factors – R for Data Science (2e) for more information.

Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

We’ll start by motivating why factors are needed for data analysis and how you can create them with factor().

Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse. It provides tools for dealing with categorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.

gl() is used to define groups

## Simulate some data
x <- c(rnorm(10), runif(10), rnorm(10, 1))

To define the groups with a factor variable we use gl(3, 10): 3 levels, each repeated k = 10 times, which gives 30 values to match the length of x.

f <- gl(3, 10)
f
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
tapply(x,f,mean)
         1          2          3 
-0.4400803  0.4458221  0.9380902 

create list of factors

To create a factor you must start by creating a list of the valid levels:

x1 <- c("Dec", "Apr", "Jan", "Mar")

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Sort outcome

sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Silent NAs

  • If we use the same process on x2 with the same levels as before
  • notice that NA is inserted in the 3rd position, where Jam wasn’t found in the levels
  • so any value not in the levels is silently converted to NA
x2 <- c("Dec", "Apr", "Jam", "Mar")

y2 <- factor(x2, levels = month_levels)
y2
[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
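
Because the conversion is silent, it can help to look for invalid values before or after building the factor. This sketch uses base R only; month.abb is R’s built-in vector of the same twelve abbreviations used for month_levels above.

```r
month_levels <- month.abb   # built-in constant: "Jan" ... "Dec"
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values that are not valid levels would silently become NA
setdiff(x2, month_levels)

y2 <- factor(x2, levels = month_levels)
which(is.na(y2))   # positions that were converted to NA
```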

Omit levels

  • What happens if we omit the levels altogether?
  • The levels are taken from the data and sorted alphabetically
factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar

count of each group

We’ll use the flags dataset here again. The “landmass” variable takes on integer values between 1 and 6.  Let’s use the first line of code to see how many flags fall into each group

table(flags$landmass)
 1  2  3  4  5  6 
31 17 35 52 39 20 
table(flags$animate) 
#__this tells us how many flags contain an animated object in their flag
  0   1 
155  39 

sub within group

What if we want to calculate the mean of the animate variable separately for each of the six landmass groups? Because animate is 0 or 1, the mean gives the proportion of flags that contain an animate image WITHIN each landmass group.

So we will place the animate variable first in the function, followed by the landmass variable, like this:

tapply(flags$animate, flags$landmass, mean) 
     1         2         3         4         5         6
0.4193548 0.1764706 0.1142857 0.1346154 0.1538462 0.3000000
#___ landmass 1 = N. America contains the highest proportion of flags
# with an animate image

sub within group

What if we want to find the population (in millions) for countries with and without the color red on their flags? In the output below, note that 0 indicates flags WITHOUT red in it and 1 for flags WITH red in it.

tapply(flags$population, flags$red, summary)

$`0`
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00    0.00    3.00   27.63    9.00  684.00 

$`1` 
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.0     0.0     4.0    22.1    15.0  1008.0 

population for landmasses

What if we want to find the population (in millions) for all landmasses?

 tapply(flags$population, flags$landmass, summary)
 
 $`1` 
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.00    0.00    0.00   12.29    4.50  231.00
 
 $`2` 
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.00    1.00    6.00   15.71   15.00  119.00
 
 $`3`
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.00    0.00    8.00   13.86   16.00   61.00
 
 $`4`
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.000   1.000   5.000   8.788   9.750  56.000
 
 $`5`
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.00    2.00   10.00   69.18   39.00 1008.00 
 
 $`6`
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.00    0.00    0.00   11.30    1.25  157.00

mpg per cylinder

Using the mtcars dataset, we want to calculate the average miles per gallon (mpg) by number of cylinders in the car (cyl). Here are multiple ways this can be done:

tapply(mtcars$mpg, mtcars$cyl, mean)
sapply(split(mtcars$mpg, mtcars$cyl), mean) 

with

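Watch how the same code above can be written using with(). with() evaluates an expression using the columns of a data frame (or the elements of a list) as variables, so we don’t have to repeat mtcars$ each time.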

  • We need the average of mpg for each cylinder group.
with(mtcars, tapply(mpg, cyl, mean))

hp per cylinder

Using the mtcars dataset again, what’s the difference in hp between 4 and 8 cylinders?

tapply(mtcars$hp, mtcars$cyl, mean)
   4         6         8 
82.63636 122.28571 209.21429 
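
The table gives the group means; the difference itself takes one more step. A small sketch, with hp_by_cyl and hp_diff as names chosen here for illustration:

```r
hp_by_cyl <- tapply(mtcars$hp, mtcars$cyl, mean)

# Index the result by group name, then subtract
hp_diff <- unname(hp_by_cyl["8"] - hp_by_cyl["4"])
hp_diff
```

With the means shown above, this is 209.21429 - 82.63636, or roughly 126.58 hp.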

subgroup within group

Let’s use the built-in dataset iris and find

  • Mean of Sepal.Length
  • But for the virginica species. See lapply() above for another way to answer the same question.
tapply(iris$Sepal.Length, iris$Species, mean) 
setosa versicolor  virginica 
5.006      5.936      6.588 

mapply


mapply() is kind of a fancy apply: it applies a function in parallel over a set of arguments. lapply() and the others iterate over a single R object, while mapply() iterates over multiple R objects in parallel.

  • The mapply() function can be used to automatically “vectorize” a function.
  • This means it can take a function that typically only accepts single arguments and call it element by element over vector arguments.
  • This is often needed when you want to plot functions. I won’t discuss that here.

Syntax: mapply(FUN, …, MoreArgs, SIMPLIFY)

#____Here is a tedious way to create a list
list(rep(1, 3), rep(2, 2), rep(3, 1)) 
[[1]]
[1] 1 1 1

[[2]]
[1] 2 2

[[3]]
[1] 3
#another way of doing the same
mapply(rep, 1:3,3:1) 
[[1]]
[1] 1 1 1

[[2]]
[1] 2 2

[[3]]
[1] 3
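
Two more small sketches of the same mechanism: an anonymous function applied pairwise over two vectors, and MoreArgs for an argument that stays fixed across all calls. The values are made up for illustration.

```r
# Element-wise over two vectors: f(1, 4), f(2, 5), f(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums

# MoreArgs holds arguments that stay the same for every call
reps <- mapply(rep, times = 1:3, MoreArgs = list(x = 42))
reps
```

The first call simplifies to a vector because every result has length 1; the second stays a list because the results have different lengths.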

vapply


vapply() allows you to specify explicitly the format of the result

  • if the result doesn’t match the format you specify,
  • vapply() will throw an error, causing the operation to stop.
  • Look at the first line of code below. We’ve used it before with lapply() and sapply(), so you know it returns a list of the unique values in each column. But what if you had forgotten and thought it returned the number of unique values?
sapply(flags, unique)

Now suppose we expect each call to unique() to return a numeric vector of length 1. We state that expectation with vapply(); look at the code below and let’s see what the output is

vapply(flags, unique, numeric(1))
Error in vapply(flags, unique, numeric(1)) : values must be length 1,
but FUN(X[[1]]) result is length 194

Now let’s try it again with something else. We already displayed the class() of all the columns in flags with sapply(); let’s do it again

sapply(flags, class) 
       name    landmass        zone        area  population    language    religion        bars     stripes     colours         red 
"character"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer" 
      green        blue        gold       white       black      orange     mainhue     circles     crosses    saltires    quarters 
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character"   "integer"   "integer"   "integer"   "integer" 
   sunstars    crescent    triangle        icon     animate        text     topleft    botright 
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character" "character" 

It is obvious from the output that the columns in the data set have different classes. What they share is that class() returns a character vector of length 1 for each of them. Let’s state that expectation with vapply(); since it holds, the result is the same as with sapply():

vapply(flags, class, character(1)) 
       name    landmass        zone        area  population    language    religion        bars     stripes     colours         red 
"character"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer"   "integer" 
      green        blue        gold       white       black      orange     mainhue     circles     crosses    saltires    quarters 
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character"   "integer"   "integer"   "integer"   "integer" 
   sunstars    crescent    triangle        icon     animate        text     topleft    botright 
  "integer"   "integer"   "integer"   "integer"   "integer"   "integer" "character" "character" 
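
The flags data comes from outside base R, so the sketch below uses the built-in iris data instead (an assumption made to keep it runnable). It shows the count of unique values per column, and how a wrong template can be caught with tryCatch().

```r
# Number of unique values per column; each result must be a single integer
n_unique <- vapply(iris, function(col) length(unique(col)), integer(1))
n_unique

# A mismatch between the template and the actual result is an error
res <- tryCatch(
  vapply(iris, unique, numeric(1)),
  error = function(e) "vapply stopped with an error"
)
res
```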

split


split(x, f, drop)

  • x is a vector or list or df
  • f is a factor or a list of factors
  • drop declares whether empty factor levels should be dropped

split() is usually used with lapply().

  • The idea is you take a data structure, split it into subsets then perform lapply over those subsets.
  • It kinda reminds me of group_by so you can calculate the mean or whatever you want to calculate on the groups.
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
split(x, f)
$`1`
 [1]  0.30690859 -0.07689829 -0.32973287  1.87096565 -0.12805030  0.94880053
 [7]  0.09686608 -0.04668742 -1.16425495 -0.29149315

$`2`
 [1] 0.47301698 0.12071165 0.09234822 0.47730813 0.50042387 0.47376869
 [7] 0.75344793 0.10455063 0.02543205 0.06343151

$`3`
 [1]  0.78452350  1.20306303  0.54648064  0.03639915  1.80512956  0.68396469
 [7]  2.03985863  0.53653137  1.92598369 -0.24756685

Or something like this:

lapply(split(x, f), mean)
$`1`
[1] 0.1186424

$`2`
[1] 0.308444

$`3`
[1] 0.9314367

As we stated above, split() breaks the data into groups, and then you can perform some kind of calculation over the groups using lapply(). Note that the grouping column in InsectSprays is named spray:

spIns <- split(InsectSprays$count, InsectSprays$spray)
spCount <- lapply(spIns, sum)

Split df


  • We used the built-in dataset “airquality” earlier let’s use it here with split().
  • We want to split it by the Month variable so we have a separate data frame for each month.
  • Kinda like group_by
library(datasets)
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

split by month

s <- split(airquality,airquality$Month)

mean of columns

lapply(s, colMeans)
$`5`
   Ozone  Solar.R     Wind     Temp    Month      Day 
      NA       NA 11.62258 65.54839  5.00000 16.00000 

$`6`
    Ozone   Solar.R      Wind      Temp     Month       Day 
       NA 190.16667  10.26667  79.10000   6.00000  15.50000 

$`7`
     Ozone    Solar.R       Wind       Temp      Month        Day 
        NA 216.483871   8.941935  83.903226   7.000000  16.000000 

$`8`
    Ozone   Solar.R      Wind      Temp     Month       Day 
       NA        NA  8.793548 83.967742  8.000000 16.000000 

$`9`
   Ozone  Solar.R     Wind     Temp    Month      Day 
      NA 167.4333  10.1800  76.9000   9.0000  15.5000 

certain columns

What if we want just the mean of Wind column

lapply(s, function(santa){mean(santa[,"Wind"])})
$`5`
[1] 11.62258

$`6`
[1] 10.26667

$`7`
[1] 8.941935

$`8`
[1] 8.793548

$`9`
[1] 10.18

multiple columns

Just swap out the column name with a vector of column names

lapply(s, function(santa){ colMeans(santa[ ,c("Wind","Temp")])})
$`5`
    Wind     Temp 
11.62258 65.54839 

$`6`
    Wind     Temp 
10.26667 79.10000 

$`7`
     Wind      Temp 
 8.941935 83.903226 

$`8`
     Wind      Temp 
 8.793548 83.967742 

$`9`
 Wind  Temp 
10.18 76.90 

split df

  • I have a df with 3000 rows and a column named State.
  • I want to know how many unique states are in the df
  • and how many times each state appears in the rows
  • Instead of using group_by I’ll split it on State
  • I’ll end up with a list of 54 dfs; the row count of each df (nrow) is the number of occurrences of that state in the total dataset
  • I confirmed the values with group_by
  • gooddata is the data
  • gooddata$State is the column used to split
  • count is the number of rows in each group
#use split to create 54 dfs and a count of occurrences in each 
splitdata <- sapply(split(gooddata, gooddata$State),count)
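
For a version that runs without any packages, here is a sketch on a tiny made-up stand-in for gooddata. It uses nrow() on each piece and compares the result with table(); the data values are assumptions for illustration.

```r
# Hypothetical stand-in for gooddata
gooddata <- data.frame(
  State = c("TX", "CA", "TX", "NY", "CA", "TX"),
  value = 1:6
)

# One data frame per state, then the number of rows in each
counts_split <- sapply(split(gooddata, gooddata$State), nrow)
counts_split

# table() gives the same counts directly from the column
counts_table <- table(gooddata$State)
counts_table
```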

count

You’re probably wondering what count is doing in this section. Well, we can answer the question above in another way, without having to split the dataset.

This code creates a data frame of all the states and the count of how many times each occurred in the rows. With sort = TRUE the result is ordered by count, largest first; without it, the rows are ordered by State.

#creates a data frame of states and counts
testcount <- gooddata |>
        count(State, sort = TRUE)

Col/Row Functions


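These are self-explanatory shortcuts: each one is equivalent to the apply() call shown next to it, and is usually much faster on large matrices.

rowSums

rowSums(x) is equivalent to apply(x, 1, sum)

rowMeans

rowMeans(x) is equivalent to apply(x, 1, mean)

colSums

colSums(x) is equivalent to apply(x, 2, sum)

colMeans

colMeans(x) is equivalent to apply(x, 2, mean)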
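
A quick check of these equivalences on a small matrix. The matrix is made up for illustration, and all.equal() is used because the shortcuts return doubles while sum() on integers returns integers.

```r
x <- matrix(1:6, 2, 3)

row_sums_ok  <- all.equal(rowSums(x),  apply(x, 1, sum))
row_means_ok <- all.equal(rowMeans(x), apply(x, 1, mean))
col_sums_ok  <- all.equal(colSums(x),  apply(x, 2, sum))
col_means_ok <- all.equal(colMeans(x), apply(x, 2, mean))

c(row_sums_ok, row_means_ok, col_sums_ok, col_means_ok)
```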