Functions


  • Functions are a very important part of R.
  • You can use a text editor to create a function.
  • Functions allow the user to accomplish more complicated calculations.
  • Repetitive calculations and complicated processes can be grouped into a function and set aside from the main code, which makes the main purpose of the code easier to read and understand.

Functions are bodies of reusable code that perform specific tasks in R. A function call begins with the function name, such as print or paste, usually followed by one or more arguments in parentheses. For more information about a function, type ?functionName

I’ll start by opening a new R script file in RStudio and calling my first function.

boring_function('My first function')
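
The definition of boring_function is not shown here. A minimal sketch, assuming it simply returns whatever it is given (the body is an assumption for illustration, not necessarily the original definition):

```r
# Hypothetical definition: returns its argument unchanged
boring_function <- function(x) {
  x
}

result <- boring_function('My first function')
print(result)
```

With a definition like this in place, the args() and source-viewing calls further down have something to work on.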

documentation

If you want to read the documentation for a function or a package or … just type

?paste

list args

To list the arguments of a function just use

args(boring_function)

view code

To view the code of a function, just type the function name without the parentheses

boring_function

Let’s start with some general definitions before we dive into examples of functions.

R Objects


  • Functions are created using the function() directive and are stored as R objects; they are objects of class “function”.
  • They are first class objects.
  • You can pass functions to other functions and can define functions within functions.
  • The return value of a function is the last expression in the function body.

Arguments

  • Arguments are variables that are either declared or passed into a function.
  • An argument is information that a function in R needs in order to run.
  • First, let’s see how many arguments lm() has. We use args(lm) below, and as you can see the list is extensive:
args(lm)
function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
NULL

Argument Matching

Functions have named arguments which potentially have default values.

  • The formal arguments are the arguments included in the function definition
  • The formals() function returns a list of all the formal arguments of a function
  • Not every function call in R makes use of all the formal arguments
  • Function arguments may be missing or have default values
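
As a small illustration of formals(), here is a sketch that inspects the formal arguments of sd(); the variable name sd_formals is just a placeholder:

```r
sd_formals <- formals(sd)        # a named list of sd()'s formal arguments
print(names(sd_formals))         # the argument names
print(sd_formals$na.rm)          # the default value of na.rm
```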

name matching

R function arguments can be matched positionally or by name.

  • The calls below are all equivalent calls to sd(), the standard deviation function. It takes two inputs: x, a numeric vector, and na.rm, which controls whether NA values are removed before the calculation.
  • The default value for na.rm is FALSE, so leaving it out does not cause an error.
  • We can pass mydata positionally, or name it explicitly as x = mydata in the call.
  • If we name the arguments, we can switch their order.

mydata <- rnorm(100)  #here we set the data vector
sd(mydata)
[1] 1.121947
sd(x = mydata)
[1] 1.121947
sd(x = mydata, na.rm = FALSE)
[1] 1.121947
sd(na.rm = FALSE, x = mydata)
[1] 1.121947
sd(na.rm = FALSE, mydata)
[1] 1.121947

positional matching

You can mix positional matching with matching by name, as we did above.

  • If we name all arguments except one, R matches the unnamed one to the last remaining argument.
  • If we take one named argument out of order, R matches all remaining arguments positionally.
  • This comes in handy when a function has a long list of arguments and we don’t remember the order: we name the ones we know and let R match the rest positionally.

Now let’s call lm() in a couple of ways (remember that we already listed the arguments of lm()). In the first call:

  • We put the data first and named it so R recognizes it.
  • We continued with positional matching by putting the formula second; R gives it to the first unmatched argument, formula.
  • subset is next in the position list, but we used model = FALSE instead, which is matched by name.
  • subset is still next in the position list, so the unnamed 1:100 that follows is matched to subset.
  • We omitted the rest of the arguments, as some are not necessary and some have default values.

In the second call:

  • We positionally matched the formula, data, and subset.
  • Then we skipped all the way to model, matched by name, and left out the rest.

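# lm() needs a data frame with columns x and y, so mydata is redefined here
mydata <- data.frame(x = rnorm(100), y = rnorm(100))
lm(data = mydata, y ~ x, model = FALSE, 1:100)
lm(y ~ x, mydata, 1:100, model = FALSE)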

partial match

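When working at the command line we often want to be quick, so we type partial argument names. R matches arguments in three passes: first by exact name, then by partial name, and finally by position. A partial name is accepted only if it matches exactly one formal argument, so be careful when several arguments share the same first 2-3 letters: an ambiguous partial name produces an error rather than a guess. Any arguments still unmatched after the name passes are matched positionally.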
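
A minimal sketch of partial matching. The function show_value and its arguments are made up for illustration:

```r
# Hypothetical function with two arguments that share the prefix "ver"
show_value <- function(value, verbose = FALSE, version = 1) {
  if (verbose) paste("value:", value, "version:", version) else value
}

# "verb" can only mean verbose, so the partial name is accepted
unique_match <- show_value(10, verb = TRUE)

# "ver" fits both verbose and version, so R stops with an error
ambiguous <- tryCatch(show_value(10, ver = TRUE),
                      error = function(e) "error: ambiguous partial match")

print(unique_match)
print(ambiguous)
```

In scripts it is usually safer to spell argument names out in full, so that adding a new argument later cannot turn a unique prefix into an ambiguous one.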

lazy evaluation

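Arguments to functions are evaluated lazily, meaning they are evaluated only when they are needed. Sometimes a function has several formal arguments but its body only uses one or two of them. So what happens to the 3rd, 4th, and 5th arguments if the body only uses 2? If we pass just those 2, the function is satisfied: the unused arguments are never evaluated, so leaving them out causes no error.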
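
A short sketch of lazy evaluation, using two made-up functions, uses_one and uses_both:

```r
# b is a formal argument, but the body never uses it
uses_one <- function(a, b) {
  a^2
}
ok <- uses_one(2)      # works even though b was never supplied

# here the body does use b, so the missing argument triggers an error
uses_both <- function(a, b) {
  a + b
}
failed <- tryCatch(uses_both(2), error = function(e) conditionMessage(e))

print(ok)
print(failed)
```

The error in the second call only appears when the body reaches b, not when the function is called.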

… argument

The … argument indicates a variable number of arguments that are usually passed on to other functions. It is used when extending another function and you don’t want to type the entire argument list. This often happens in plotting when those functions have a long extensive list of arguments. So we just enter the ones we are using and use … for the rest.

Any arguments that appear after the … on the argument list MUST be named explicitly and not matched partially or positionally.

myplot <- function(x, y, type = 'l', ...){
        plot(x,y, type = type, ...)}

We also use … when the number of arguments cannot be known in advance. For example, a function for pasting or concatenating cannot know how many items the user plans to pass. As you see below, the argument list of paste() starts with …, which marks the place where the items to be pasted go. The same is true for the argument list of cat().

args(paste)
function (..., sep = " ", collapse = NULL, recycle0 = FALSE) 
NULL
args(cat)
function (..., file = "", sep = " ", fill = FALSE, labels = NULL, 
    append = FALSE) 
NULL

Symbol Binding


How does R know which value to assign to which symbol? How does it know what value to assign to lm? Why doesn’t it assign the value to lm that’s in the stats package? So when I call lm at another part of my code which one does it use?

Let’s create a new function and call it lm(). That could cause an issue, since there is already another function called lm() in the stats package. R resolves this through environment binding.

lm <- function(x) { x * x}

environment binding

When R tries to bind a value to a symbol, it searches through a series of environments to find the appropriate value. Typing search() in R shows that series:

search()
[1] ".GlobalEnv"        "package:stats"     "package:graphics" 
[4] "package:grDevices" "package:utils"     "package:datasets" 
[7] "package:methods"   "Autoloads"         "package:base"     

So when R looks up a symbol, it searches the list above in the order shown:

  1. It starts with the global environment. Global means the workspace where we made all our definitions up to this point, so this is where R looks first.
  2. If there is no match in the global environment, it moves on to the attached packages. You can see that second on the list is the stats package, where the original lm() lives.
  3. The search continues all the way down to the base package.
  4. It’s important to load packages in the right order if you think there might be conflicts.
  5. It is also almost impossible to know every variable defined in each package.
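
A minimal sketch of the lookup order in action, assuming we define our own lm in the workspace. The names masked and original are placeholders:

```r
lm <- function(x) { x * x }      # our own lm, defined in the workspace

masked <- lm(4)                  # the workspace is searched first, so this is our lm
original <- stats::lm            # the stats version is still reachable with ::

print(masked)
print(names(formals(original))[1:2])
```

Using the package::name form is one way to sidestep conflicts without relying on the order of the search list.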

Free Variables


In R, a free variable is a symbol that appears in a function but is not a formal argument or a local variable defined within the function body. These free variables are not explicitly passed as arguments to the function, yet they influence its behavior. The scoping rules in R determine how values are associated with these free variables.

Lexical Scoping

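Lexical scoping means that the values of free variables are searched for in the environment in which the function was defined. That environment is the function’s enclosing (parent) environment. Note that parent.frame() in R refers to something different: the environment from which the function was called.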

Scoping rules are what make R different from S, its parent language. So, what are the scoping rules?

  • Scoping rules determine how a value is associated with a free variable in a function.
  • R uses lexical scoping, also called static scoping. A common alternative is dynamic scoping.
  • Scoping is related to how R uses the search list to bind a value to a symbol.
  • Lexical scoping turns out to be particularly useful for simplifying statistical computations.

Look at the function below:

  • It has 2 formal arguments, x and y.
  • In the body there is another variable, z.
  • In this case z is called a free variable.
  • The scoping rules of a language define how values are assigned to free variables.
  • Free variables are not formal arguments and are not local variables (assigned inside the function body).

f <- function(x, y){ x^2 + y / z }
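
A quick sketch of how the free variable z gets its value, assuming we define z in the same environment where f was defined. The value 2 is arbitrary:

```r
f <- function(x, y) { x^2 + y / z }

z <- 2                 # z lives in the environment where f was defined
result <- f(3, 4)      # 3^2 + 4 / 2
print(result)
```

Without a z anywhere along the search path, the same call would stop with an "object not found" error.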

environment

Let’s define environment as follows:

  • An environment is a collection of (symbol, value) pairs, e.g. x is a symbol and 3.14 might be its value.
  • Every environment has a parent environment, and an environment can have multiple children.
  • The only environment without a parent is the empty environment.
  • A function + an environment = a closure, or function closure.
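
These definitions can be poked at directly from the console. A small sketch, with e and the other variable names chosen only for illustration:

```r
e <- new.env()                      # a fresh environment
assign("x", 3.14, envir = e)        # store the pair (x, 3.14) in it
x_value <- get("x", envir = e)      # look the symbol up again

# the base package environment sits at the end of the chain:
# its parent is the empty environment
base_parent_is_empty <- identical(parent.env(baseenv()), emptyenv())

print(x_value)
print(base_parent_is_empty)
```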

lexical rules

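Similar to symbol binding above, R follows these steps when searching for the value of a free variable:

  • It looks first in the environment in which the function was defined.
  • If the value is not found there, the search continues in the parent environment, and so on up the sequence of parents until it reaches the top-level environment, which is usually the global environment (the workspace).
  • After the top level, the search continues down the search list until R finds the value or hits the empty environment. If the value is still not found, R gives an error.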
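
The difference between lexical and dynamic scoping shows up when a function with a free variable is called from inside another function. A minimal sketch with made-up values:

```r
y <- 10

g <- function(x) {
  x * y          # y is a free variable in g
}

f <- function(x) {
  y <- 2         # local to f, invisible to g
  y^2 + g(x)
}

result <- f(3)   # 2^2 + 3 * 10 under lexical scoping
print(result)
```

Under lexical scoping g finds y where g was defined, so y is 10 and the result is 34. Under dynamic scoping g would look in the calling function f, find y = 2, and the result would be 10.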

Below I’ll create a function make.power that defines another function inside it and returns that inner function as its value.

  • If you look inside make.power you see another function, pow.
  • The inner (child) function has a free variable n that is not defined in pow.
  • n is defined in the outer (parent) function, as it is the argument of make.power.
  • The value of n is supplied when we call make.power(3); here n = 3.
  • The last expression in the outer function is pow, so in a sense we are returning a function as the value of another function.
  • So if we call make.power(3) it returns the function pow, as you see below.

make.power <- function(n){
                        pow <- function(x){x^n} 
                        pow}
make.power(3)
function(x){x^n}
<environment: 0x0000027117e917b0>
  • make.power() returns a function of x.
  • Let’s assign the function returned by make.power(3) to cube.
  • Let’s assign another one to square, using n = 2.
  • Then let’s call those functions.
cube <- make.power(3)
square <- make.power(2)
cube(3)
[1] 27
square(3)
[1] 9

function closure

How do we know what is in the environment of a function? environment() returns the function’s environment, and ls() lists the objects in it.

ls(environment(cube))
[1] "n"   "pow"
get("n", environment(cube))
[1] 3

optimization

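Optimization routines in R, such as optim(), nlm(), and optimize(), require you to pass a function whose argument is a vector of parameters, for example a log-likelihood. That objective function often depends on other things besides its parameters, such as the data, and lexical scoping lets us write it in a clean, readable manner. Sometimes you also want to fix a certain parameter and optimize over the other parameters.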

Function as Argument


You can pass a function as an argument to another function, just as you can pass data to a function. A function can also build and return another function, which is what a constructor function does. If this seems too complicated at first, look at the examples that follow, which take you through it step by step.

constructor function

The constructor function constructs the objective function.

If you are going to use a complicated function where most of the calculations and number crunching is done with only minimal input, then you can create one function inside another. The hopes are that the inside function does most of the heavy lifting, using numerous arguments and calculations that need to be done regardless of what values are passed. This way instead of calling the parent function with a whole list of arguments, we can just limit that input and still not give up functionality.

negLogLik

Here is the negative log-likelihood function being minimized:

  • Inside the constructor function I define a function that takes an argument called p for the parameters.
  • This is the parameter vector that I want to optimize over.
  • The inner function returns the negative log-likelihood for a normal distribution, and I want to fit my data to this normal distribution.
  • A normal distribution has two parameters, the mean, mu, and the standard deviation, sigma. Those are the two parameters that I want to optimize over.
  • I define the log-likelihood and take the negative of it, so that I can minimize it.
  • The constructor function returns the inner function as its return value.

make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) {
                        params <- fixed 
                        function(p) {
                                params[!fixed] <- p 
                                mu <- params[1]
                                sigma <- params[2]
                                a <- -0.5*length(data)*log(2*pi*sigma^2)
                                b <- -0.5*sum((data-mu)^2) / (sigma^2) 
                                -(a + b)         }
                        } 
set.seed(1); normals <- rnorm(100, 1, 2)
nLL <- make.NegLogLik(normals)
nLL
function(p) {
                                params[!fixed] <- p 
                                mu <- params[1]
                                sigma <- params[2]
                                a <- -0.5*length(data)*log(2*pi*sigma^2)
                                b <- -0.5*sum((data-mu)^2) / (sigma^2) 
                                -(a + b)         }
<bytecode: 0x000002711924dd20>
<environment: 0x0000027118c9bab0>
ls(environment(nLL))
[1] "data"   "fixed"  "params"
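
One way to put the constructor to work is to hand the returned function to R’s optimizers. The sketch below repeats the constructor so that it runs on its own; the starting values c(mu = 0, sigma = 1), the fixed value sigma = 2, and the search interval c(-1, 3) are arbitrary choices for illustration:

```r
make.NegLogLik <- function(data, fixed = c(FALSE, FALSE)) {
  params <- fixed
  function(p) {
    params[!fixed] <- p
    mu <- params[1]
    sigma <- params[2]
    a <- -0.5 * length(data) * log(2 * pi * sigma^2)
    b <- -0.5 * sum((data - mu)^2) / (sigma^2)
    -(a + b)
  }
}

set.seed(1)
normals <- rnorm(100, 1, 2)

# optimize over both parameters with optim()
nLL <- make.NegLogLik(normals)
both <- optim(c(mu = 0, sigma = 1), nLL)$par

# hold sigma fixed at 2 and optimize over mu only with optimize()
nLL_mu <- make.NegLogLik(normals, fixed = c(FALSE, 2))
mu_only <- optimize(nLL_mu, c(-1, 3))$minimum

print(both)
print(mu_only)
```

The estimates should land close to the sample mean and standard deviation of normals, since those are the maximum likelihood estimates for a normal distribution.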

function(func)

Let’s create a function so if a function is passed into the func argument, and some data (like a vector) is passed into the dat argument, the evaluate() function will return the result of dat being passed as an argument to func.

In other words func() will process dat and pass the results to the outer function: evaluate()

So let’s just worry about creating the outer function for now, and we’ll create the constructor function later, for now we’ll use an existing function to pass into the outer function which we’ll call evaluate()

evaluate <- function(func, dat){ 
        # Remember: the last expression evaluated will be returned! 
        func (dat) }

# now let's pass a function as func, let's say we use standard deviation: sd()
# and let's pass it a vector for dat: c(1.4,3.6,7.9,8.8)
evaluate(sd, c(1.4,3.6,7.9,8.8))
[1] 3.514138

anonymous functions

In the example above we passed sd() as the argument func. sd() is already a predefined function. But what if we want to use our own function instead?

If you think about it, evaluate() already gives us the mechanism: func is just the name of its first argument, and whatever function we pass in is called as func(dat). Above, that function happened to be the predefined sd().

Now let’s make up our own function and define it on the go. Remember that evaluate(func, dat) takes two arguments (func, dat), the left argument/first argument is a function. So instead of passing it a pre-existing function let’s say we want to perform x+1 as the function.

As you already know, the code for such a function is function(x){ x + 1 }. A function that is used without first being given a name is called an anonymous function. To use it, just put it in as the first argument of evaluate(func, dat) and supply a second argument for dat, let’s say 6, like this:

evaluate(function(x){x+1},6)
[1] 7

Binary Operator

If you intend to perform a short, simple calculation over and over again, you can create your own binary operator instead of an ordinary function. Let’s say I want an operator that multiplies two numbers and adds one to the result.

The syntax is simple: “%whatever%”, where whatever can be any name you choose.

binary operator 1

Let’s create this animal:

"%timesadd%" <- function(left, right){ 
                        (left * right) + 1 } 
# The parentheses aren't necessary, since the multiplication is performed first
4 %timesadd% 3
[1] 13

binary operator 2

Let’s create one that pastes, instead of using the paste() function and all its arguments.

"%p%" <- function(left, right){
                   paste(left,right)}
"Yasha" %p% "loves" %p% "His Chicken!"
[1] "Yasha loves His Chicken!"
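
Two details are worth knowing about custom operators: they are ordinary functions that can be called by name with backticks, and every %name% operator binds tighter than * and +. A short sketch that repeats the operator so it runs on its own:

```r
"%timesadd%" <- function(left, right) {
  left * right + 1
}

# the operator is an ordinary function and can be called by name
same_call <- `%timesadd%`(4, 3)

# %timesadd% is evaluated before *, so this is 2 * (4 %timesadd% 3)
chained <- 2 * 4 %timesadd% 3

print(same_call)
print(chained)
```

When mixing a custom operator with arithmetic, adding explicit parentheses keeps the intent clear.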

Coding Standards


  1. Use a text editor
  2. Indent your code
  3. Limit the width (80 columns)
  4. Limit length of function

Examples


source code

To see the source code of a function just type the function without ()

boring_function

mean

Implement a function that returns the mean of a vector. Use the sum() and length() functions to achieve that:

my_mean <- function(my_vector) {
                x <- sum(my_vector)
                y <- length(my_vector)
                x/y }
my_mean(c(4, 5, 10))
[1] 6.333333

default arguments

You’re going to write a function called “remainder” where remainder() will take two arguments: “num” and “divisor” where “num” is divided by “divisor” and the remainder is returned.

Imagine that you usually want to know the remainder when you divide by 2, so set the default value of “divisor” to 2.

Please be sure that “num” is the first argument and “divisor” is the second argument.

  • Hint #1: You can use the modulus operator %% to find the remainder.
  • Example: 7 %% 4 evaluates to 3.

remainder <- function(num, divisor = 2) { 
                num %% divisor }
remainder(5)
[1] 1
remainder(11,5)
[1] 1
remainder(divisor = 11, num = 5)
[1] 5
#to list the arguments of a function
args(remainder) 
function (num, divisor = 2) 
NULL

simple f

This function is just going to take two numbers, add them and return the answer.

add2 <- function(x,y){ x + y }
  • I didn’t have to do anything special to return the answer
  • R returns the value of the last expression and since the last expression is the addition (the only expression), that gets returned
  • Now let’s run the above code so R knows the function before we use it
  • Now let’s test it
add2(3,8)
[1] 11

subset f

This function will return a subset of numbers that are greater than 10 given a vector. Let’s call it above10.

above10 <- function(x){
                use <- x>10
                x[use] }
# x > 10 tests every element of x and stores the TRUE/FALSE results in use,
# then x[use] keeps only the elements where use is TRUE and returns them
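
A quick check of above10, using a made-up input vector:

```r
above10 <- function(x) {
  use <- x > 10      # logical vector, one TRUE/FALSE per element
  x[use]             # keep only the elements where use is TRUE
}

result <- above10(c(4, 12, 9, 25, 10))
print(result)
```

Note that 10 itself is dropped, because the comparison is strictly greater than.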

func inside f

We already explained earlier on this page how to pass a function into another function; here we’ll use the same idea to extract elements of a vector. Do you remember evaluate()?

evaluate <- function(func, dat){
                func (dat) }
# this call will take 6 and add 1 to it
evaluate(function(x){x+1},6) 
[1] 7

Again, remember that everything to the left of the comma is the first argument, and we can pass it any function: sd, mean, sum, or an anonymous function we create on the spot. Here the anonymous function extracts the first element of the vector supplied as the second argument, c(8,4,0):

evaluate(function(x){x[1]},c(8,4,0))
[1] 8

What if we want the last element of the vector?

evaluate(function(x){x[length(x)]},c(8,4,0))
[1] 0

manual f

Let’s make the above10 function more useful and allow the user to specify both the conditional value and the input vector

above <- function(x,n){
                use <- x > n
                x[use] }

Let’s test it with:

#set x first
x <- 1:25
above(x,13)
 [1] 14 15 16 17 18 19 20 21 22 23 24 25

default values f

Let’s say we want to guard against the user forgetting to input the conditional value for n. Or let’s say 10 is expected to be used 95% of the time, so we don’t want users to have to type it. Let’s set a default value for n. How do we do that? Same as before except for one addition:

above <- function(x,n = 10){ 
                use <- x > n 
                x[use] }

Let’s test it without a value for n and see what happens:

x <- 1:25
above (x)
 [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

ncol mean()

This one is a little more complicated, as we take our argument, loop through each column, and calculate the mean of each one.

  • Let’s call it colmean.
  • Input will be a matrix or data frame.
  • We go through each column and calculate its mean.
  • Return the means of all the columns.

colmean <- function(y){ 
                nc <- ncol(y)            #ncol gives us the number of columns 
                means <- numeric(nc)     #since we have multiple columns we need to store
                                         #the mean of each in a vector, and the length of
                                         #the vector is the same as (nc)
                for (i in 1:nc){ 
                        means[i] <- mean(y[,i])}
                means }  
# y[,i] selects all rows of column i, so means[i] is assigned the mean of
# column i; after the loop the function returns the vector of means

Let’s test it with the airquality dataset

colmean(airquality)
[1]        NA        NA  9.957516 77.882353  6.993464 15.803922

You will notice that the first 2 columns give us NA, because mean() returns NA for any column that contains an NA. The data frame has 6 columns in total.

removeNA

Let’s say we still want to calculate the mean of the values that are present in those two columns from the above example.

  • Add a removeNA argument with a default value, just as we did earlier.
  • mean() accepts an na.rm argument, so pass removeNA on to it.
  • See below.

colmean <- function(y, removeNA = TRUE){
                nc <- ncol(y)            #ncol gives us the number of columns
                means <- numeric(nc)     #since we have multiple columns we need to store
                                         #the mean of each in a vector, and the length of
                                         #the vector is the same as (nc) 
                for (i in 1:nc){ 
                        means[i] <- mean(y[,i], na.rm = removeNA)}
                means }

Test it again on the same dataset

colmean(airquality)
[1]  42.129310 185.931507   9.957516  77.882353   6.993464  15.803922

Now you get the mean of whatever non-missing values are in the first two columns. If you want to override the default and keep the NA values in the calculation, pass FALSE as the second argument, like this:

colmean(airquality,FALSE) 
[1]        NA        NA  9.957516 77.882353  6.993464 15.803922
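
For comparison, base R can produce the same column means without an explicit loop. This sketch repeats colmean so that it runs on its own, then uses sapply() and colMeans() as alternatives; it is offered as a cross-check, not as a replacement for the exercise:

```r
colmean <- function(y, removeNA = TRUE) {
  nc <- ncol(y)
  means <- numeric(nc)
  for (i in seq_len(nc)) {
    means[i] <- mean(y[, i], na.rm = removeNA)
  }
  means
}

loop_means   <- colmean(airquality)
sapply_means <- sapply(airquality, mean, na.rm = TRUE)   # named vector, one mean per column
builtin      <- colMeans(airquality, na.rm = TRUE)       # base R's column-mean function

print(loop_means)
print(sapply_means)
```

Both alternatives keep the column names, which makes the output easier to read than the unnamed vector returned by the loop.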

… paste f

Usually the ellipsis is either the first or the last argument of a function. Remember that every argument following the … MUST be named. We’ll work with paste() as an example, which has the following argument list:

paste(..., sep = " ", collapse = NULL)

Since we don’t know what the … will include, we can put a fixed value at the start, in front of the …, like this:

simon_says <- function(...){ paste("Simon says:", ...) }

So let’s make our own telegram function, whose output starts with START and ends with STOP.

telegram <- function(...){ 
                paste("START", ... , "STOP") }
telegram("Hello Yasha!")
[1] "START Hello Yasha! STOP"

unpacking … f

  • We’ll first unpack the ellipsis into a list
  • Assign the list to a variable
  • Assume that there are three named args within the list: “place”, “adjective” & “noun”
  • Extract the named args into their own variables
  • Paste the extracted values into the sentence
mad_libs <- function(...){   # Do your argument unpacking here!
                args <- list(...)
                place <- args[["place"]]
                adjective <- args[["adjective"]]
                noun <- args[["noun"]]
                paste("News from", place, "today where", adjective,
                      "students took to the streets in protest of the new",
                      noun, "being installed on campus.") }
mad_libs("x","x","x")
[1] "News from  today where  students took to the streets in protest of the new  being installed on campus."
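
The call above passes its arguments without names, so args[["place"]], args[["adjective"]], and args[["noun"]] find nothing and the blanks stay empty. A sketch of the same function called with named arguments (the three values are made up for illustration):

```r
mad_libs <- function(...) {
  args <- list(...)                     # unpack the ellipsis into a list
  place <- args[["place"]]
  adjective <- args[["adjective"]]
  noun <- args[["noun"]]
  paste("News from", place, "today where", adjective,
        "students took to the streets in protest of the new",
        noun, "being installed on campus.")
}

# naming each argument lets the lookups by name succeed
story <- mad_libs(place = "Baltimore", adjective = "sleepy", noun = "fountain")
print(story)
```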