args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...)
NULL
A function is a body of reusable code that performs a specific task in R. A function call begins with the function’s name, such as print or paste, and is usually followed by one or more arguments in parentheses. For help on a function, type ?functionName
I’ll start by opening a new R script file in RStudio and creating my first function.
boring_function('My first function')
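The definition of boring_function is not shown on this page. As a minimal sketch, assuming it simply returns whatever argument it is given (the body below is an assumption made for illustration), it could look like this:

```r
# Assumed definition: return the argument unchanged
boring_function <- function(x) {
  x
}

boring_function('My first function')
```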
If you want to read the documentation for a function or a package or … just type
?paste
To list the arguments of a function just use
args(boring_function)
To view the code of a function just use the function name without the parentheses
boring_function
Let’s start with some general definitions before we dive into examples of functions.
Functions have named arguments which potentially have default values.
R function arguments can be matched positionally or by name. Take sd(), the standard deviation function, as an example.
- sd() takes two inputs: x, which could be a vector, and na.rm, which controls whether NA values are removed before the calculation.
- The default value for na.rm is FALSE, so leaving it out does not cause an error.
- If our data is stored in mydata, we can pass it by name as x = mydata.
- If we name the arguments, we can switch their order.
You can also mix positional matching with matching by name.
- If we name all arguments except one, R matches the unnamed one to the last remaining argument.
- If we pass one named argument out of order, R matches all remaining arguments positionally.
- This comes in handy when a function has a long list of arguments and we don’t remember their order: we name the ones we know and let R match the rest positionally.
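The equivalent calls to sd() can be sketched as follows. The vector mydata is hypothetical sample data generated only for this illustration.

```r
set.seed(42)                 # only so the example is reproducible
mydata <- rnorm(100)         # hypothetical sample data

r1 <- sd(mydata)                       # positional, na.rm left at its default
r2 <- sd(x = mydata)                   # x matched by name
r3 <- sd(x = mydata, na.rm = FALSE)    # both matched by name
r4 <- sd(na.rm = FALSE, x = mydata)    # named, order switched
r5 <- sd(na.rm = FALSE, mydata)        # na.rm by name, mydata by position

c(r1, r2, r3, r4, r5)        # five copies of the same value
```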
Now let’s call the function lm() in a couple of ways (remember, we already listed the arguments of lm()). In the first call:
- We put the data first and named it so that R recognizes it.
- We continued with positional matching by putting the formula second, so R gives it to the first unmatched argument, formula.
- subset is next in the position list, but we used model = TRUE instead, which is matched by name.
- subset is still next in the position list, so we put our subset, 1:100, here.
- We omitted the rest of the arguments, since some are not necessary and some have default values.
In the second call:
- We positionally matched formula, data, and subset.
- We then skipped ahead to match model by name and left out the rest.
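The two calls described above might look like the sketch below. The data frame mydata and its columns x and y are hypothetical and are simulated here only so that the calls can run.

```r
set.seed(1)
# Hypothetical data: 200 rows so that subset = 1:100 is meaningful
mydata <- data.frame(x = rnorm(200))
mydata$y <- 2 * mydata$x + rnorm(200)

# First call: data by name, formula by position, model by name, subset by position
fit1 <- lm(data = mydata, y ~ x, model = TRUE, 1:100)

# Second call: formula, data and subset by position, model by name
fit2 <- lm(y ~ x, mydata, 1:100, model = TRUE)

coef(fit1)
coef(fit2)   # same coefficients as fit1
```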
Sometimes at the command line we want to be quick, so we type partial argument names. If the partial name uniquely matches one argument, R assigns the value to it. Be careful when several arguments share the same first 2-3 letters: if a partial name matches more than one argument, R stops with an error instead of picking one. R checks for an exact name match first, then a partial match, and finally falls back to positional matching.
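Here is a small sketch of partial matching. The function f and its arguments value and verbose are made up for this illustration.

```r
# na is a unique partial match for na.rm, so this works
partial <- sd(c(1, 2, NA, 4), na = TRUE)
full    <- sd(c(1, 2, NA, 4), na.rm = TRUE)
identical(partial, full)

# Hypothetical function with two arguments that both start with "v"
f <- function(value, verbose = FALSE) value

# v = 1 could mean value or verbose, so R raises an error
msg <- tryCatch(f(v = 1), error = function(e) conditionMessage(e))
msg
```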
Sometimes a function has several arguments but in fact only uses one or two. So what happens to the 3rd, 4th, or 5th argument if the function only uses 2? R evaluates arguments lazily, that is, only when they are actually needed. If we pass the 2 arguments the function uses, it is satisfied, and the unused ones are never evaluated.
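A minimal sketch of this behavior, using two hypothetical functions f and g:

```r
# Hypothetical function: b and c are declared but never used
f <- function(a, b, c) {
  a^2
}
result <- f(3)     # no error, even though b and c are missing
result

# Here b is actually needed, so leaving it out fails
g <- function(a, b) {
  a + b
}
err <- tryCatch(g(1), error = function(e) "error: b is missing")
err
```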
The ... argument stands for a variable number of arguments that are usually passed on to other functions. It is used when extending another function and you don’t want to retype its entire argument list. This often happens with plotting functions, which have long argument lists: we spell out the arguments we are using and let ... carry the rest.
Any arguments that appear after ... in the argument list MUST be named explicitly and in full; they cannot be matched partially or positionally.
Often we use ... when the number of arguments cannot be known in advance. For example, if we write a function for pasting or concatenation, we don’t know how many items the user is planning to paste or concatenate. That is why the argument list of paste() starts with ...: it marks the place where the items to be pasted go. The same is true of the argument list of cat(), the concatenate function.
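As a small sketch of passing arguments along, here is a hypothetical wrapper my_mean around mean(). It uses mean() rather than a plotting function only so that the result is easy to inspect.

```r
# Hypothetical wrapper around mean(): x is handled here,
# everything else is forwarded untouched through ...
my_mean <- function(x, ...) {
  mean(x, ...)
}

values <- c(1, 2, 3, NA, 100)
plain   <- my_mean(values)                 # NA, because of the missing value
cleaned <- my_mean(values, na.rm = TRUE)   # na.rm travels through ...
```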
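The sketch below checks where ... sits in paste() and cat(), and shows what happens when an argument after ... is not spelled out in full.

```r
# The first formal argument of both functions is ...
first_paste <- names(formals(paste))[1]
first_cat   <- names(formals(cat))[1]

# sep comes after ..., so it must be written out in full
named   <- paste("a", "b", sep = ":")   # "a:b"
partial <- paste("a", "b", se = ":")    # se is swallowed by ... and pasted: "a b :"
```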
How does R know which value to assign to which symbol? Suppose I define my own lm. How does R know what value to assign to lm? Why doesn’t it use the lm that’s in the stats package? And when I call lm in another part of my code, which one does it use?
Let’s create a new function and call it lm(). You can see that this could cause an issue, since there is already another function called lm(). R resolves this by searching a sequence of environments for a binding.
When R tries to bind a value to a symbol, it searches through a series of environments to find the appropriate value. Typing search() in R shows that series:
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
So when R looks up a symbol, it goes through the list above in the order shown: first the global environment (.GlobalEnv), then each attached package in turn, with the base package last. The first match wins.
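A minimal sketch of that lookup order, assuming we define a toy lm() of our own (its body is arbitrary and chosen only for illustration):

```r
# Hypothetical lm() defined in the workspace; it masks stats::lm
lm <- function(x) {
  x * x
}

mine <- lm(3)                    # our own lm is found first

# The original is still reachable through its namespace
is_stats <- identical(environment(stats::lm), asNamespace("stats"))

rm(lm)                           # remove our version so lm() means stats::lm again
```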
In R, a free variable is a symbol that appears in a function but is not a formal argument or a local variable defined within the function body. These free variables are not explicitly passed as arguments to the function, yet they influence its behavior. The scoping rules in R determine how values are associated with these free variables.
Lexical scoping means that the values of free variables are searched for in the environment in which the function was defined. That environment is called the function’s enclosing environment. The parent frame, by contrast, is the environment from which the function was called.
Scoping rules are the main feature that makes R different from the S language, its parent language. So, what are the scoping rules?
- Scoping rules determine how a value is associated with a free variable in a function.
- R uses lexical scoping, also called static scoping. A common alternative is dynamic scoping.
- Scoping rules also govern how R uses the search list to bind a value to a symbol.
- Lexical scoping turns out to be particularly useful for simplifying statistical computations.
Consider a function with the following properties:
- It has 2 formal arguments, x and y.
- Its body uses another variable, z.
- In this case z is called a free variable.
- The scoping rules of a language define how values are assigned to free variables.
- Free variables are not formal arguments and are not local variables (assigned inside the function body).
Let’s define an environment:
- An environment is a collection of (symbol, value) pairs; for example, x is a symbol and 3.14 might be its value.
- Every environment has a parent environment, and an environment can have multiple children.
- The only environment without a parent is the empty environment.
- A function + an environment = a closure, or function closure.
Similar to symbol binding above, R follows these steps when searching for the value of a free variable:
- It first looks in the environment in which the function was defined.
- If the value is not found there, it moves on to the parent environment, and so on up to the top-level environment, which is usually the global environment.
- The search then continues down the search list until R finds the value or hits the empty environment. If the value is still not found, R gives an error.
Next I’ll create a function, make.power, that defines another function inside it and returns it.
- Inside make.power there is another function, pow.
- pow (the child/inner function) has a free variable n that is not defined in pow itself.
- n is defined in the outer/parent function, as it is the argument of make.power.
- The value of n is supplied when we call make.power; with make.power(3), n = 3.
- The outer function returns pow, so in a sense we are returning a function as the value of another function.
- So calling make.power(3) returns the function pow, printed as:
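A function of this shape could be sketched as follows. The body and the value of z are assumptions made for illustration.

```r
z <- 2            # defined outside the function, in the workspace

# x and y are formal arguments; z is a free variable
f <- function(x, y) {
  x^2 + y / z
}

result <- f(3, 4)   # 3^2 + 4/2 = 11
```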
function(x){x^n}
<environment: 0x0000027117e917b0>
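The definition described above is not shown here; a sketch consistent with the description and the printed output would be the following. The names cube and square are chosen only for this example.

```r
make.power <- function(n) {
  pow <- function(x){x^n}   # n is a free variable inside pow
  pow                       # return the inner function
}

cube   <- make.power(3)     # a function that raises its input to the power 3
square <- make.power(2)

cube(3)     # 27
square(3)   # 9
```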
How do we know what is in a function’s environment? environment() returns the environment of a function, and ls() lists the objects in it.
Lexical scoping helps us write the objective function for an optimization in a clean, readable manner. Sometimes you fix a certain parameter and optimize over the other parameters.
You can pass a function as an argument to another function, just as you can pass data to a function, and a function can also return a function; hence the term constructor function. If it seems too complicated at first, look at the examples that follow, which take you through it step by step.
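For example, assuming make.power is defined as sketched below, we can inspect the environment of the function it returns:

```r
make.power <- function(n) {
  pow <- function(x){x^n}
  pow
}
cube <- make.power(3)

objs  <- ls(environment(cube))          # objects in the closure's environment
n_val <- get("n", environment(cube))    # the value of n captured for cube

objs
n_val
```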
The constructor function constructs the objective function.
If most of a function’s calculations and number crunching can be done with only minimal input, you can create one function inside another. The hope is that the inner function does most of the heavy lifting, using the many arguments and calculations that are needed regardless of what values are passed later. This way, instead of calling the parent function with a whole list of arguments every time, we limit that input without giving up functionality.
Here is the negative log-likelihood function being minimized:
- Inside the constructor function I define a function that takes an argument called p for the parameters.
- p is the parameter vector that I want to optimize over.
- The function returns the negative log-likelihood for a normal distribution, and I want to fit my data to this normal distribution.
- A normal distribution has two parameters, the mean, mu, and the standard deviation, sigma, so those are the two parameters I want to optimize over.
- I define the log-likelihood and take its negative so that I can minimize it.
- The constructor function returns this function as its return value.
make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) {
params <- fixed
function(p) {
params[!fixed] <- p
mu <- params[1]
sigma <- params[2]
a <- -0.5*length(data)*log(2*pi*sigma^2)
b <- -0.5*sum((data-mu)^2) / (sigma^2)
-(a + b) }
}
set.seed(1); normals <- rnorm(100, 1, 2)
nLL <- make.NegLogLik(normals)
nLL
function(p) {
params[!fixed] <- p
mu <- params[1]
sigma <- params[2]
a <- -0.5*length(data)*log(2*pi*sigma^2)
b <- -0.5*sum((data-mu)^2) / (sigma^2)
-(a + b) }
<bytecode: 0x000002711924dd20>
<environment: 0x0000027118c9bab0>
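One way to use the constructed function is to hand it to R’s built-in optimizers optim() and optimize(). The sketch below is an illustration only: the starting values c(mu = 0, sigma = 1), the fixed value sigma = 2, and the search interval c(-1, 3) are assumptions chosen for this example.

```r
make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) {
  params <- fixed
  function(p) {
    params[!fixed] <- p
    mu <- params[1]
    sigma <- params[2]
    a <- -0.5*length(data)*log(2*pi*sigma^2)
    b <- -0.5*sum((data-mu)^2) / (sigma^2)
    -(a + b)
  }
}

set.seed(1); normals <- rnorm(100, 1, 2)

# Estimate both parameters, starting from mu = 0 and sigma = 1
nLL <- make.NegLogLik(normals)
both <- optim(c(mu = 0, sigma = 1), nLL)$par
both

# Hold sigma fixed at 2 and optimize over mu only
nLL_mu <- make.NegLogLik(normals, c(FALSE, 2))
mu_hat <- optimize(nLL_mu, c(-1, 3))$minimum
mu_hat
```

In both cases the estimate of mu should land close to the sample mean of normals, since that is the maximum likelihood estimate of the mean of a normal distribution.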
Let’s create a function called evaluate(): if a function is passed into the func argument and some data (like a vector) is passed into the dat argument, evaluate() returns the result of dat being passed as an argument to func.
In other words, func() processes dat and hands the result back to the outer function, evaluate(), which returns it.
So let’s just worry about creating the outer function for now. We’ll write our own inner function later; for now we’ll pass an existing function into the outer function, which we’ll call evaluate()
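A minimal sketch of evaluate(), called with the existing function sd(). The numbers in the vector are arbitrary sample data.

```r
evaluate <- function(func, dat) {
  func(dat)        # apply whatever function was passed in to the data
}

# Pass the existing function sd() as func
result <- evaluate(sd, c(1.4, 3.6, 7.9, 8.8))
result
```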
In the example above we passed sd() as the argument func. sd() is already a predefined function. But what if we want to use our own function instead?
Well, if you think about it, we have already written our own function: evaluate(). Inside it we call func(), whatever function the caller supplies, and pass it the data. In our call, the function supplied happened to be the predefined function sd().
Now let’s make up our own function and define it on the fly. Remember that evaluate(func, dat) takes two arguments, and the first (left) argument is a function. So instead of passing it a pre-existing function, let’s say we want x + 1 as the function.
As you already know, the code for such a function is function(x){ x + 1 }. A function used like this, without being defined and named in advance, is called an anonymous function. Just put it in the first argument of evaluate(func, dat) and supply a second argument for dat, let’s say 6: evaluate(function(x){x+1}, 6)
If you intend to perform a short, simple calculation over and over again, you can create your own binary operator instead of an ordinary function. Let’s say I wanted to create an operator that multiplies two numbers and adds one to the result.
The syntax is simple: “%whatever%”, where whatever is any name you choose for the operator.
Let’s create this animal:
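A sketch of such an operator follows. The name %mult_add_one% is an assumption chosen for this example.

```r
# The operator name must be quoted when it is defined
"%mult_add_one%" <- function(left, right) {
  left * right + 1
}

result <- 4 %mult_add_one% 5    # 4 * 5 + 1 = 21
```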
Let’s create one that pastes, instead of using the paste() function and all its arguments.
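A possible version, with the operator name %p% chosen only for illustration:

```r
"%p%" <- function(left, right) {
  paste(left, right)      # paste with the default separator, a single space
}

sentence <- "I" %p% "love" %p% "R!"
sentence
```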
To see the source code of a function just type the function without ()
Implement a function that returns the mean of a vector. Use the sum() and length() functions to achieve that:
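One possible solution, with the names my_mean and my_vector chosen for this sketch:

```r
my_mean <- function(my_vector) {
  sum(my_vector) / length(my_vector)
}

result <- my_mean(c(4, 5, 10))   # 19 / 3
```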
You’re going to write a function called “remainder” where remainder() will take two arguments: “num” and “divisor” where “num” is divided by “divisor” and the remainder is returned.
Imagine that you usually want to know the remainder when you divide by 2, so set the default value of “divisor” to 2.
Please be sure that “num” is the first argument and “divisor” is the second argument.
- Hint #1: You can use the modulus operator %% to find the remainder.
- Example: 7 %% 4 evaluates to 3.
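One possible solution that follows the requirements above:

```r
remainder <- function(num, divisor = 2) {
  num %% divisor
}

r1 <- remainder(5)                      # divisor defaults to 2, result 1
r2 <- remainder(11, 5)                  # result 1
r3 <- remainder(divisor = 11, num = 5)  # named, so order does not matter; result 5
```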
This function is just going to take two numbers, add them and return the answer.
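A minimal sketch, assuming the function is called add2 (the name is not given in the text):

```r
add2 <- function(x, y) {
  x + y
}

result <- add2(3, 5)   # 8
```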
This function will return a subset of numbers that are greater than 10 given a vector. Let’s call it above10.
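One way to write above10, using logical subsetting. The test vector is arbitrary.

```r
above10 <- function(x) {
  use <- x > 10     # logical vector marking elements greater than 10
  x[use]
}

result <- above10(c(3, 12, 10, 25))
result
```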
We already explained earlier on this page how to use one function inside another; here we’ll use one to extract a subset as well. Do you remember evaluate()?
evaluate <- function(func, dat){
  func(dat)
}
# this call will take 6 and add 1 to it
evaluate(function(x){x+1},6)
[1] 7
Again, remember that everything to the left of the comma is the first argument, and we can pass it any function: sd, mean, sum, or an anonymous function created on the spot. For instance, we can pass an anonymous function that extracts the first element of the vector supplied as the second argument, here c(8,4,0)
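Extracting the first element could be sketched like this:

```r
evaluate <- function(func, dat){
  func(dat)
}

# The anonymous function returns the first element of its input
first <- evaluate(function(x){x[1]}, c(8,4,0))
first
```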
What if we want the last element of the vector?
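A sketch for the last element, indexing with length(x):

```r
evaluate <- function(func, dat){
  func(dat)
}

# x[length(x)] picks the last element, whatever the length of the vector
last <- evaluate(function(x){x[length(x)]}, c(8,4,0))
last
```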
Let’s make the above10 function more useful and allow the user to specify both the cutoff value and the input vector
Let’s test it with:
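A sketch of the generalized function and a test call. The name above and the test values are assumptions chosen for this example.

```r
above <- function(x, n) {
  use <- x > n      # TRUE where the element exceeds the cutoff n
  x[use]
}

result <- above(1:20, 12)
result
```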
Let’s say we want to guard against the user forgetting to supply the conditional value n. Or let’s say 10 is expected to be used 95% of the time and we expect users to leave it out. Let’s set a default value for n. How do we do that? Same as before, with one addition:
Let’s test it without a value for n and see what happens:
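A sketch with a default value for n, again assuming the function is called above:

```r
above <- function(x, n = 10) {   # n defaults to 10 when it is not supplied
  use <- x > n
  x[use]
}

default_n <- above(1:20)         # uses n = 10
custom_n  <- above(1:20, 15)     # the default can still be overridden
```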
This one is a little more complicated: we take our argument, loop through each column, and calculate the mean of each one.
- Let’s call it colmean.
- The input will be a matrix.
- We go through each column and calculate its mean.
- We return the means of all the columns.
colmean <- function(y){
  nc <- ncol(y)           # ncol gives us the number of columns
  means <- numeric(nc)    # one slot per column to store its mean
  for (i in seq_len(nc)){
    # y[, i] selects all rows of column i
    means[i] <- mean(y[, i])
  }
  means                   # return the vector of column means
}
Let’s test it with the airquality dataset
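The test could look like the sketch below. colmean is repeated so that the example runs on its own; airquality ships with R’s datasets package.

```r
colmean <- function(y){
  nc <- ncol(y)
  means <- numeric(nc)
  for (i in seq_len(nc)){
    means[i] <- mean(y[, i])
  }
  means
}

res <- colmean(airquality)
res    # one mean per column; columns containing NA give NA
```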
You will notice that the first 2 columns give us NA, because the mean of a column that contains an NA cannot be calculated. The dataset has 6 columns in total.
Let’s say we still want the mean of the available values in those two columns from the above example.
- Add a parameter removeNA with a default value, just as we did earlier.
- mean() accepts an na.rm parameter, so pass removeNA to it.
- See below.
colmean <- function(y, removeNA = TRUE){
  nc <- ncol(y)           # ncol gives us the number of columns
  means <- numeric(nc)    # one slot per column to store its mean
  for (i in seq_len(nc)){
    means[i] <- mean(y[, i], na.rm = removeNA)
  }
  means
}
Test it again on the same dataset
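With the default removeNA = TRUE, the same test might be sketched as:

```r
colmean <- function(y, removeNA = TRUE){
  nc <- ncol(y)
  means <- numeric(nc)
  for (i in seq_len(nc)){
    means[i] <- mean(y[, i], na.rm = removeNA)
  }
  means
}

res <- colmean(airquality)
res    # no NA this time: missing values are dropped before averaging
```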
There you get the mean of whatever values are present in the first two columns. If you want to override this and get NA for columns with missing values, pass FALSE for removeNA like this:
Usually the ellipsis is either the first or the last argument of a function. Remember that every argument following ... MUST be named. We’ll work with paste as an example. paste has the following argument list:
paste(..., sep = " ", collapse = NULL)
Since we don’t know what ... will include, we can put a fixed value at the start, before the ..., like this:
simon_says <- function(...){ paste("Simon says:", ...) }
So let’s make our own telegram function, whose output starts with START and ends with STOP
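Overriding the default could be sketched like this:

```r
colmean <- function(y, removeNA = TRUE){
  nc <- ncol(y)
  means <- numeric(nc)
  for (i in seq_len(nc)){
    means[i] <- mean(y[, i], na.rm = removeNA)
  }
  means
}

res <- colmean(airquality, FALSE)   # same as removeNA = FALSE
res    # the first two columns are NA again
```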
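A sketch of the telegram function described above, with arbitrary words used for the test call:

```r
telegram <- function(...) {
  # whatever the user passes is pasted between START and STOP
  paste("START", ..., "STOP")
}

msg <- telegram("Good", "morning")
msg
```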
mad_libs <- function(...){
  # Unpack the named arguments captured by ...
  args <- list(...)
  place <- args[["place"]]
  adjective <- args[["adjective"]]
  noun <- args[["noun"]]
  paste("News from", place, "today where", adjective,
        "students took to the streets in protest of the new",
        noun, "being installed on campus.")
}
mad_libs("x","x","x")
[1] "News from today where students took to the streets in protest of the new being installed on campus."
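In the call above the three values are unnamed, so args[["place"]], args[["adjective"]] and args[["noun"]] are all NULL and paste() has nothing to insert in their place. A sketch of a call with named arguments follows; the values London, beautiful and dog are arbitrary.

```r
mad_libs <- function(...){
  args <- list(...)
  place <- args[["place"]]
  adjective <- args[["adjective"]]
  noun <- args[["noun"]]
  paste("News from", place, "today where", adjective,
        "students took to the streets in protest of the new",
        noun, "being installed on campus.")
}

# Named arguments are found by name inside list(...)
named <- mad_libs(place = "London", adjective = "beautiful", noun = "dog")
named
```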