library(palmerpenguins)
library(dplyr)
Sample Data
Packages
R Data
R has built-in data for testing and we’ll cover that first:
- To access the list of default data in R use: data()
- To load the palemerpenguins data set you’ll need to load the package first then you can access it via penguins
- You’ll notice in the code above I only referenced the library(palmerpenguins) but the code will automatically load the corresponding package
Sequences
Vector
<- c(17, 14, 4, 5, 13, 12, 10)
x x
[1] 17 14 4 5 13 12 10
#set all elements of this vector that are greater than 10 to be equal 4
> 10] <- 4
x[x x
[1] 4 4 4 5 4 4 10
#set all elements >= 11 to be equal to 4
>=11] <- 4
x[x x
[1] 4 4 4 5 4 4 10
Series
Here are some examples of how to generate a series of numbers in R
1:20
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
:10 pi
[1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593
15:1
[1] 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
Sequence
seq(1,20)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
incremented
seq(0,10,by=0.5)
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
[16] 7.5 8.0 8.5 9.0 9.5 10.0
by length
seq(5,10,length=30)
[1] 5.000000 5.172414 5.344828 5.517241 5.689655 5.862069 6.034483
[8] 6.206897 6.379310 6.551724 6.724138 6.896552 7.068966 7.241379
[15] 7.413793 7.586207 7.758621 7.931034 8.103448 8.275862 8.448276
[22] 8.620690 8.793103 8.965517 9.137931 9.310345 9.482759 9.655172
[29] 9.827586 10.000000
along.with
<- seq(5,10,length=30)
my_seq seq(along.with=my_seq)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
seq_along
seq_along(my_seq)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30
Replicate
If we’re interested in creating a vector that contains 40 zeros, we can use rep(0, times = 40).
rep(0, times=40)
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0
If you want a vector of 1000 NAs
<- rep(NA, 1000) z
pattern replication
If instead we want our vector to contain 10 repetitions of the vector (0, 1, 2), we can do
rep(c(0,1,2), times=10)
[1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
design pattern
What if we want to have 0 ten times, 1 10 times, 2 ten times
rep(c(0,1,2),each=10)
[1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Sample
- Syntax:
sample(length(x), size, replace, prob)
- Let’s select 10 elements at random from these 20 values such that we don’t know how many NAs
sample(c(1,5),10, replace=TRUE)
[1] 1 1 5 1 5 1 1 1 1 5
dice roll
- Let’s simulate a six-sided dice rolled 4 times.
- Randomly select 4 numbers between 1-6 with replacement which means after selecting a number it is replaced with another random number from the sample
sample(1:6, 4, replace = TRUE)
[1] 6 6 6 1
without replacement
- In order to sample without replacement, the population size has to be larger than the sample size
- To sample 10 numbers from 1-20 without replacement, means no number appears more than once since we want 10 out of 20
- So the population has to be larger than 10 and it is 20
sample(1:20, 10)
[1] 2 7 14 8 12 6 9 15 18 1
seed
- Always set the seed so the data can be reproduced and code tested
set.seed(13435)
<- data.frame("var1"= sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15))
x x
var1 var2 var3
1 3 8 14
2 1 7 15
3 5 6 13
4 4 10 12
5 2 9 11
shuffle
- All you have to do to shuffle a sample df is to run it again (sample it again)
- Here we’ll resample the five rows and all columns
<- x[sample(1:5),]
x x
var1 var2 var3
5 2 9 11
4 4 10 12
1 3 8 14
2 1 7 15
3 5 6 13
insert NAs
- Let’s insert an NA at row 1 & 3 for var2
$var2[c(1,3)] = NA x
subset
- We already know how to subset, so let’s extract all rows from the first column
1] x[,
[1] 2 4 3 1 5
# subset first two rows of all columns
1:2,] x[
var1 var2 var3
5 2 NA 11
4 4 10 12
# of course to extract the first 2 rows from column3
1:2,"var3"] x[
[1] 11 12
# or
1:2,3] x[
[1] 11 12
probability
- What if we want to simulate a sample of coin flip and want to set the probability to hit 0 at 30% and to hit 1 at 70%
<- sample(c(0,1), 100, replace = TRUE, prob = c(0.3, 0.7))
flips flips
[1] 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
[38] 1 1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 0 1
- Let’s see if the code really held up to what we asked in 70/30
sum(flips)
[1] 75
letters
- LETTERS is predefined in R.
- If size argument is not defined R outputs the same size as the one you are sampling from
sample(LETTERS)
[1] "F" "Q" "I" "M" "A" "J" "S" "N" "V" "P" "W" "C" "H" "R" "X" "L" "Y" "K" "T"
[20] "G" "D" "O" "B" "E" "Z" "U"
runif
- runif(number of observations (n), min =0, max = 1)
- generates 20 uniformly distributed random variables between 0 and 1
- you can skip min and max unless you want to set your own values
<- runif(20)
u u
[1] 0.74162842 0.45010751 0.50547285 0.66180244 0.49817421 0.56429714
[7] 0.36579150 0.37193647 0.05800880 0.77295996 0.21509443 0.88119367
[13] 0.22669610 0.56048141 0.23787464 0.01462926 0.87771995 0.93767246
[19] 0.11087709 0.26739988
- what if we want values between 0-10 instead
runif(20,0,10)
[1] 7.3582436 6.1997381 2.2256081 5.0161484 9.5722421 2.7662729 8.9042757
[8] 2.9351149 6.7219902 0.8150495 4.3802629 6.6191075 4.2666689 9.8145996
[15] 5.9446412 0.9706589 8.5508502 8.7958897 5.2973956 4.7006625
Factors
gl
Generates factors by specifying the pattern of their levels. - Syntax: gl(n, k, length= n*k, labels=seq_len(n), ordered = FALSE) - n: integer for the number of levels - k: integer for the number of replications - length: integer giving the length of the result - labels: an optional vector of labels for the resulting factor levels - ordered: logical indicating whether the result should be ordered or not
Examples
- n: number of levels, so if we want to have a vector of values from 1-3 we set n=3
- k: so how many times we want each level replicated before we move to next level. let’s say we set k=2, that means each of the n levels (n=3) will be replicated k times before we move on to the next level.
- So if we have k=2 that means we’ll start with (1 1) that’s 2 replications, then 22 then 33…
- So we will have 11 22 33 11 22 33 so as you see each level (1-3) is replicated 2 times(k=2) till we reach the length we designed in length=20
# for 3 levels and 2 replications
gl(3,2,20)
[1] 1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3 1 1
Levels: 1 2 3
# for 3 levels and 3 replications
gl(3,3,20)
[1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1
Levels: 1 2 3
vector & factor
- Define a factor variable and use n=3 (number of levels) k=10(number of replication)
- The length will match the length of x defined below
<- gl(3,10)
f f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
tapply for mean
- Let’s calculate the mean for each level
- So tapply will take the value of each x at each level and calculate the mean
<- c(rnorm(10),runif(10),rnorm(10,1))
x tapply(x,f,mean)
1 2 3
0.0752080 0.7772303 1.1945657
tapply for range
- Let’s calculate the range of x for each level
tapply(x,f, range)
$`1`
[1] -1.438836 1.263505
$`2`
[1] 0.5770678 0.9637785
$`3`
[1] -0.8002445 3.2803129
Distributions
R includes a set of random number generators that allow us to simulate from well-know probability distributions, Normal, Poisson, binomial…
For each probability distribution there are typically four functions that start with r, d, p and q, with r the one that actually generates random numbers from that distribution. Here is the breakdown for each:
- d - for density
- r - for random number generation
- p - for cumulative distribution
- q - for quantile function (inverse cumulative distribution)
Normal
The most common probability distribution to work with is the Normal distribution which is the Gaussian distribution. Here is the syntax for norm:
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
rnorm
The rnorm() is used to generate a vector of normally distributed random numbers. - set.seed() is used with conjunction with it so we can reproduce the problem if an error occurs. - Syntax: rnorm(n, mean, sd) - n is the Number of observations (sample size) - mean, is the value of the observation data, default is zero - sd, the standard deviation, default is 1 - If you want to generate random numbers with a custom mean, you can pass it a mean value as the second argument - When you custom the mean it is wise to set.seed that makes it easier to reproduce
# for 10 random numbers from a standard distribution with mean=0 and sd=1
<- 1:10
x <- rnorm(10)
y y
[1] -1.22089371 -1.28474615 -1.12825550 -0.60341370 -0.60276400 -1.66756271
[7] 2.34430392 -0.52450935 -0.01157205 -0.78163359
plot(x,y, type="l", lwd=2)
hist(y)
- Let’s repeat it with mean=100 and sd=25
<- rnorm(10, 100, 25)
yy plot(x,yy, type="l", lwd=2)
hist(yy)
- Let’s repeat it with n=1000 mean=50 and sd=10
- Now you see the shape of the distribution after we draw enough observations
<- rnorm(1000,50,10)
z hist(z)
Poisson
A Poisson distribution is a tool that helps to predict the probability of certain events happening when you know how often the event has occurred. It gives us the probability of a given number of events happening in a fixed interval of time.
The number of diners in a certain restaurant every day. If the average number of diners for seven days is 500, you can predict the probability of a certain day having more customers.
Because of this application, Poisson distributions are used by businessmen to make forecasts about the number of customers or sales on certain days or seasons of the year. In business, overstocking will sometimes mean losses if the goods are not sold. Likewise, having too few stocks would still mean a lost business opportunity because you were not able to maximize your sales due to a shortage of stock. By using this tool, businessmen are able to estimate the time when demand is unusually higher, so they can purchase more stock. Hotels and restaurants could prepare for an influx of customers, they could hire extra temporary workers in advance, purchase more supplies, or make contingency plans just in case they cannot accommodate their guests coming to the area.
P(x; μ) = (e-μ \* μx) / x!
Where:
- The symbol “!” is a factorial.
- μ (the expected number of occurrences) is sometimes written as λ. Sometimes called the event rate or rate parameter.
Example The average number of major storms in your city is 2 per year. What is the probability that exactly 3 storms will hit your city next year?
Step 1: Figure out the components you need to put into the equation. - μ = 2 (average number of storms per year, historically) - x = 3 (the number of storms we think might hit next year) - e = 2.71828 (e is Euler’s number, a constant)
rpois
- Let’s pull 5 random values with mean=10
rpois(5, 10)
[1] 7 8 11 7 5
replicate
- Let’s replicate the same operation 100 times using replicate
<- replicate( 100, rpois(5,10)) my_pois
colMeans
- Replicate created a matrix with 100 columns and 5 rows with 5 random numbers.
- Let’s find the mean of each col
<- colMeans(my_pois)
cm cm
hist
- If we plot the histogram
hist(cm)
Binomial
The binomial distribution is a discrete probability distribution that calculates the likelihood an event will occur a specific number of times in a set number of opportunities. Use this distribution when you have a binomial random variable. These variables count how often an event occurs within a fixed number of trials. They have only two possible outcomes that are mutually exclusive.
Binomial probability distribution can answer the following questions. What’s the probability of getting: - Six heads when you toss the coin ten times? - 12 women in a sample size of 20? - 3 defective items in a batch of 100? - 2 flu infections in 20 years? This distribution is an example of a Probability Mass Function (PMF) because it calculates likelihoods for discrete random variables. It is an extension of the Bernoulli distribution that can model only 1 trial.
3 sixes in 10 rolls
Suppose you’re playing a game where rolling sixes on a die is really good. You want to know the probability of rolling exactly three sixes in ten die rolls. In this example, - the number of events is x=3 - the number of trials is N = 10 - the probability P is 1/6 = 0.1667
coin flip
- We used the same example in sample above, let’s try it with a binomial distribution here.
- A coin flip is a binary outcome, it’s either 0 or 1 and we are performing 100 trials = N, so let’s use a binomial variable, and we specify the first argument to be 1 = X and we want to see what’s the probability that 1 comes up 70% of the time P=0.7 if we run it 100 times.
- So we want to know how often does 1 come up if we repeat it 100 times if the probability is 70%
rbinom( 1, size = 100, prob = 0.7)
[1] 70
- If we want to see all the 0’s and 1’s we can request 100 observations of 1 (success) instead of 100 trials like we did above with a prob=0.7 and assign it to a variable
<- rbinom( 100, size = 1, prob = 0.7)
flips2 flips2
[1] 1 1 1 1 1 0 0 0 0 0 1 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 0 1 1 1 1
[38] 1 0 1 1 1 0 1 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 0 0 1 0 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 1
- Let’s see how many times we had
sum(flips2)
[1] 67