Chapter 10 Simulation
10.1 Description of the session goal
In this session we will learn about:
Packages
Sampling
Functions
Loops
At the end we’ll try to make a concrete use of all of it and answer an interesting question in probability.
Ready?
10.2 Find and install R packages
A nice website to find packages is https://www.rdocumentation.org
That’s how I found about the package “ukbabynames” that contains data for all baby names given at least three times between 1996 and 2015.
To install a package called “ukbabynames”, we do
install.packages("ukbabynames")
Once a package is installed in your computer, it will remain installed and so you only need to install the package again if you’re updating your version of R.
However, each time you open a new session of R or Rstudio, you need to load the package if you want to use it:
library(ukbabynames)
I’m sure you noticed that we’ve done it several times already with packages like ggplot2
or dplyr()
.
Most often we are interested in the functions of the packages we install, but a package can also contain datasets and in this case, the package “ukbabynames” contains a dataframe of the same name and this is what we will use from it:
View(ukbabynames)
QUESTION 2.1: The function say()
from the package “cowsay” is useless, but funny. Install the package “cowsay”. Then, when the installation is finished (you have a to wait a bit, like for any package you would install), load the package and try to run the command say(Hello R!)
. What happens?
???.???("cowsay")
???(cowsay)
say("Hello R!")
10.3 Sampling from data or known distributions
10.4 Sampling from data
The sample()
function allows you to choose a random sample from a list of values.
sample(1:30, 5)
## [1] 4 19 5 20 11
The first argument is the values to sample from and the second is the number of values to sample. If we omit the second argument, we get all the values, but in a random order:
sample(1:12)
## [1] 2 9 8 11 3 5 4 10 1 7 6 12
By default sample()
samples without replacement, so when you do
sample(1:5, 10)
You get an error. To sample with replacement you need to add the argument replace=TRUE:
sample(1:5, 10, replace=TRUE)
## [1] 5 2 2 3 3 4 2 2 4 5
10.5 Generating random numbers from distributions
We can also generate random numbers from known distribution.
runif (uniform)
rpois (poisson)
rnorm (normal)
rbinom (binomial)
rgamma (gamma)
rbeta (beta)
The first argument for all of these is n, the number of samples to generate. Following arguments can specify parameters for the distribution.
Here are three random numbers coming from a standard normal distribution:
rnorm(3)
## [1] -0.5218292 0.1383097 -1.1344200
here, six random numbers coming from a normal distribution with mean 5 and standard deviation 2:
rnorm(6, 5, 2)
## [1] 5.649870 6.513839 3.417192 7.307390 4.027937 4.192629
And here is 5 random numbers coming from a uniform distribution between 0 and 1:
runif(5)
## [1] 0.1959726 0.4794798 0.2787031 0.9619588 0.6556468
10.6 Reproducing samples
Note that if you run any of these commands again you’ll get a different sample. Sometimes you want your random sample to be reproducible i.e. you want to get the same random sample each time you run the script.
This is possible because the generation of randomness of any programming language is based on some algorithm that takes some “seed” number as input. It is usually based on the internal clock of your computer and so it changes all the time. But we can fix the seed manually by running the function set.seed()
with an integer inside the parenthesis
QUESTION 3.1: Produce 5 random integers between 1 and 20, without replacement. Set a seed before sampling and confirm that you get the same sample each time you run it.
???
sample(???, ???)
10.7 Sampling from data
We can sample from data in the same way as above. Here I am sampling 10 random names from the ukbabynames dataframe:
sample(ukbabynames$name, 10)
## [1] "Avery" "Esmie" "Preeti" "Luan" "Lyam"
## [6] "Faraaz" "Jett" "Kydan" "Konstantin" "Ramona"
Note also that the package dplyr contains its own functon sample_n() that allow you to sample the number of rows that you indicate as its second argument (which is the first argument if you use “%>%”):
library(dplyr)
ukbabynames %>% sample_n(8)
## year sex name n rank
## 1 2015 F Sabriyah 3 5730
## 2 2012 M Yanky 3 4805
## 3 2007 F Yasmin 385 141
## 4 2000 F Lucinda 78 405
## 5 2002 M Henrik 4 2636
## 6 2007 F Bryanna 10 2233
## 7 2005 M Shahriyar 3 3849
## 8 1999 F Shahnaz 6 2444
10.8 Making functions
10.9 When to write your own functions
If there’s already a function in R that does what you want then writing it yourself will almost definitely be less efficient as well as wasting your time. For example, writing a function to calculate the standard deviation of a column is not needed since there is already the sd()
function.
Many common things will already have function either in base R or a package, however if you find yourself doing the same specific task multiple times, writing your own function can save you copying and pasting code and this is always desirable.
We define a function that way:
myFunction <- function(){
}
and use it like any other function
myFunction()
## NULL
Yes for now our function doesn’t do anything, because there is nothing inside the brakets (note tha they are curly brackets by the way, not normal or square brackets!).
We can make the function return something, with the function return()
myFunction <- function(){
return(2)
}
myFunction()
## [1] 2
Our function just returns two each time we call it. What an awesome function!!!
Let’s put inside some code that generates the sum of 3 uniformely distributed random numbers and let’s return that sum
myFunction <- function(){
few_runif <- runif(3)
sum_few_runif <- sum(few_runif)
return(sum_few_runif)
}
myFunction()
## [1] 1.418794
Yayy! Did you see that each time you run it you get a different number?
QUESTION 4.1: Create a function called myFunctionNorm that would return the sum of 5 normally distributed random numbers. Run it to see if it works
??? <- function(){
few_rnorm <- ???(???)
sum_few_rnorm <- ???(???)
return(???)
}
???()
You may be tempted to put a number inside the parenthesis. Would myFunction(5) generate 5 random numbers?
myFunction(5)
Argh, no, because we didn’t define what the function should do with what is inside the parenthesis.
We can generate more than one number with the function replicate() though, which simply runs what we put as its second argument (second thing inside the parentheses), the number of times we indicate as its first argument:
randomNumbers <- replicate(5, myFunction())
randomNumbers
## [1] 1.2500121 0.4373939 1.9210987 2.2747847 2.3319412
If we felt that the command replicate(number, myFunction()) was a bit too long to write, we could make a function similar to rnorm(), runif() etc. that gives the size of the sample as the first argument:
rsum3unif <- function(n){
some_sum3unif <- replicate(n,myFunction())
return(some_sum3unif)
}
rsum3unif(8)
## [1] 2.247655 1.021067 1.093353 1.157237 1.313873 1.372789 1.183770 1.415288
And here it is, we made a function that takes a number as its first (and only) argument, which it then used as the first argument of the function replicate, to generate the required number of random numbers (made from the sum of 3 uniformely distributed reandom numbers).
QUESTION 4.2: the function below is supposed to print in the console a smiley that says the words given as its argument. What would you need to write in place of ??? for it to work? (note: the function cat() prints what’s inside its parenthesis on the screen. “\\n” tells R to go to the next line). Try it
smileysay <- function(???){
cat("----------\n")
cat(words)
cat("\n----------")
cat("\n ")
cat("\n ^ ^ ")
cat("\n | ")
cat("\n ___ ")
cat("\n | | ")
cat("\n --- ")
}
smileysay("Hello R!")
10.10 Iterations (loops)
Here is a simple loop for:
for(i in 3:9) {
print(i)
}
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
The first part: (i in 3:9) provides the set of values to loop over. What is inside the curly brackets are the things that we are asking R to do at each iteration of the loop. Here we are asking to print i.
So we can read the above code as “i will iteratively take all the values from 3 to 9 and for each iteration i should be printed on the screen”.
Instead of 3:9 we could put any kind of list of elements, and we don’t need i to be called i. For example let’s loop over a list of 10 random numbers, with the iteration variable called rnumber instead of i:
randomNumbers<- runif(10)
for(rnumber in randomNumbers) {
print(rnumber)
}
## [1] 0.7378243
## [1] 0.5278535
## [1] 0.9049213
## [1] 0.2462606
## [1] 0.9035252
## [1] 0.872111
## [1] 0.6422247
## [1] 0.5169815
## [1] 0.9433067
## [1] 0.5461632
QUESTION 5.1: Make a loop with 12 iterations that generates and print at each iteration a random number coming from a standard normal distribution . Call the iteration variable “iter”.
for(??? in 1:???) {
random_number <- ???(1)
print(???)
}
In addition to being able to make loops, we will need to know how to check whether the name found in a box is or not the the name of the prisoner. We will do that with the function if()
If() works like the function for(), but is a bit simpler: if what is inside the parenthesis is true, then what is inside the curly brackets is performed. Let’s generate a random number and check if it is larger than 0:
aNummber <- rnorm(1)
aNummber
## [1] 0.8235913
if(aNummber>0) {
print("it is larger than 0")
}
## [1] "it is larger than 0"
If you run the above code multiple times, you will see that sometimes something will be printed and sometimes not.
We can also ask R to perform an action when the condition inside the parenthesis is false, using the else command:
aNummber <- rnorm(1)
aNummber
## [1] 1.219121
if(aNummber>0) {
print("it is larger than 0")
} else{
print("it is negative!")
}
## [1] "it is larger than 0"
Also note that there exists another function called “ifelse()” that allows us to check a same condition on multiple elements at the same time. It works more like a normal function, the first argument being the condition to check, the second is what the function must return if the condition is checked and the third, what it must return if it is not. Let’s generate 10 random numbers, check whether they are greater than 0, and for each of them, return 1 if this condtion is true and 0 if it is not:
fewNummbers <- rnorm(10)
fewNummbers
## [1] 0.48957909 -1.22477895 -0.86590124 -0.03382881 1.81311364
## [6] -1.00501278 1.27771063 1.06059840 0.04733959 -0.04984811
ifelse(fewNummbers>0, 1, 0)
## [1] 1 0 0 0 1 0 1 1 1 0
Do you see that you could achieve the same thing by looping over the 10 random number? Actually, R contains lots of functions like ifelse() that we could replace with loops but these functions are optimize and therefore much faster than using loops. Very often you’ll be able to avoid using loops, and you should try so. They will only be unavoidable in some very specific situations.
QUESTION 5.2: can you produce the same sequence of 0 and 1 as above, but using a loop for() and the if() function?
for(i in ???){
if(???){
print(???)
}
else{
print(???)
}
}
10.11 Puttng it all together
Here is a version of a very hard probability problem, with a very unintuitive result, that was first published by Peter Bro Miltersen and Anna Gal in the Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP):
The names of 100 prisoners are placed in 100 wooden boxes, one name to a box, and the boxes are lined up and numbered on a table in a room. One by one, the prisoners are led into the room; each may look in at most 50 boxes, but must leave the room exactly as he found it and is permitted no further communication with the others. The prisoners have a chance to plot their strategy in advance, and they are going to need it, because unless every single prisoner finds his own name all will subsequently be executed.
QUESTION Can you guess what is the probability of success of the best strategy?
One simple strategy would be for each prisoner to open 50 boxes at random. Since there are 100 boxes in total, each prisoner will have a probability of 1/2 to open the box that contains his name. Since each prisoner chooses randomly, the probability of success of the prisoners are independant and so the probability that all the prisoners will manage to open the box that contains their name is:
(1/2)^100
## [1] 7.888609e-31
… quite low!
Now the best strategy is quite simple, but understanding why it is the best is not at all obvious. It requires quite heavy math thinking about permutations and cycles. The same goes for calculating the associated probability of success. What we can do though is make some simulations to get an approximation of it.
The best strategy is the following: First, the prisoners should agree on a random assignment of each box to a prisoner’s name. We can imagine they all have in mind an imaginary common labelling of the 100 boxes.
Then each prisoner opens the box assigned to him. If the box contains his name, great. Otherwise, he opens the box labelled with the name found in this first box, and then he opens the box labelled with the name found in the second box etc. until he has either found his name, or opened 50 boxes.
10.12 Simulation of the best strategy
First, we need 100 names for our prisoners though. Here they are:
set.seed(79)
list_prisoners <- sample(ukbabynames$name, 100)
list_prisoners
## [1] "Abela" "Taher" "Fynlay" "Gisele" "Wania"
## [6] "Adelaide" "Mansoor" "Larissa" "Hayley" "Iona"
## [11] "Anesh" "Ashir" "Luisa" "Shakti" "Scarlett"
## [16] "Evie" "Tiger" "Geoffrey" "Cassie" "Shariq"
## [21] "Alarna" "Rawdah" "Hadassah" "Ikram" "Amia"
## [26] "Jamieleigh" "Demilade" "Kovan" "Demi-Rae" "Azim"
## [31] "Gaurav" "Christos" "Jarin" "Domenico" "Manaal"
## [36] "Elijah" "Katie-Mae" "Abdulhakeem" "Krystal" "Nana"
## [41] "Stacie" "Yaw" "Kay" "Lily-Ann" "Sola"
## [46] "Iara" "Honie" "Liana" "George" "Kaylem"
## [51] "Khushal" "Bushra" "Yasa" "Filipa" "Braidon"
## [56] "Lois" "Daisy-may" "Esmay" "Alisdair" "Sunaina"
## [61] "Muhanad" "Huw" "Taegan" "Catelynn" "Byron"
## [66] "Gage" "Delphi" "Faiz" "Shaam" "Zafir"
## [71] "Rivaan" "Haajira" "Amye" "Warrick" "Sajan"
## [76] "Ephraim" "Evie-Lea" "Ayush" "Maisie" "Ugochi"
## [81] "Samad" "Bhavika" "Delara" "Monet" "Kalani"
## [86] "Aashi" "Hajira" "Ishika" "Isra" "Kerensa"
## [91] "Samy" "Mariam" "Tinotenda" "Shiloh" "Glory"
## [96] "Diego" "Dayton" "Jeorgie" "Promise" "Payal"
What we will need to do then is sample from this list to randomly place the names of the prisoners in the boxes:
name_inside_box <- sample(list_prisoners)
name_inside_box[1:10]
## [1] "Gaurav" "Jeorgie" "Ashir" "Bhavika" "Dayton" "Azim"
## [7] "Jarin" "Amia" "Hadassah" "Stacie"
So for this simulation, in the first box there is the name name_inside_box[1]
, in the second one, name_inside_box[2]
etc.
And we can do the same to simulate the imaginary random labelling of the boxes by the prisoners when they develop their strategy:
label_on_box <- sample(list_prisoners)
label_on_box[1:10]
## [1] "Haajira" "Ishika" "Lily-Ann" "Delara" "Ashir" "Kerensa"
## [7] "Demilade" "Tiger" "Bhavika" "Evie"
And so label_on_box[1]
will open the first box, where he will find the name name_inside_box[1]
, which is not his own name, so he will look for the box labelled name_inside_box[1]
.
We can find this box with the function which()
:
box_to_open <- which(label_on_box==name_inside_box[1])
box_to_open
## [1] 49
So he will look in the box box_to_open
where there is the name
name_inside_box[box_to_open]
## [1] "Kay"
Etc. etc.
We definitely need to make a loop here… and we also need a function that will encapsulate all our code, so that that we can easily do many simulations using replicate().
Here is our empty function. Note that from now on, most of the code below won’t return anything, but try to follow how the function is filled.
simu_game <- function(){
}
We start by initiating the game and strategy by making a random assignment of names in the boxes and a random assignment of imaginary labels on the boxes:
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
}
We will then need two for()
loops, one to loop over the prisoners and one to loop over the attempts
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
for(prisoner in list_prisoners){
for(attempt in 1:50){
}
}
}
At each attempt, we will look at the name that is inside the box that the prisoner needs to open, and check whether it is or not the name of the prisoner:
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
for(prisoner in list_prisoners){
for(attempt in 1:50){
name_found_in_box <- name_inside_box[box_to_open]
if(name_found_in_box==prisoner){
}
}
}
}
For now we didn’t define what is the box to be opened. At the beginning it is the box labelled with the name of the prisoner. After the first attempt, it is the box labelled with the name found in the previous box
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
for(prisoner in list_prisoners){
box_to_open <- which(label_on_box==prisoner)
for(attempt in 1:50){
name_found_in_box <- name_inside_box[box_to_open]
if(name_found_in_box==prisoner){
}
box_to_open <- which(label_on_box==name_found_in_box)
}
}
}
Let’s now add a variable called “name_found”. Let’s make it equal to FALSE for each prisoner at the beginning and let’s change it to TRUE if the name found in a box is the name of the prisoner:
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
for(prisoner in list_prisoners){
name_found <- FALSE
box_to_open <- which(label_on_box==prisoner)
for(attempt in 1:50){
name_found_in_box <- name_inside_box[box_to_open]
if(name_found_in_box==prisoner){
name_found <-TRUE
}
box_to_open <- which(label_on_box==name_found_in_box)
}
}
}
And let’s add one last variable called “success”. It will be equal to 1 at the very beginning and changed to 0 if a prisoner doesn’t find their name:
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
success <- 1
for(prisoner in list_prisoners){
name_found <- FALSE
box_to_open <- which(label_on_box==prisoner)
for(attempt in 1:50){
name_found_in_box <- name_inside_box[box_to_open]
if(name_found_in_box==prisoner){
name_found <-TRUE
}
box_to_open <- which(label_on_box==name_found_in_box)
}
if(name_found==FALSE){
success <- 0
}
}
}
Ok I think we’re done now. At the very end, we just need to return the value of “success”.
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
success <- 1
for(prisoner in list_prisoners){
name_found <- FALSE
box_to_open <- which(label_on_box==prisoner)
for(attempt in 1:50){
name_found_in_box <- name_inside_box[box_to_open]
if(name_found_in_box==prisoner){
name_found <-TRUE
}
box_to_open <- which(label_on_box==name_found_in_box)
}
if(name_found==FALSE){
success <- 0
}
}
return(success)
}
Let’s try our function:
simu_game()
## [1] 0
And now let’s replicate it a hundred times and make the sum of all the success
simulations <- replicate(100,simu_game())
sum(simulations)
## [1] 30
Out of our 100 simulations, the strategy was successful 30 times! If we compare it to the probability of success of the completely random strategy, that is a huge improvement!
QUESTION: Could you modify our simu_game function so that it takes the number of boxes that each prisoner can open as its argument. How often does the strategy seem to work if each prisoner can open 75 boxes (there are three ??? that you need to replace)
simu_game <- function(???){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
success <- 1
for(prisoner in list_prisoners){
name_found <- FALSE
box_to_open <- which(label_on_box==prisoner)
for(attempt in 1:???){
name_found_in_box <- name_inside_box[box_to_open]
if(name_found_in_box==prisoner){
name_found <-TRUE
}
box_to_open <- which(label_on_box==name_found_in_box)
}
if(name_found==FALSE){
success <- 0
}
}
return(success)
}
simulations <- replicate(100,simu_game(???))
sum(simulations)
BONUS QUESTION: We could be a bit more efficient by using while() loops instead of for() loops. Can you find how to modify the code to use while() loops? (Note: You just need conditions inside the parenthesis of while(). You do need to add a few lines though to make it work. these lines are added with ??? below)
simu_game <- function(){
name_inside_box <- sample(list_prisoners)
label_on_box <- sample(list_prisoners)
success <- 1
???<-1
while(??? & ???){
prisoner <- list_prisoners[i]
name_found <- FALSE
box_to_open <- which(label_on_box==prisoner)
???<-1
while(??? & ???){
name_found_in_box <- name_inside_box[box_to_open]
if(name_found_in_box==prisoner){
name_found <-TRUE
}
box_to_open <- which(label_on_box==name_found_in_box)
??? <-???+1
}
if(name_found==FALSE){
success <- 0
}
??? <- ???+1
}
return(success)
}
simulations <- replicate(100,simu_game())
sum(simulations)