--- title: "The simTool package: Facilitate simulations" author: "Marsel Scheer" date: "`r Sys.Date()`" output: html_document: fig_caption: false number_sections: true toc: true toc_float: collapsed: false smooth_scroll: false vignette: > %\VignetteIndexEntry{The simTool package: Facilitate simulations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(comment = "") options(width = 100, max.print = 100) set.seed(123456) library(simTool) library(dplyr) library(tidyr) library(tibble) library(ggplot2) library(broom) library(boot) ``` # A first example 1. Generate data 2. Fit different models 3. Repeat step 1 and 2 three times 4. Summarize the results with respect to the different data generating functions and models by calculating mean and standard deviation for the corresponding model terms ```{r} regData <- function(n, SD) { x <- seq(0, 1, length = n) y <- 10 + 2 * x + rnorm(n, sd = SD) tibble(x = x, y = y) } eval_tibbles( expand_tibble(fun = "regData", n = 5L, SD = 1:2), expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")), post_analyze = broom::tidy, summary_fun = list(mean = mean, sd = sd), group_for_summary = "term", replications = 3 ) ``` # Introduction The purpose of the *simTool* package is to disengage the research from any kind of administrative source code which is usually an annoying necessity of a simulation study. This vignette will give an introduction into the *simTool* package mainly by examples of growing complexity. The workhorse is the function *eval_tibbles*. Every parameter of this function will be discussed briefly and the functionality is illustrated by at least one example. # Workflow The workflow is quite easy and natural. One defines two *data.frames* (or *tibbles*), the first one represents the functions that generate the data sets and the second one represents the functions that analyze the data. These two *data.frames* are passed to *eval_tibbles* which conducts the simulation. Afterwards, the results can nicely be displayed as a *data.frame*. # Defining the *data.frames* for data generation and analysis There are 3 rules: * the first column ( a *character* vector) defines the functions to be called * the other columns are the parameters that are passed to function specified in the first column * The entry *NA* will not be passed to the function specified in the first column. The function *expand_tibble* is a convenient function for defining such *data.frames*. We now define the data generation functions for our first simulation. ```{r} print(dg <- dplyr::bind_rows( expand_tibble(fun = "rexp", n = c(10L, 20L), rate = 1:2), expand_tibble(fun = "rnorm", n = c(10L, 20L), mean = 1:2) )) ``` This *data.frame* represents 8 *R*-functions. For instance, the second row represents a function that generates 20 exponential distributed random variables with *rate* 1. Since *mean=NA* in the second row, this parameter is not passed to *rexp*. Similar, we define the *data.frame* for data analyzing functions. ```{r} print(pg <- dplyr::bind_rows( expand_tibble(proc = "min"), expand_tibble(proc = "mean", trim = c(0.1, 0.2)) )) ``` Hence, this *data.frame* represents 3 *R*-functions i.e. calculating the minimum and the arithmetic mean with *trim=0.1* and *trim=0.2*. # The workhorse *eval_tibbles* The workhorse *eval_tibbles* has the following simplified pseudo code: ```{r, eval=FALSE} 1. convert dg to R-functions {g_1, ..., g_k} 2. convert pg to R-functions {f_1, ..., f_L} 3. 
4. append dg and pg to the result object
5. t1 = current.time()
6. for g in {g_1, ..., g_k}
7. for r in 1:replications (optionally in a parallel manner)
8. data = g()
9. for f in {f_1, ..., f_L}
10. append f(data) to the result object (optionally apply a post-analyze-function)
11. optionally append data to the result object
12. optionally summarize the result object over all replications but separately for f_1, ..., f_L (and optional group variables)
13. t2 = current.time()
14. Estimate the number of replications per hour from t1 and t2
```

The object returned by *eval_tibbles* is a *list* of class *eval_tibbles*.

```{r}
dg <- expand_tibble(fun = "rnorm", n = 10, mean = 1:2)
pg <- expand_tibble(proc = "min")
eg <- eval_tibbles(data_grid = dg, proc_grid = pg, replications = 2)
eg
```

As you can see, the function always estimates the number of replications that can be done in one hour.

## Parameter *replications*

Of course, this parameter controls the number of replications conducted.

```{r}
eg <- eval_tibbles(data_grid = dg, proc_grid = pg, replications = 3)
eg
```

## Parameter *discard_generated_data*

*eval_tibbles* saves ALL generated data sets.

```{r}
eg <- eval_tibbles(data_grid = dg, proc_grid = pg, replications = 1)
eg$simulation
eg$generated_data
```

In general, it is sometimes very handy to have the data sets in order to investigate unusual or unexpected results. But saving the generated data sets can consume a lot of memory. Saving the generated data sets can be switched off by setting *discard_generated_data = TRUE*. See line 11 of the pseudo code.

## Parameter *summary_fun*

As stated in line 12 of the pseudo code, we can summarize the result objects over all replications, but separately for all data analyzing functions.

```{r}
dg <- expand_tibble(fun = "runif", n = c(10, 20, 30))
pg <- expand_tibble(proc = c("min", "max"))
eval_tibbles(
  data_grid = dg, proc_grid = pg,
  replications = 1000,
  summary_fun = list(mean = mean)
)
eval_tibbles(
  data_grid = dg, proc_grid = pg,
  replications = 1000,
  summary_fun = list(mean = mean, sd = sd)
)
```

Note, by specifying the parameter *summary_fun* the generated data sets and all individual result objects are discarded. In this example we discard $3 \times 1000$ data sets and $3 \times 1000 \times 2$ individual result objects.

## Parameter *post_analyze*

Sometimes the analyzing functions return quite complicated objects, as in Section *A first example*.

```{r}
eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  replications = 2
)
```

The parameter *post_analyze* (if specified) is applied directly after a result object has been generated (see line 10 of the pseudo code).
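For illustration, a minimal sketch that reuses *regData* and *broom::tidy* from the first example; since no *summary_fun* is requested here, the tidied coefficients are kept for every single replication:

```{r}
eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = broom::tidy, # applied to every fitted lm object
  replications = 2
)
```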
Note, *purrr::compose* can be very handy if your post-analyzing-function can be composed of a few single functions:

```{r}
eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(
    function(mat) mat["(Intercept)", "Estimate"],
    coef, summary.lm
  ),
  replications = 2
)
```

## Parameter *group_for_summary*

Sometimes the result object is a *data.frame* itself, for instance:

```{r}
preserve_rownames <- function(mat) {
  rn <- rownames(mat)
  ret <- tibble::as_tibble(mat)
  ret$term <- rn
  ret
}

eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(preserve_rownames, coef, summary),
  replications = 3
)
```

In order to summarize over the replications it is necessary to additionally group the calculations with respect to another variable. This variable can be passed to *group_for_summary*:

```{r}
eval_tibbles(
  expand_tibble(fun = "regData", n = 5L, SD = 1:2),
  expand_tibble(proc = "lm", formula = c("y~x", "y~I(x^2)")),
  post_analyze = purrr::compose(preserve_rownames, coef, summary),
  summary_fun = list(mean = mean, sd = sd),
  group_for_summary = "term",
  replications = 3
)
```

## Parameters *ncpus* and *cluster_seed*

By specifying *ncpus* larger than 1, a cluster object is created for the user and passed to the parameter *cluster* discussed in the next section.

```{r}
eval_tibbles(
  data_grid = dg, proc_grid = pg,
  replications = 10,
  ncpus = 2,
  summary_fun = list(mean = mean)
)
```

As stated in line 7 of the pseudo code, the replications are parallelized. In our case, this means that every CPU conducts roughly 5 replications. The parameter *cluster_seed* must be an integer vector of length 6 and serves the same purpose as the function *set.seed*. By default, *cluster_seed* equals *rep(12345, 6)*. Note, in order to reproduce the simulation study it is also necessary that *ncpus* does not change.

## Parameter *cluster*

The user can also create a cluster on their own. This makes it possible to distribute the replications over different computers in a network.

```{r}
library(parallel)
cl <- makeCluster(rep("localhost", 2), type = "PSOCK")
eval_tibbles(
  data_grid = dg, proc_grid = pg,
  replications = 10,
  cluster = cl,
  summary_fun = list(mean = mean)
)
stopCluster(cl)
```

As you can see, our cluster consists of 2 workers. Hence, this reproduces the results from the last code chunk above. Further note, if the user starts the cluster, the user also has to stop the cluster. A cluster that is created within *eval_tibbles* by specifying *ncpus* is also stopped within *eval_tibbles*.

## Parameters *cluster_libraries* and *cluster_global_objects*

A newly created cluster is "empty". Hence, if the simulation study requires libraries or objects from the global environment, they must be transferred to the cluster. Let's look at a standard example from the *boot* package.

```{r}
library(boot)
ratio <- function(d, w) sum(d$x * w) / sum(d$u * w)
city.boot <- boot(city, ratio,
  R = 999, stype = "w",
  sim = "ordinary"
)
boot.ci(city.boot,
  conf = c(0.90, 0.95),
  type = c("norm", "basic", "perc", "bca")
)
```

The following data generating function is extremely boring because it always returns the data set *city* from the library *boot*.
```{r}
returnCity <- function() {
  city
}
bootConfInt <- function(data) {
  city.boot <- boot(data, ratio,
    R = 999, stype = "w",
    sim = "ordinary"
  )
  boot.ci(city.boot,
    conf = c(0.90, 0.95),
    type = c("norm", "basic", "perc", "bca")
  )
}
```

The function *ratio* currently exists only in our global environment. Furthermore, we had to load the *boot* package. Hence, we load the *boot* package on the workers by setting *cluster_libraries = c("boot")* and transfer the function *ratio* by setting *cluster_global_objects = c("ratio")*.

```{r}
dg <- expand_tibble(fun = "returnCity")
pg <- expand_tibble(proc = "bootConfInt")
eval_tibbles(dg, pg,
  replications = 10, ncpus = 2,
  cluster_libraries = c("boot"),
  cluster_global_objects = c("ratio")
)
```

Of course, it is possible to set *cluster_global_objects = ls()*, but then all objects from the global environment are transferred to all workers.

## Parameter *envir*

In a first step, the function *eval_tibbles* generates function calls from *data_grid* and *proc_grid*. This is achieved by applying the *R*-function *get*. By default, *envir = globalenv()* and thus *get* searches the global environment of the current R session. The following example shows how to use the parameter *envir*.

```{r}
# masking summary from the base package
summary <- function(x) tibble(sd = sd(x))
g <- function(x) tibble(q0.1 = quantile(x, 0.1))
someFunc <- function() {
  summary <- function(x) tibble(sd = sd(x), mean = mean(x))
  dg <- expand_tibble(fun = "runif", n = 100)
  pg <- expand_tibble(proc = c("summary", "g"))
  # the standard is to use the global
  # environment, hence summary defined outside
  # of someFunc() will be used
  print(eval_tibbles(dg, pg))
  cat("--------------------------------------------------\n")
  # will use the locally defined summary, but g
  # from the global environment, because
  # g is not locally defined.
  print(eval_tibbles(dg, pg, envir = environment()))
}
someFunc()
```

## .truth-functionality

Sometimes it is handy to access, in the (post) data analyzing phase, the parameter constellation that was used during the data generation. Of course, one could write wrapper functions for every data generating function and append the parameter constellation as attributes to the data set, but the purpose of this package is to reduce such administrative source code. Hence, if the (post) data analyzing function has an argument *.truth*, then *eval_tibbles* will manage that hand-over. A brief example should explain this. Suppose we want to estimate the bias of the empirical quantile estimator if the data is normally distributed.

```{r}
dg <- expand_tibble(fun = c("rnorm"), mean = c(1, 1000), sd = c(1, 10), n = c(10L, 100L))
pg <- expand_tibble(proc = "quantile", probs = 0.975)
post_ana <- function(q_est, .truth) {
  tibble::tibble(bias = q_est - stats::qnorm(0.975, mean = .truth$mean, sd = .truth$sd))
}
eval_tibbles(dg, pg,
  replications = 10^3, discard_generated_data = TRUE, ncpus = 2,
  post_analyze = post_ana,
  summary_fun = list(mean = mean)
)
```

If we want to do the analysis for different distributions, we could modify our post data analyzing function, but we can also simply add a *.truth*-column to the data generating grid.
In this case, the information from the *.truth*-column is directly passed to the *.truth*-parameter:

```{r}
dg <- dplyr::bind_rows(
  expand_tibble(fun = c("rnorm"), mean = 0, n = c(10L, 100L), .truth = qnorm(0.975)),
  expand_tibble(fun = c("rexp"), rate = 1, n = c(10L, 100L), .truth = qexp(0.975, rate = 1)),
  expand_tibble(fun = c("runif"), max = 2, n = c(10L, 100L), .truth = qunif(0.975, max = 2))
)
pg <- expand_tibble(proc = "quantile", probs = 0.975)
post_ana <- function(q_est, .truth) {
  ret <- q_est - .truth
  names(ret) <- "bias"
  ret
}
eval_tibbles(dg, pg,
  replications = 10^3, discard_generated_data = TRUE, ncpus = 2,
  post_analyze = post_ana,
  summary_fun = list(mean = mean)
)
```

In the same fashion one could write a data analyzing function with a parameter *.truth*. To go even a step further, we store the analytic quantile function in the *.truth* column:

```{r}
dg <- dplyr::bind_rows(
  expand_tibble(
    fun = c("rnorm"), mean = 0, n = c(10L, 1000L),
    .truth = list(function(prob) qnorm(prob, mean = 0))
  ),
  expand_tibble(
    fun = c("rexp"), rate = 1, n = c(10L, 1000L),
    .truth = list(function(prob) qexp(prob, rate = 1))
  ),
  expand_tibble(
    fun = c("runif"), max = 2, n = c(10L, 1000L),
    .truth = list(function(prob) qunif(prob, max = 2))
  )
)
bias_quantile <- function(x, prob, .truth) {
  est <- quantile(x, probs = prob)
  ret <- est - .truth[[1]](prob)
  names(ret) <- "bias"
  ret
}
pg <- expand_tibble(proc = "bias_quantile", prob = c(0.9, 0.975))
eval_tibbles(dg, pg,
  replications = 10^3, discard_generated_data = TRUE, ncpus = 1,
  summary_fun = list(mean = mean)
)
```

But one should keep in mind that if one calculates the quantile during the (post) analyzing phase, this happens on the replication level. To be more precise, let's look at an excerpt of the pseudo code from the beginning of the vignette:

```{r, eval = FALSE}
6. for g in {g_1, ..., g_k}
7. for r in 1:replications (optionally in a parallel manner)
8. data = g()
9. for f in {f_1, ..., f_L}
10. append f(data) to the result object (optionally apply a post-analyze-function)
```

No matter whether one extends the data analyzing functions *f_1, ..., f_L* or the post-analyze-function with an argument *.truth*, the calculations are made for every single replication during step 10. Hence, the operations are not vectorized!

# Some Examples

Note, the following code examples will use more computational resources.
In order to prevent them from being checked/executed on the CRAN check farm, they are only evaluated if the environment variable *NOT_CRAN* is set to "true":

```{r}
EVAL <- FALSE
if (Sys.getenv("NOT_CRAN") == "true") {
  EVAL <- TRUE
}
```

## Sampling distribution of mean and median for normally and exponentially distributed data

First we define how the data is generated, where the sample size should be 10 and 100:

```{r, eval = EVAL}
dg <- dplyr::bind_rows(
  expand_tibble(fun = c("rnorm"), mean = 1, n = c(10L, 100L)),
  expand_tibble(fun = c("rexp"), rate = 1, n = c(10L, 100L))
)
dg
```

Afterwards we define how we want to analyze the data:

```{r, eval = EVAL}
pg <- expand_tibble(proc = c("mean", "median"))
pg
```

Finally, we conduct the simulation and visualize the results:

```{r, eval = EVAL}
et <- eval_tibbles(dg, pg, replications = 10^4, ncpus = 2)
et$simulation %>%
  ggplot(aes(x = results, color = interaction(fun, n), fill = interaction(fun, n))) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ proc)
```

## Comparing bootstrap confidence intervals with the classical studentized interval

We want to compare the confidence intervals that are generated by *boot::boot.ci()* and *stats::t.test()*. Unfortunately, *boot::boot.ci()* cannot be applied directly to the generated data sets. Therefore, we write a new function:

```{r, eval = EVAL}
bootstrap_ci <- function(x, conf.level, R = 999) {
  b <- boot::boot(x, function(d, i) {
    n <- length(i)
    c(
      mean = mean(d[i]),
      variance = (n - 1) * var(d[i]) / n^2
    )
  }, R = R)
  boot::boot.ci(b, conf = conf.level, type = "all")
}
```

Furthermore, *boot::boot.ci()* in general returns more than one confidence interval, and the structures returned by *boot::boot.ci()* and *stats::t.test()* are also very different. One solution would be to write a function *t_test()* that calls *stats::t.test()* and modifies the returned object, and additionally to modify the function *bootstrap_ci*, so that both functions return objects with a unified structure. But instead of that we will implement a function that is later on passed to the argument *post_analyze* of *eval_tibbles*:

```{r, eval = EVAL}
post_analyze <- function(o, .truth) {
  if (class(o) == "htest") {
    # post-process the object returned by t.test
    ci <- o$conf.int
    return(tibble::tibble(
      type = "t.test",
      aspect = c("covered", "ci_length"),
      value = c(ci[1] <= .truth && .truth <= ci[2], ci[2] - ci[1])
    ))
  } else if (class(o) == "bootci") {
    # post-process the object returned by boot.ci
    method = c("normal", "basic", "student", "percent", "bca")
    ret = o[method]
    lower = unlist(purrr::map(ret, ~dplyr::nth(.x, -2)))
    upper = sapply(ret, dplyr::last)
    type = paste("boot", method, sep = "_")
    return(
      dplyr::bind_rows(
        tibble::tibble(
          type = type,
          aspect = "covered",
          value = as.integer(lower <= .truth & .truth <= upper)
        ),
        tibble::tibble(
          type = type,
          aspect = "ci_length",
          value = upper - lower
        )
      )
    )
  }
}
```

As you can see, the returned objects are tibbles with more than one row. Summarizing the data over all replications will in general use the variables *type* and *aspect* as grouping variables. This can be achieved by using the parameter *group_for_summary* of *eval_tibbles*.
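Before wiring *post_analyze* into the simulation, it can be useful to sanity-check it on a single result object. A minimal sketch, using an arbitrary normal sample and an assumed true mean of 0 purely for illustration:

```{r, eval = EVAL}
# sanity check: post-process one t.test object with an assumed true mean of 0
post_analyze(t.test(rnorm(10), conf.level = 0.95), .truth = 0)
```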
### Define and run the simulation

We want to generate normally, uniformly, and exponentially distributed data:

```{r, eval = EVAL}
dg <- dplyr::bind_rows(
  simTool::expand_tibble(fun = "rnorm", n = 10L, mean = 0, sd = sqrt(3), .truth = 0),
  simTool::expand_tibble(fun = "runif", n = 10L, max = 6, .truth = 3),
  simTool::expand_tibble(fun = "rexp", n = 10L, rate = 1 / sqrt(3), .truth = sqrt(3))
)
dg
```

and apply our functions that calculate the confidence intervals to it, using two different confidence levels:

```{r, eval = EVAL}
pg <- simTool::expand_tibble(
  proc = c("t.test", "bootstrap_ci"),
  conf.level = c(0.9, 0.95)
)
pg
```

Note that the structures of the returned objects are quite different, which is addressed by our function *post_analyze*. The variables *type* and *aspect* that are created by *post_analyze* are used to distinguish the different confidence intervals. Since these variables are part of the result objects, *eval_tibbles* assumes that these variables are results. In order to summarize the results (calculating the mean) over all replications correctly, we need to tell *eval_tibbles* that *type* and *aspect* are additional group variables:

```{r, eval = EVAL}
et <- eval_tibbles(dg, pg,
  replications = 10^3,
  ncpus = 2,
  cluster_global_objects = "post_analyze",
  post_analyze = post_analyze,
  summary_fun = list(mean = mean),
  group_for_summary = c("aspect", "type")
)
et
```

Finally, we can visualize the summarized results:

```{r, eval = EVAL, fig.width=13}
et$simulation %>%
  ggplot(aes(x = fun, y = value, group = type, fill = type, label = round(value, 2))) +
  geom_col(position = "dodge") +
  geom_label(position = position_dodge(0.9), size = 3) +
  theme(legend.position = "bottom") +
  facet_grid(aspect ~ conf.level, scales = "free")
```

### A different implementation

Here we briefly implement the simulation differently, by using data analyzing functions that return objects with a unified structure:

```{r, eval = EVAL}
t_test <- function(x, conf.level) {
  tt <- t.test(x, conf.level = conf.level)
  # unify return
  tibble::tibble(type = "t.test", lower = tt$conf.int[1], upper = tt$conf.int[2])
}

bootstrap_ci <- function(x, conf.level, R = 999) {
  b <- boot::boot(x, function(d, i) {
    n <- length(i)
    c(
      mean = mean(d[i]),
      variance = (n - 1) * var(d[i]) / n^2
    )
  }, R = R)
  ci <- boot::boot.ci(b, conf = conf.level, type = "all")
  method <- c("normal", "basic", "student", "percent", "bca")
  ret <- ci[method]
  lower <- unlist(purrr::map(ret, ~dplyr::nth(.x, -2)))
  upper <- sapply(ret, dplyr::last)
  type <- paste("boot", method, sep = "_")
  # unify return
  tibble::tibble(type = type, lower = lower, upper = upper)
}

dg <- dplyr::bind_rows(
  simTool::expand_tibble(fun = "rnorm", n = 10L, mean = 0, sd = sqrt(3), .truth = 0),
  simTool::expand_tibble(fun = "runif", n = 10L, max = 6, .truth = 3),
  simTool::expand_tibble(fun = "rexp", n = 10L, rate = 1 / sqrt(3), .truth = sqrt(3))
)

pg <- simTool::expand_tibble(
  proc = c("t_test", "bootstrap_ci"),
  conf.level = c(0.9, 0.95)
) %>%
  mutate(R = ifelse(proc == "bootstrap_ci", 100, NA))

et <- eval_tibbles(dg, pg,
  replications = 10^2,
  ncpus = 2
)
et
```

```{r, eval = EVAL}
grps <- et$simulation %>%
  select(-replications) %>%
  select(fun:type) %>%
  names()

et$simulation %>%
  mutate(
    covered = lower <= .truth & .truth <= upper,
    ci_length = upper - lower
  ) %>%
  group_by(.dots = grps) %>%
  summarise(coverage = mean(covered), ci_length = mean(ci_length))
```
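Note, *group_by(.dots = ...)* relies on an older dplyr interface that has since been deprecated. Assuming a recent dplyr version (>= 1.0) is available, the same grouping can be expressed with *across()* and *all_of()*; a minimal sketch:

```{r, eval = EVAL}
# equivalent grouping with the current dplyr interface
et$simulation %>%
  mutate(
    covered = lower <= .truth & .truth <= upper,
    ci_length = upper - lower
  ) %>%
  group_by(across(all_of(grps))) %>%
  summarise(coverage = mean(covered), ci_length = mean(ci_length), .groups = "drop")
```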