# Bootstrapping, Resampling, Uncertainty Visualization

Watching Claus Wilke’s rstudio::conf2019 talk on his new ungeviz r package to visualize uncertainty with ggplot2, I realized that I only have a fuzzy idea what bootstrapping is. For some reason I didn’t find the wikipedia entry particularly illuminating for what I assumed to be a simple procedure. Also it is Sunday and I wanted to watch a video rather than read something.

## Resampling and Bootstrapping

Looking for someone to explain this to me in terms I can understand on a Sunday I found Christina Knudson’s youtube channel where she does exactly that in small, easily digestible chunks. Thanks! After watching this 5 part series on bootstrapping is turns out that bootstrapping really just means resampling a random sample with replacement at the same size as the original sample.

Or, in the words of the rsample help: A bootstrap sample is a sample that is the same size as the original data set that is made using replacement. This results in analysis samples that have multiple replicates of some of the original rows of the data.

Bootstrapping is used, for example, to estimate the uncertainty of statsitical properties of a sample without having to make a lot of assumptions about it (e.g. distribution type, sample size, etc).

There are lots of different r packages to do resampling and bootstrapping but it so happens that Alex Hayes, in his talk about the spectacular broom package at the same rstudio conference uses the rsample package from the tidymodels family to bootstrap. So this seems a good choice.

With this, creating a numer of bootstrap samples using the `mtcars` dataset is straightforward.

``````library(tidyverse)
library(rsample)

samples = 1000
boots = bootstraps(mtcars, times=samples)
``````## # A tibble: 6 x 2
##   splits          id
## * <list>          <chr>
## 1 <split [32/14]> Bootstrap0001
## 2 <split [32/13]> Bootstrap0002
## 3 <split [32/10]> Bootstrap0003
## 4 <split [32/12]> Bootstrap0004
## 5 <split [32/13]> Bootstrap0005
## 6 <split [32/8]>  Bootstrap0006``````

`boots` is a compact but somewhat daunting return value. It is a tibble that contains a list column (splits) and an id column. The id column simply holds an identifier for the bootstrap resample. `splits` is a list of length 4 where the first element points to the original data set, the second holds the indices of the resample and the rest are irrelevant (at this point). So the pointer to the original data frame in the first row could be accessed like this `boots\$splits[[1]]\$data`. This is really fast because (as I understand it) nothing is created or copied yet. We have a pointer to the original data and a vector of integers with the sampled indices.

To replicate Christina’s example in the video, I want to simply bootstrap to estimate the density distribution of the mean of the `wt` variable. To do this, I need to calculate the mean for `wt` for each bootstrap of the `1000` samples.

Because I want to do this for each resample and `boots\$splits` is a list column the purrr package will be helpful if I want to avoid writing lots of loops or, god forbid, use the `l/s/apply` functions which I have regrettably never become friends with. If you are unfamiliar with `purrr`, Charlotte Wickham’s tutorial is a great place to start, especially considering that I am only linking videos today.

The first step in figuring out how to map something with `purrr` is to figure it out for a single row. What I am doing now comes with no guarantees for efficiency. What I want is the actual data frame for the resample (not just an index vector and a pointer to the original data currently in `splits`). Conveniently, the somewhat cryptically named `analysis` function in the `rsample` package does just that.

``````bootstrap_sample = analysis(boots\$splits[[1]])
``````##                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 280C        17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 230         22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Toyota Corona    21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Lotus Europa     30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2``````

So calculating the mean of `wt`in the first resample would work simply like this:

``mean(analysis(boots\$splits[[1]])\$wt)``
``## [1] 2.929``

To generalize this to the `purrr` recipe for all resamples I simply replace the first element `[[1]]` by `.x`.

``````boots_mean = boots %>%
mutate(mn = map_dbl(splits, ~ mean(analysis(.x)\$wt)))

``````## # A tibble: 6 x 3
##   splits          id               mn
## * <list>          <chr>         <dbl>
## 1 <split [32/9]>  Bootstrap0001  3.46
## 2 <split [32/12]> Bootstrap0002  3.10
## 3 <split [32/13]> Bootstrap0003  3.55
## 4 <split [32/13]> Bootstrap0004  2.96
## 5 <split [32/8]>  Bootstrap0005  3.06
## 6 <split [32/13]> Bootstrap0006  3.14``````

Having added the column containing all the bootstrap sample means, I can use ggplot to approximate the sample distribution of the mean.

``````boots_mean %>%
ggplot(aes(mn)) +
geom_density() +
labs(title = "Approximated sampling distribution of mean wt",
x = "mean wt",
y = "frequency") +
theme_minimal()``````

For completeness sake I am also replicating Alex’ example here where he uses boostrapping and the `augment` function of the broom package to visualize the uncertainty of a model fit.

His example is the relationship between `mpg` and `wt` in `mtcars`.

``````mtcars %>%
ggplot(aes(wt,mpg)) +
geom_point() +
theme_minimal()``````

Again, first working things out for a single case to feed to `purrr`, we construct the helper function `fit_nls_on_bootstrap` that fits a nonlinear model to `boots`. We also map tidy model coefficients `k` and `b`.

``````fit_nls_on_bootstrap <- function(split) {
nls(
mpg ~ k / wt + b,
analysis(split),
start = c(k = 1, b = 0)
)
}

boots_fit = boots %>%
mutate(fit = map(splits, fit_nls_on_bootstrap),
coef_info = map(fit, tidy))

``````## # A tibble: 6 x 4
##   splits          id            fit       coef_info
## * <list>          <chr>         <list>    <list>
## 1 <split [32/14]> Bootstrap0001 <S3: nls> <tibble [2 × 5]>
## 2 <split [32/13]> Bootstrap0002 <S3: nls> <tibble [2 × 5]>
## 3 <split [32/10]> Bootstrap0003 <S3: nls> <tibble [2 × 5]>
## 4 <split [32/12]> Bootstrap0004 <S3: nls> <tibble [2 × 5]>
## 5 <split [32/13]> Bootstrap0005 <S3: nls> <tibble [2 × 5]>
## 6 <split [32/8]>  Bootstrap0006 <S3: nls> <tibble [2 × 5]>``````

This returns an even more frightening thing than before. Fortunately, we can use `unnest` to flatten this horrendous thing into something more sane.

``````boots_coefs = boots_fit %>%
unnest(coef_info)

``````## # A tibble: 6 x 6
##   id            term  estimate std.error statistic  p.value
##   <chr>         <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 Bootstrap0001 k        43.3       5.20      8.33 2.69e- 9
## 2 Bootstrap0001 b         5.72      1.93      2.96 5.92e- 3
## 3 Bootstrap0002 k        46.3       4.12     11.2  2.91e-12
## 4 Bootstrap0002 b         3.91      1.42      2.75 1.00e- 2
## 5 Bootstrap0003 k        38.9       3.93      9.89 5.95e-11
## 6 Bootstrap0003 b         6.64      1.45      4.59 7.31e- 5``````

And plot the histograms for k and b:

``````boots_coefs %>%
ggplot(aes(estimate)) +
geom_histogram(binwidth = 2) +
facet_wrap(~ term, scales = "free") +
labs(title = "Approximated sampling distributions of k and b",
x = "value",
y = "count") +
theme_minimal()``````

Finally, we use `broom::augment` to get fitted values from each model and plot the fits of the first 100 bootstrap samples.

``````library(broom)
boot_aug = boots_fit %>%
mutate(augmented = map(fit, augment)) %>%
unnest(augmented)

boot_aug %>%
filter(id %in% unique(boot_aug\$id)[1:100]) %>%
ggplot(aes(wt, mpg)) +
geom_point() +
geom_line(aes(y=.fitted, group=id),alpha=0.1) +
theme_minimal()``````

Sadly, it seems the ungeviz package has to wait for another day…

##### Peter Paul Pichler
###### Senior Scientist

My research interests include social metabolism, well-being on a finite planet and complex adaptive systems.