The non-parametric bootstrap was my first love. I was lost in a muddy swamp of zs, ts and ps when I first saw her. Conceptually beautiful, simple to implement, easy to understand (I thought back then, at least). And when she whispered in my ear, “I make no assumptions regarding the underlying distribution”, I was in love. This love lasted roughly a year, but the more I learned about statistical modeling, especially the Bayesian kind, the more suspect I found the bootstrap. It is most often explained as a procedure, not a model, but what are you actually assuming when you “sample with replacement”? And what is the underlying model?
Still, the bootstrap produces something that looks very much like draws from a posterior and there are papers comparing the bootstrap to Bayesian models (for example, Alfaro et al., 2003). Some also wonder which alternative is more appropriate: Bayes or bootstrap? But these are not opposing alternatives, because the non-parametric bootstrap is a Bayesian model.
In this post I will show how the classical non-parametric bootstrap of Efron (1979) can be viewed as a Bayesian model. I will start by introducing the so-called Bayesian bootstrap and then I will show three ways the classical bootstrap can be considered a special case of the Bayesian bootstrap. So basically this post is just a rehash of Rubin’s The Bayesian Bootstrap from 1981. Some points before we start:
Just because the bootstrap is a Bayesian model doesn’t mean it’s not also a frequentist model. It’s just different points of view.
Just because it’s Bayesian doesn’t necessarily mean it’s any good. “We used a Bayesian model” is as much a quality assurance as “we used probability to calculate something”. However, writing out a statistical method as a Bayesian model can help you understand when that method could work well and how it can be made better (it sure helps me!).
Just because the bootstrap is sometimes presented as making almost no assumptions, doesn’t mean it does. Both the classical non-parametric bootstrap and the Bayesian bootstrap make very strong assumptions which can be pretty sensible and/or weird depending on the situation.
Let’s start by describing the Bayesian bootstrap of Rubin (1981), which the classical bootstrap can be seen as a special case of. Let $d = (d_1, \ldots, d_K)$ be a vector of all the possible values (categorical or numerical) that the data $x = (x_1, \ldots, x_N)$ could possibly take. It might sound strange that we should be able to enumerate all the possible values the data can take. What if the data is measured on a continuous scale? But, as Rubin writes, “[this] is no real restriction because all data as observed are discrete”. Then, each $x_i$ is modeled as being drawn from the $K$ possible values in $d$, where the probability of $x_i$ receiving a certain value from $d$ depends on a vector of probabilities $\pi = (\pi_1, \ldots, \pi_K)$, where $\pi_1$ is the probability of drawing $d_1$. Using a categorical distribution, we can write it like this:
$$\begin{align} &\begin{array}{l} x_i = d_{k_i} \\ k_i \sim \text{Categorical}(\pi) \end{array} \bigg\} ~ \text{for $i$ in $1..N$} \end{align}$$
Now we only need a prior distribution over the $\pi$s for the model to be complete. That distribution is the Dirichlet distribution, which is a distribution over proportions. That is, the Dirichlet is a multivariate distribution with support over vectors of real numbers between 0.0 and 1.0 that together sum to 1.0. A 2-dimensional Dirichlet is the same as a Beta distribution and is defined on the line where $\pi_1 + \pi_2$ is always 1, the 3-dimensional Dirichlet is defined on the triangle where $\pi_1 + \pi_2 + \pi_3$ is always 1, and so on. A $K$-dimensional Dirichlet has $K$ parameters $\alpha = (\alpha_1, \ldots, \alpha_K)$ where the expected proportion of, say, $\pi_1$ is $\alpha_1 / \sum \alpha_{1..K}$. The higher the sum of all the $\alpha$s, the more the distribution concentrates on the expected proportion. If instead $\sum \alpha_{1..K}$ approaches 0, the distribution concentrates on points with a few large proportions. This behavior is illustrated below using draws from a 3-dimensional Dirichlet where $\alpha$ is set to different values and where red means higher density:
When $\alpha = (1,1,1)$ the distribution is uniform: any combination of $(\pi_1,\pi_2,\pi_3)$ that forms a proportion is equally likely. But as $\alpha \rightarrow (0, 0, 0)$ the density is “pushed” towards the edges of the triangle, making proportions like $(0.33, 0.33, 0.33)$ very unlikely in favor of proportions like $(0.9, 0.1, 0.0)$ and $(0.0, 0.5, 0.5)$. We want a Dirichlet distribution of this latter type, a distribution that puts most of the density over combinations of proportions where most of the $\pi$s are zero and only a few $\pi$s are large. Using this type of prior will make the model consider it very likely a priori that most of the data $x$ is produced from a small number of the possible values $d$. And in the limit when $\alpha = (0, \ldots, 0)$ the model will consider it impossible that $x$ takes on more than one value in $d$ unless there is data that shows otherwise. So using a $\text{Dirichlet}(0_1, \ldots, 0_K)$ over $\pi$ achieves the hallmark of the bootstrap, that the model only considers values already seen in the data as possible. The full model is then:
$$\begin{align} &\begin{array}{l} x_i \leftarrow d_{k_i} \\ k_i \sim \text{Categorical}(\pi) \end{array} \bigg\} ~ \text{for $i$ in $1..N$} \\ &\pi \sim \text{Dirichlet}(0_1, \ldots, 0_K) \end{align}$$
So is this a reasonable model? Surprise, surprise: It depends. For binary data, $d = (0, 1)$, the Bayesian bootstrap is the same as assuming $x_i \sim \text{Bernoulli}(p)$ with an improper $p \sim \text{Beta(0,0)}$ prior. A completely reasonable model, if you’re fine with the non-informative prior. Similarly it reduces to a categorical model when $d$ is a set of categories. For integer data, like count data, the Bayesian bootstrap implies treating each possible value as its own isolated category, disregarding any prior information regarding a relation between the values (such as that three eggs are more than two eggs, but less than four). For continuous data the assumptions of the bootstrap feel a bit weird because we are leaving out obvious information: That the data is actually continuous and that a data point of, say, 3.1 should inform the model that values that are close (like 3.0 and 3.2) are also more likely.
If you don’t include useful prior information in a model you will have to make up for it with information from another source in order to get equally precise estimates. This source is often the data, which means you might need relatively more data when using the bootstrap. You might say that the bootstrap makes very naïve assumptions, or perhaps very conservative assumptions, but to say that the bootstrap makes no assumptions is wrong. It makes really strong assumptions: The data is discrete and values not seen in the data are impossible.
So let’s take the Bayesian bootstrap for a spin by using the cliché example of inferring a mean. I’ll compare it with using the classical non-parametric bootstrap and a Bayesian model with flat priors that assumes that the data is normally distributed. To implement the Bayesian bootstrap I’m using this handy script published at R-snippets.
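For readers who want to try this without the linked script, here is a minimal sketch in base R. It is not the R-snippets script, and the data `x` is made up for illustration. It relies on the fact, discussed further down in this post, that the Bayesian bootstrap weights can be drawn from a flat Dirichlet distribution, which is done here by normalizing independent Exponential(1) draws.

```r
# A sketch of the Bayesian and the classical bootstrap of a mean (base R only).
set.seed(123)
x <- rnorm(30, mean = 10, sd = 2)  # made-up example data
n_draws <- 4000

# Bayesian bootstrap: weights ~ Dirichlet(1, ..., 1), obtained by
# normalizing independent Exponential(1) draws.
bayes_boot_mean <- replicate(n_draws, {
  w <- rexp(length(x))
  w <- w / sum(w)
  sum(w * x)  # the weighted mean under this draw of weights
})

# Classical bootstrap: resample with replacement and take the mean.
classic_boot_mean <- replicate(n_draws, mean(sample(x, replace = TRUE)))

# Compare the two resulting distributions.
quantile(bayes_boot_mean, c(0.025, 0.5, 0.975))
quantile(classic_boot_mean, c(0.025, 0.5, 0.975))
```

With 30 data points the two sets of quantiles should come out close to each other, which is in line with the comparison described next.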
Compared to the “gold standard” of the Normal Bayesian model both the classical and the Bayesian bootstrap have shorter tails, otherwise they are pretty spot on. Note also that the two bootstrap distributions are virtually identical. Here, and in the model definition, the data $x_i$ was one-dimensional, but it’s easy to generalize to bi-variate data by replacing $x_i$ with $(x_{1i}, x_{2i})$ (and similar for multivariate data).
I feel that, in the case of continuous data, the specification of the Bayesian bootstrap as given above is a bit strange. Sure, “all data as observed are discrete”, but it would be nice to have a formulation of the Bayesian bootstrap that fits more naturally with continuous data.
The Bayesian bootstrap can be characterized differently than the version given by Rubin (1981). The two versions result in the exact same inferences but I feel that the second version given below is a more natural characterization when the data is assumed continuous. It is also very similar to a Dirichlet process which means that the connection between the Bayesian bootstrap and other Bayesian non-parametric methods is made clearer.
This second characterization requires two more distributions to get going: The Dirac delta distribution and the geometric distribution. The Dirac delta distribution is so simple that it almost doesn’t feel like a distribution at all. It is written $x \sim \delta(x_0)$ and is a probability distribution with zero density except at $x_0$. Assuming, say, $x \sim \delta(5)$ is basically the same as saying that $x$ is fixed at 5. The delta distribution can be seen as a Normal distribution where the standard deviation is approaching zero, as this animation from Wikipedia nicely demonstrates:
The geometric distribution is the distribution over how many “failures” there are in a number of Bernoulli trials before there is a success, where the one parameter is $p$, the relative frequency of success. Here are some geometric distributions with different $p$:
We’ll begin this second version of the Bayesian bootstrap by assuming that the data $x = (x_1, \ldots, x_N)$ is distributed as a mixture of $\delta$ distributions with $M$ components, where $\mu = (\mu_1, \ldots, \mu_M)$ are the parameters of the $\delta$s. The $\mu$s are given a flat $\text{Uniform}(-\infty,\infty)$ distribution and the mixture probabilities $\pi = (\pi_1, \ldots, \pi_M)$ are again given a $\text{Dirichlet}(0_1, \ldots,0_M)$ distribution. Finally, $M$, the number of component distributions in the mixture, is given a $\text{Geometric}(p)$ distribution where $p$ is close to 1. Here is the full model:
$$\begin{align} &\begin{array}{l} x_i \sim \delta(\mu_{k_i}) \\ k_i \sim \text{Categorical}(\pi) \end{array} \bigg\} &\text{for $i$ in $1..N$} \\ &\mu_j \sim \text{Uniform}(-\infty,\infty) &\text{for $j$ in $1..M$} \\ &\pi \sim \text{Dirichlet}(0_1, \ldots, 0_M) & ~ \\ &M \sim \text{Geometric}(p) &\text{with $p$ close to 1} \end{align}$$
There is more going on in this version of the Bayesian bootstrap, but what the model is basically assuming is this: The data comes from a limited number of values (the $\mu$s) where each value can be anywhere between $-\infty$ and $\infty$. A data point ($x_i$) comes from a specific value ($\mu_j$) with a probability ($\pi_j$), but what these probabilities $(\pi_1, \ldots, \pi_M)$ are is very uncertain (due to the Dirichlet prior). The only part that remains is how many values ($M$) the data is generated from. This is governed by the Geometric distribution where $p$ can be seen as the probability that the current number of values ($M$) is the maximum number needed. When $p \approx 1$ the number of values will be kept to a minimum unless there is overwhelming evidence that another value is needed. But since the data is distributed as a pointy Dirac $\delta$ distribution, a set of data of, say, $x = (3.4, 1.2, 4.1)$ is overwhelming evidence that $M$ is at least 3, as there is no other possible way $x$ could take on three different values.
So, I like this characterization of the Bayesian bootstrap because it connects to Bayesian non-parametrics and it is easier for me to see how it can be extended. For example, maybe you think the Dirac $\delta$ distribution is unreasonably sharply peaked? Then just swap it for a distribution that better matches what you know about the data (a Normal distribution comes to mind). Do you want to include prior information regarding the location of the data? Then swap the $\text{Uniform}(-\infty,\infty)$ for something more informative. Is it reasonable to assume that there are between five and ten clusters / component distributions? Then replace the geometric distribution with a $\text{Discrete-Uniform}(5, 10)$. And so on. If you want to go down this path you should read up on Bayesian non-parametrics (for example, this tutorial). Actually, for those of you who are already into this, a third characterization of the Bayesian bootstrap is as a Dirichlet process with $\alpha \rightarrow 0$ (Clyde and Lee, 2001).
Again, the bootstrap is a very “naïve” model. A personification of the bootstrap would be a savant learning about people’s lengths, being infinitely surprised by each new length she observed. “Gosh! I knew people can be 165 cm or 167 cm, but look at you, you are 166 cm, who knew something like that was even possible?!”. However, while it will take many many examples, Betty Bootstrap will eventually get a pretty good grip on the distribution of lengths in the population. Now that I’ve written at length about the Bayesian bootstrap, what is its relation with the classical non-parametric bootstrap?
I can think of three ways the classical bootstrap of Efron (1979) can be considered a special case of the Bayesian bootstrap. Just because the classical bootstrap can be considered a special case doesn’t mean it is necessarily “better” or “worse”. But, from a Bayesian perspective, I don’t see how the classical bootstrap has any advantage over the Bayesian (except for being computationally more efficient, easier to implement and perhaps better known by the target audience of the analysis…). So in what way is the classical bootstrap a special case?
When implemented by Monte Carlo methods, both the classical and the Bayesian bootstrap produce draws that can be interpreted as probability weights over the input data. The classical bootstrap does this by “sampling with replacement”, which is another way of saying that the weights $\pi = (\pi_1, \ldots, \pi_N)$ for the $N$ data points are created by drawing counts $c = (c_1, \ldots, c_N)$ from a $\text{Multinomial}(p_1, \ldots, p_N)$ distribution with $N$ trials where all $p$s = $1/N$. Each count is then normalized to create the weights: $\pi_i = c_i / N$. For example, say we have five data points, we draw from a Multinomial and get $(0, 2, 2, 1, 0)$ which we normalize by dividing by five to get the weights $(0, 0.4, 0.4, 0.2, 0)$. With the Bayesian bootstrap, the $N$ probability weights can instead be seen as being drawn from a flat $\text{Dirichlet}(1_1, \ldots, 1_N)$ distribution. This follows directly from the model definition of the Bayesian bootstrap and an explanation for why this is the case can be found in Rubin (1981). For example, for our five data points we could get weights $(0, 0.4, 0.4, 0.2, 0)$ or $(0.26,0.1,0.41,0.01,0.22)$.
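The two weighting schemes can be sketched in a few lines of base R. The function names `classic_weights` and `bayes_weights` are made up for this illustration. The Dirichlet draw uses the standard trick of normalizing independent Exponential(1) draws.

```r
# Sketch: one draw of probability weights under each bootstrap scheme.
set.seed(1)

# Classical bootstrap: Multinomial counts with N trials, normalized by N.
classic_weights <- function(N) {
  counts <- rmultinom(1, size = N, prob = rep(1 / N, N))[, 1]
  counts / N
}

# Bayesian bootstrap: a draw from a flat Dirichlet(1, ..., 1), obtained by
# normalizing independent Exponential(1) draws.
bayes_weights <- function(N) {
  g <- rexp(N)
  g / sum(g)
}

w_classic <- classic_weights(5)
w_bayes <- bayes_weights(5)
w_classic  # always multiples of 1/5
w_bayes    # any point on the simplex
```

Both vectors sum to one, but only the classical weights are restricted to multiples of $1/N$.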
Setting aside philosophical differences, the only difference between the two methods is in how the weights are drawn, and both methods result in very similar weight distributions. The mean of the two weight distributions is the same: each probability weight $\pi_j$ has a mean of $1/N$ both when using the Multinomial and the Dirichlet. The variance of the weights is almost the same. For the classical bootstrap the variance is $(N + 1) / N$ times the variance of the Bayesian bootstrap weights, and this difference grows small very quickly as $N$ gets large. These similarities are presented in Rubin’s original paper on the Bayesian bootstrap and discussed in a friendly manner by Friedman, Hastie and Tibshirani (2009) on p. 271.
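The variance ratio mentioned above can be checked from the marginal distributions of a single weight. The following short derivation uses only standard results for the Multinomial (each count $c_i$ is marginally Binomial with $N$ trials and probability $1/N$) and for the Dirichlet. The superscripts "classic" and "Bayes" are notation introduced here for illustration.

$$\begin{align} \text{Var}(\pi_i^{\text{classic}}) &= \frac{1}{N^2}\,\text{Var}(c_i) = \frac{1}{N^2} \cdot N \cdot \frac{1}{N}\left(1 - \frac{1}{N}\right) = \frac{N - 1}{N^3} \\ \text{Var}(\pi_i^{\text{Bayes}}) &= \frac{1 \cdot (N - 1)}{N^2 (N + 1)} = \frac{N - 1}{N^2 (N + 1)} \\ \frac{\text{Var}(\pi_i^{\text{classic}})}{\text{Var}(\pi_i^{\text{Bayes}})} &= \frac{N + 1}{N} \end{align}$$

For five data points the ratio is $6/5 = 1.2$, and for 100 data points it is already down to $1.01$.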
From a Bayesian perspective I find three things that are slightly strange with how the weights are drawn in the classical bootstrap (but from a sampling distribution perspective it makes total sense, of course):
Let’s try to visualize the difference between the two versions of the bootstrap! Below is a graph where each colored column is a draw of probability weights, either from a Dirichlet distribution (to the left) or using the classical resampling scheme (to the right). The first row shows the case with two data points ($N = 2$). Here the difference is large, draws from the Dirichlet vary smoothly between 0% and 100% while the resampling weights are either 0%, 50% or 100%, with 50% being roughly twice as common. However, as the number of data points increases, the resampling weights vary more smoothly and become more similar to the Dirichlet weights.
This difference in how the weights are sampled can also be seen when comparing the resulting distributions over the data. Below, the classical and the Bayesian bootstrap are used to infer a mean when applied to two, four and eight samples from a $\text{Normal(0, 1)}$ distribution. At $N = 2$ the resulting distributions look very different, but they look almost identical already at $N = 8$. (One reason for this is that we are inferring a mean; other statistics could require many more data points before the two bootstrap methods “converge”.)
It is sometimes written that “the Bayesian bootstrap can be thought of as a smoothed version of the Efron bootstrap” (Lancaster, 2003), but you could equally well think of the classical bootstrap as a rough version of the Bayesian bootstrap! Nevertheless, as $N$ gets larger the classical bootstrap quickly becomes a good approximation to the Bayesian bootstrap, and similarly the Bayesian bootstrap quickly becomes a good approximation to the classical one.
Above we saw a connection between the Bayesian bootstrap and the classical bootstrap procedure, that is, using sampling with replacement to create a distribution over some statistic. But you can also show the connection between the models underlying both methods. For the classical bootstrap the underlying model is that the distribution of the data is the distribution of the population. For the Bayesian bootstrap the values in the data define the support of the predictive distribution, but how much each value contributes to the predictive depends on the probability weights which are, again, distributed as a $\text{Dirichlet}(1, \ldots, 1)$ distribution. If we discard the uncertainty in this distribution by taking a point estimate of the probability weights, say the posterior mean, we end up with the following weights: $(1/N, \ldots, 1/N)$. That is, each data point contributes equally to the posterior predictive, which is exactly the assumption of the classical bootstrap. So if you just look at the underlying models, and skip that part where you simulate a sampling distribution, the classical bootstrap can be seen as the posterior mean of the Bayesian bootstrap.
The model of the classical bootstrap can also be put as a special case of the model for the Bayesian bootstrap, version two. In that model the probability weights $\pi = (\pi_1, \ldots, \pi_M)$ were given an uninformative $\text{Dirichlet}(\alpha_1, \ldots,\alpha_M)$ distribution with $\alpha = 0$. If we increased $\alpha$, combinations with more equal weights would become successively more likely:
In the limit of $\alpha \rightarrow \infty$, the only possible weight becomes $\pi = (1/M, \ldots, 1/M)$, that is, the model is “convinced” that all seen values contribute exactly equally to the predictive distribution. That is, the same assumption as in the classical bootstrap! Note that this only works if all seen data points are unique (or assumed unique) as would most often be the case with continuous data.
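How the weights tighten around $1/M$ as $\alpha$ grows can be sketched by simulation. The helper `rdirichlet_sym` below is a made-up name. It draws from a symmetric Dirichlet by normalizing independent Gamma draws, and the values of `alpha` are arbitrary choices for illustration.

```r
# Sketch: symmetric Dirichlet(alpha, ..., alpha) draws for increasing alpha.
set.seed(42)

# Draw n weight vectors of dimension M by normalizing Gamma(alpha, 1) draws.
rdirichlet_sym <- function(n, M, alpha) {
  g <- matrix(rgamma(n * M, shape = alpha), nrow = n, ncol = M)
  g / rowSums(g)  # each row now sums to 1
}

M <- 4
alphas <- c(0.1, 1, 10, 1000)
# Standard deviation of the first weight for each alpha.
sd_first_weight <- sapply(alphas, function(alpha) {
  sd(rdirichlet_sym(2000, M, alpha)[, 1])
})
names(sd_first_weight) <- alphas
round(sd_first_weight, 3)
```

The spread of the weights shrinks steadily as `alpha` increases, so for very large `alpha` every draw is close to $(1/M, \ldots, 1/M)$.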
Let’s apply the $\text{Dirichlet}(\infty, \ldots,\infty)$ version of the classical bootstrap to 30 draws from a $\text{Normal}(0, 1)$ distribution. The following animation then illustrates the uncertainty by showing draws from the posterior predictive distribution:
He he, just trolling you. Due to the $\text{Dirichlet}(\infty, \ldots,\infty)$ prior there is no uncertainty at all regarding the predictive distribution. Hence the “animation” is a still image. Let’s apply the Bayesian bootstrap to the same data. The following (actual) animation shows the uncertainty by plotting 200 draws from the posterior predictive distribution:
I like the non-parametric bootstrap, both the classical and the Bayesian version. The bootstrap is easy to explain, easy to run and often gives reasonable results (despite the somewhat weird model assumptions). From a Bayesian perspective it is also very natural to view the classical bootstrap as an approximation to the Bayesian bootstrap. Or as Friedman et al. (2009, p. 272) put it:
In this sense, the bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter. But this bootstrap distribution is obtained painlessly — without having to formally specify a prior and without having to sample from the posterior distribution. Hence we might think of the bootstrap distribution as a “poor man’s” Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.
You can also view the Bayesian bootstrap as a “poor man’s” model. A model that makes very weak assumptions (weak as in uninformative), but that can be used in case you don’t have the time and resources to come up with something better. However, it is almost always possible to come up with a model that is better than the bootstrap, or as Donald B. Rubin (1981) puts it:
[…] is it reasonable to use a model specification that effectively assumes all possible distinct values of X have been observed?
No, probably not.
Alfaro, M. E., Zoller, S., & Lutzoni, F. (2003). Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Molecular Biology and Evolution, 20(2), 255-266.
Clyde, M. A., & Lee, H. K. (2001). Bagging and the Bayesian bootstrap. In Artificial Intelligence and Statistics.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1-26.
Friedman, J., Hastie, T., & Tibshirani, R. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Freely available at http://www-stat.stanford.edu/~tibs/ElemStatLearn/ .
Gershman, S. J., & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1), 1-12.
Lancaster, T. (2003). A note on bootstraps and robustness. SSRN 896764.
Rubin, D. B. (1981). The Bayesian Bootstrap. The Annals of Statistics, 9(1), 130-134.
Everybody loves speed comparisons! Is R faster than Python? Is dplyr faster than data.table? Is STAN faster than JAGS? It has been said that speed comparisons are utterly meaningless, and in general I agree, especially when you are comparing apples and oranges, which is what I’m going to do here. I’m going to compare a couple of alternatives to lm() that can be used to run linear regressions in R, but that are more general than lm(). One reason for doing this was to see how much performance you’d lose if you used one of these tools to run a linear regression (even if you could have used lm()). But as speed comparisons are utterly meaningless, my main reason for blogging about this is just to highlight a couple of tools you can use when you have grown out of lm(). The speed comparison was just to lure you in. Let’s run!
Below are the seven different methods that I’m going to compare by using each method to run the same linear regression. If you are just interested in the speed comparisons, just scroll to the bottom of the post. And if you are actually interested in running standard linear regressions as fast as possible in R, then Dirk Eddelbuettel has a nice post that covers just that.
lm()
This is the baseline, the “default” method for running linear regressions in R. If we have a data.frame d with the following layout:
head(d)
## y x1 x2
## 1 -64.579 -1.8088 -1.9685
## 2 -19.907 -1.3988 -0.2482
## 3 -4.971 0.8366 -0.5930
## 4 19.425 1.3621 0.4180
## 5 -1.124 -0.7355 0.4770
## 6 -12.123 -0.9050 -0.1259
Then this would run a linear regression with y as the outcome variable and x1 and x2 as the predictors:
lm(y ~ 1 + x1 + x2, data=d)
##
## Call:
## lm(formula = y ~ 1 + x1 + x2, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
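The data behind `d` is not shown in this post, so to follow along you need a stand-in. Below is a minimal sketch that simulates a data frame with the same layout. The sample size of 30 matches the degrees of freedom in the output further down, but the seed and the "true" parameter values are assumptions chosen only to resemble the output above, so the estimates will not match exactly.

```r
# Sketch: simulate a stand-in for the data frame d (all values are made up).
set.seed(2015)
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
# Hypothetical true parameters: intercept 0, slopes 10 and 20, noise sd 3.
y <- 0 + 10 * x1 + 20 * x2 + rnorm(n, sd = 3)
d <- data.frame(y = y, x1 = x1, x2 = x2)

fit <- lm(y ~ 1 + x1 + x2, data = d)
coef(fit)
```

The estimated coefficients should land near the assumed values of 0, 10 and 20.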
glm()
This is a generalization of lm() that allows you to assume a number of different distributions for the outcome variable, not just the normal distribution as you are stuck with when using lm(). However, if you don’t specify any distribution, glm() will default to using a normal distribution and will produce the same coefficient estimates as lm():
glm(y ~ 1 + x1 + x2, data=d)
##
## Call: glm(formula = y ~ 1 + x1 + x2, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
##
## Degrees of Freedom: 29 Total (i.e. Null); 27 Residual
## Null Deviance: 13200
## Residual Deviance: 241 AIC: 156
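As a small sketch of both points, the snippet below checks that the default Gaussian family reproduces the lm() coefficients and then fits a Poisson regression with the same formula interface. The data frame `d_sim` and its columns are simulated here purely for illustration.

```r
# Sketch: glm() with the default gaussian family versus another family.
set.seed(7)
d_sim <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d_sim$y <- 1 + 2 * d_sim$x1 - d_sim$x2 + rnorm(50)

fit_lm <- lm(y ~ 1 + x1 + x2, data = d_sim)
fit_glm <- glm(y ~ 1 + x1 + x2, data = d_sim, family = gaussian())
same_coefs <- all.equal(coef(fit_lm), coef(fit_glm))
same_coefs

# A count outcome: same kind of formula, but a different assumed distribution.
d_sim$counts <- rpois(50, lambda = exp(0.5 + 0.3 * d_sim$x1))
fit_pois <- glm(counts ~ 1 + x1 + x2, data = d_sim, family = poisson())
coef(fit_pois)
```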
bayesglm()
Found in the arm package, this is a modification of glm() that allows you to assume custom prior distributions over the coefficients (instead of the implicit flat priors of glm()). This can be super useful, for example, when you have to deal with perfect separation in logistic regression or when you want to include prior information in the analysis. While there is bayes in the function name, note that bayesglm() does not give you the whole posterior distribution, only point estimates. This is how to run a linear regression with flat priors, which should give results similar to those from lm():
library(arm)
bayesglm(y ~ 1 + x1 + x2, data = d, prior.scale=Inf, prior.df=Inf)
##
## Call: bayesglm(formula = y ~ 1 + x1 + x2, data = d, prior.scale = Inf,
## prior.df = Inf)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
##
## Degrees of Freedom: 29 Total (i.e. Null); 30 Residual
## Null Deviance: 13200
## Residual Deviance: 241 AIC: 156
nls()
While lm() can only fit linear models, nls() can also be used to fit non-linear models by least squares. For example, you could fit a sine curve to a data set with the following call: nls(y ~ par1 + par2 * sin(par3 + par4 * x )). Notice here that the syntax is a little bit different from lm() as you have to write out both the variables and the parameters. Here is how to run the linear regression:
nls(y ~ intercept + x1 * beta1 + x2 * beta2, data = d)
## Nonlinear regression model
## model: y ~ intercept + x1 * beta1 + x2 * beta2
## data: d
## intercept beta1 beta2
## -0.293 10.364 21.225
## residual sum-of-squares: 241
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 3.05e-08
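To make the sine curve example mentioned above concrete, here is a sketch on simulated data. The data, the "true" parameter values and the starting values are all assumptions made for this illustration. Unlike the linear case, a non-linear fit like this generally needs reasonable starting values, supplied through the `start` argument.

```r
# Sketch: fitting a sine curve with nls() on simulated data.
set.seed(11)
x <- seq(0, 10, length.out = 100)
# Made-up truth: par1 = 2, par2 = 3, par3 = 0.5, par4 = 1.5.
y <- 2 + 3 * sin(0.5 + 1.5 * x) + rnorm(100, sd = 0.3)
d_sine <- data.frame(x = x, y = y)

fit_sine <- nls(y ~ par1 + par2 * sin(par3 + par4 * x), data = d_sine,
                start = list(par1 = 1.5, par2 = 2.5, par3 = 0.5, par4 = 1.48))
coef(fit_sine)
```

If the starting value for the frequency (`par4`) is far from the truth, nls() can fail to converge or settle on a poor local optimum.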
mle2()
In the bbmle package we find mle2(), a function for general maximum likelihood estimation. While mle2() can be used to maximize a handcrafted likelihood function, it also has a formula interface which is simple to use, but powerful, and that plays nicely with R’s built-in distributions. Here is how to roll a linear regression:
library(bbmle)
inits <- list(log_sigma = rnorm(1), intercept = rnorm(1),
              beta1 = rnorm(1), beta2 = rnorm(1))
mle2(y ~ dnorm(mean = intercept + x1 * beta1 + x2 * beta2, sd = exp(log_sigma)),
     start = inits, data = d)
##
## Call:
## mle2(minuslogl = y ~ dnorm(mean = intercept + x1 * beta1 + x2 *
## beta2, sd = exp(log_sigma)), start = inits, data = d)
##
## Coefficients:
## log_sigma intercept beta1 beta2
## 1.0414 -0.2928 10.3641 21.2248
##
## Log-likelihood: -73.81
Note that we need to explicitly initialize the parameters before the maximization and that we now also need a parameter for the standard deviation. For an even more versatile use of the formula interface for building statistical models, check out the very cool rethinking package by Richard McElreath.
optim()
Of course, if we want to be really versatile, we can craft our own log-likelihood function to be maximized using optim(), also part of base R. This gives us all the options, but there are also more things that can go wrong: We might make mistakes in the model specification, and if the search for the optimal parameters is not initialized well the model might not converge at all! A linear regression log-likelihood could look like this:
log_like_fn <- function(par, d) {
  sigma <- exp(par[1])
  intercept <- par[2]
  beta1 <- par[3]
  beta2 <- par[4]
  mu <- intercept + d$x1 * beta1 + d$x2 * beta2
  sum(dnorm(d$y, mu, sigma, log=TRUE))
}
inits <- rnorm(4)
optim(par = inits, fn = log_like_fn, control = list(fnscale = -1), d = d)
## $par
## [1] 1.0399 -0.2964 10.3637 21.2139
##
## $value
## [1] -73.81
##
## $counts
## function gradient
## 431 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
As the convergence returned 0 it hopefully worked fine (a 1 indicates non-convergence). The control = list(fnscale = -1) argument is just there to make optim() do maximum likelihood estimation rather than minimum likelihood estimation (which must surely be the worst estimation method ever).
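Since initialization matters, one possible variation is to start the search from rough, data-based values and to use a gradient-based method. The sketch below does this on simulated data and compares the result with lm(). The data frame `d_sim`, the choice of starting values and the choice of `method = "BFGS"` are assumptions for this illustration, not what was used for the output above.

```r
# Sketch: optim() with data-based starting values and the BFGS method.
set.seed(3)
d_sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d_sim$y <- -1 + 10 * d_sim$x1 + 20 * d_sim$x2 + rnorm(100, sd = 3)

# Same kind of log-likelihood as above: par = (log sigma, intercept, betas).
log_like_fn <- function(par, d) {
  sigma <- exp(par[1])
  mu <- par[2] + d$x1 * par[3] + d$x2 * par[4]
  sum(dnorm(d$y, mu, sigma, log = TRUE))
}

# Rough but sensible starting values instead of rnorm(4).
inits <- c(log(sd(d_sim$y)), mean(d_sim$y), 0, 0)
fit_optim <- optim(par = inits, fn = log_like_fn, d = d_sim,
                   method = "BFGS", control = list(fnscale = -1))

# Compare the regression coefficients with those from lm().
coef_lm <- coef(lm(y ~ x1 + x2, data = d_sim))
rbind(optim = fit_optim$par[2:4], lm = unname(coef_lm))
```

The intercept and slopes from optim() should agree closely with lm(), while the maximum likelihood estimate of sigma is slightly smaller than the residual standard error that lm() reports.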
optimizing()
Stan is a stand-alone program that plays well with R, and that allows you to specify a model in Stan’s language which will compile down to very efficient C++ code. Stan was originally built for doing Hamiltonian Monte Carlo, but now also includes an optimizing() function that, like R’s optim(), allows you to do maximum likelihood estimation (or maximum a posteriori estimation, if you explicitly included priors in the model definition). Here we need to do a fair bit of work before we can fit a linear regression, but what we gain is extreme flexibility in extending this model, should we need to. We have come a long way from lm…
library(rstan)
## Loading required package: inline
##
## Attaching package: 'inline'
##
## The following object is masked from 'package:Rcpp':
##
## registerPlugin
##
## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)
##
## Attaching package: 'rstan'
##
## The following object is masked from 'package:arm':
##
## traceplot
model_string <- "
data {
  int n;
  vector[n] y;
  vector[n] x1;
  vector[n] x2;
}
parameters {
  real intercept;
  real beta1;
  real beta2;
  real<lower=0> sigma;
}
model {
  vector[n] mu;
  mu <- intercept + x1 * beta1 + x2 * beta2;
  y ~ normal(mu, sigma);
}
"
data_list <- list(n = nrow(d), y = d$y, x1 = d$x1, x2 = d$x2)
model <- stan_model(model_code = model_string)
fit <- optimizing(model, data_list)
fit
## $par
## intercept beta1 beta2 sigma
## -0.2929 10.3642 21.2248 2.8331
##
## $value
## [1] -46.24
So, just for fun, here is the speed comparison, first for running a linear regression with 1000 data points and 5 predictors:
This should be taken with a huge heap of salt (which is not too good for your health!). While all these methods produce a result equivalent to a linear regression they do it in different ways, and not necessarily in equally good ways. For example, my homemade optim routine is not converging correctly when trying to fit a model with too many predictors. As I have used the standard settings there is surely a multitude of ways in which any of these methods can be made faster. Anyway, here is what happens if we vary the number of predictors and the number of data points:
To make these speed comparisons I used the microbenchmark package; the full script replicating the plots above can be found here. This speed comparison was made on my laptop running R version 3.1.2, on 32 bit Ubuntu 12.04, with an average amount of RAM and a processor that is starting to get a bit tired.
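The full benchmarking script is not reproduced here, but the general idea can be sketched with base R alone. Note that this sketch swaps microbenchmark for a cruder approach based on system.time(), and that the helper `time_it`, the simulated data and the number of repetitions are all made up for illustration.

```r
# Sketch: a crude timing comparison of lm() and glm() using base R only.
set.seed(99)
n <- 1000
X <- matrix(rnorm(n * 5), ncol = 5, dimnames = list(NULL, paste0("x", 1:5)))
d_sim <- data.frame(y = as.vector(X %*% 1:5 + rnorm(n)), X)

# Run a function several times and return the median elapsed time in seconds.
time_it <- function(expr_fn, times = 20) {
  elapsed <- replicate(times, system.time(expr_fn())[["elapsed"]])
  median(elapsed)
}

timings <- c(
  lm = time_it(function() lm(y ~ ., data = d_sim)),
  glm = time_it(function() glm(y ~ ., data = d_sim))
)
timings
```

Timings from system.time() are coarse, so for fast functions like these a dedicated tool with finer resolution gives more trustworthy numbers.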
“Behind every great point estimate stands a minimized loss function.” – Me, just now
This is a continuation of Probable Points and Credible Intervals, a series of posts on Bayesian point and interval estimates. In Part 1 we looked at these estimates as graphical summaries, useful when it’s difficult to plot the whole posterior in good way. Here I’ll instead look at points and intervals from a decision theoretical perspective, in my opinion the conceptually cleanest way of characterizing what these constructs are.
If you don’t know that much about Bayesian decision theory, just chillax. When doing Bayesian data analysis you get it “pretty much for free” as esteemed statistician Andrew Gelman puts it. He then adds that it’s “not quite right because it can take effort to define a reasonable utility function.” Well, perhaps not free, but it is still relatively straight forward! I will use a toy problem to illustrate how Bayesian decision theory can be used to produce point estimates and intervals. The problem is this: Our favorite droid has gone missing and we desperately want to find him!
Robo went missing at 23:00 yesterday and hasn’t been seen since. We know he disappeared somewhere within a 120-mile-long strip of land and we are going to mount a search operation. Our top scientists have been up all night analyzing the available data and we just received the result: the probability of Robo being in different locations.
So, this is (in Bayesian lingo) a posterior distribution, the probability of different “states” after having analyzed the available data. Here the “state” is the location of Robo and looking at the posterior above it seems like he could be in a lot of places. Most likely he is in the forest, somewhere between 75 and 120 miles from the reference point (arbitrarily set to the leftmost position on the map). He might also be hiding in the plains, either around the 15th or the 40th mile. It’s not that likely that he’s in the mountains, but we can’t dismiss it altogether.
- So, where should we start looking for Robo?
- Well, that depends…
- Depends on what?
- Your loss function.
A loss function is some method of calculating how bad a decision would be if the world is in a certain state. In our case the state is the location of Robo, the decision could be where to start looking for him, and badness could be the time it will take to find him (we want to find him fast!). If we knew the state of the world, we could find the best decision: the decision that minimizes the loss. Now, we don’t actually know that state but, if we have a Bayesian model that we believe does a good job, we can use the resulting posterior to represent our knowledge about that state. That is, we are going to plug in a possible decision, and a posterior distribution, to our loss function and the result will be a probability distribution over how large the loss might be. Doing this is really easy, especially if the posterior is represented as a sample of values (which is almost always the case when doing Bayesian data analysis anyway). Of course, we could skip a formal decision analysis and just look at the posterior and make a non-formalized decision. In many cases that might be the preferred course, but it’s not why we are here today.
So we call our science team up and ask them to send over the posterior represented as a large sample of positions, let’s call that list s. Here are the first 16 samples in s:
head(s, n = 16)
## [1] 15 101 41 89 14 41 83 112 33 33 94 104 88 82 77 18
As expected, these values are mostly clustered around the 15th, 40th and 90th mile. Say that our loss function is the distance from where we start searching to the location of Robo, and our decision is to start the search at the 60th mile. To get a probability distribution for the loss we simply apply the loss function to each sample in s. For the first sample the loss is abs(15 - 60) = 45, for the second sample it’s abs(101 - 60) = 41, and so on. Below is the resulting posterior loss, given the decision to start searching at the 60th mile:
The plot above also shows the expected loss (here the mean of the posterior distance from the 60th mile). This is a common measure of how good a decision is and the final step in a decision analysis would be to find the decision that minimizes the expected loss. And that’s it! As Gelman mentioned, the hard part is defining a reasonable loss function, but once you have done that, it’s straightforward to find the decision that minimizes the expected loss.
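To make the "apply the loss function to each sample" step concrete, here is a minimal sketch that uses only the 16 draws printed above as a stand-in for the full sample s. With so few draws the expected loss is only a rough approximation of what the full posterior would give.

```r
# The 16 posterior draws printed above (only an excerpt of the full sample s).
s <- c(15, 101, 41, 89, 14, 41, 83, 112, 33, 33, 94, 104, 88, 82, 77, 18)

decision <- 60                         # start the search at the 60th mile
posterior_loss <- abs(s - decision)    # one loss value per posterior draw
expected_loss <- mean(posterior_loss)  # the number used to compare decisions

posterior_loss[1:2]  # 45 and 41, as worked out in the text
expected_loss
```

Repeating the last three steps for other values of decision is all it takes to compare decisions against each other.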
The rest of the post is dedicated to showing how one can define different loss functions for the “Where’s Robo?” scenario. I will start out with some simple loss functions that result in point estimates and end with some more complicated loss functions that result in interval estimates.
A Bayesian point estimate is the result of a decision analysis where you (or perhaps your computer) have found the best point/location/value given a posterior distribution and some loss function. Meanwhile, the management have decided that the search party will be deployed by helicopter and that, once on the ground, it will split into two groups, one searching to the right and one searching to the left. Now we only need to decide where to start the search for Robo. Thus, we desperately need a Bayesian point estimate and to get that we need a loss function!
When it comes to loss functions there are three usual suspects: squared loss, absolute loss and 0-1 loss (also known as L2, L1 and L0 loss). We’ve already seen absolute loss (L1), it’s when the loss is the distance between the decision and the state of the world, that is, the absolute value of the difference between the point x and the state s. In R code:
absolute_loss <- function(x, s) {
  mean(abs(x - s))
}
Due to taking the mean, this function will also return the expected absolute loss when s is a posterior sample. For our present purpose this is a pretty decent loss function. Assuming that the two search groups walk at constant speed, this function will minimize the expected time/cost it takes to find Robo.
Squared loss (L2) is another common loss function:
squared_loss <- function(x, s) {
  mean((x - s)^2)
}
Using this loss function would mean that we consider it four times as bad if it takes twice the time to find Robo (again assuming the two search groups walk at constant speed). So squared loss might not make that much sense for the present problem.
The last loss function is 0-1 loss (L0) which assigns zero loss to a decision that is correct and one loss to an incorrect decision. Given this loss function the best decision is to choose the most probable state. This loss function makes sense if you are, say, defusing a bomb and need to choose between the green, blue and red wire (if you make the right decision = no loss of limbs, cut the wrong wire = Boom!). When searching for Robo it doesn’t really make sense to say that starting the search 1 mile from Robo’s location is as bad as starting it on the moon. As Robo’s position is a real number, the posterior probability of him being in any specific position is practically zero. In this continuous case we can instead use the posterior probability density. If you have a sample from the posterior (s) then the density can be approximated using density(s). As the resulting density is given at discrete points we have to use approx to interpolate the density at the decision x, and we have to negate the resulting density estimate to turn this into a loss. Here is the whole function:
zero_one_loss <- function(x, s) {
  # This below is proportional to 0-1 loss.
  - approx(density(s), xout = x)$y
}
So, where should we start the search operation according to these three loss functions? To figure this out we just need to determine what decision minimizes the expected loss. This can be done in more or less intelligent ways, but I went brute force and just tried all positions from the 0th to the 120th mile. Here are the resulting point estimates (with the loss functions below):
According to the absolute loss criterion (L1) we should start looking in the forest, according to the quadratic loss (L2) we should start in the mountains, and 0-1 loss (L0) goes for the single most probable location at the 15th mile. The way I found these point estimates is very general: evaluate the loss function for all possible decisions (or a representative sample) and pick the decision with the smallest expected loss. However, for these specific loss functions there is a much easier way: the minimum expected absolute loss corresponds to the median of the posterior, the quadratic loss corresponds to the mean, and 0-1 loss corresponds to the mode. That is, the same three point estimates we looked at in Part 1 of Probable Points and Credible Intervals! Why exactly this is the case is beautifully explained by John Myles White on his blog.
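The brute-force search and its median/mean/mode shortcut can be checked with a small sketch. The real posterior sample is not reproduced here, so the s below is a made-up three-cluster sample (an assumption, loosely mimicking the plains and forest clusters), and best_decision is a hypothetical helper name. The resulting numbers will therefore not match the figures in this post.

```r
set.seed(1)
# Hypothetical stand-in for the posterior sample s: three clusters on the strip.
s <- c(rnorm(3000, 15, 2), rnorm(1500, 40, 5), rnorm(5500, 95, 12))
s <- pmin(pmax(s, 0), 120)  # keep all positions within the 120-mile strip

absolute_loss <- function(x, s) mean(abs(x - s))
squared_loss  <- function(x, s) mean((x - s)^2)
zero_one_loss <- function(x, s) -approx(density(s), xout = x)$y

# Brute force: evaluate the expected loss at every mile and keep the smallest.
decisions <- 0:120
best_decision <- function(loss) {
  expected <- sapply(decisions, loss, s = s)
  decisions[which.min(expected)]
}

d <- density(s)
c(L1 = best_decision(absolute_loss), median = median(s))
c(L2 = best_decision(squared_loss),  mean   = mean(s))
c(L0 = best_decision(zero_one_loss), mode   = d$x[which.max(d$y)])
```

Each pair should agree up to the one-mile resolution of the grid, which is the shortcut described above.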
Note that using the three loss functions above results in widely different decisions: it’s a big difference between landing the search team in the forest and in the windy mountains, and it’s a bit strange that the loss functions don’t consider aspects of the problem such as the terrain. Going forward I will explore a couple of different loss functions more suited to the “Where’s Robo?” scenario. This is not because these are loss functions that are especially useful and widely applicable, but rather because I want to show how easy it is to define new loss functions when doing Bayesian decision analysis.
We got a call from management and they have decided that instead of sending a search team, we are going to do a satellite scan that is guaranteed to find any robot within a radius of 30 miles, and now they want to know where to target it. This calls for another loss function! As with 0-1 loss we want to minimize the expectation of not finding Robo, but now it is within a certain radius around the decision point x. In R:
limited_dist_loss <- function(x, s, max_dist) {
  mean(abs(x - s) > max_dist)
}
This code calculates the expectation of Robo being outside the max_dist radius around x, where max_dist should be set to 30 in our case. Using this loss function with our posterior s gives us the following graph:
So, we should center the scan on the 89th mile, which will scan the forest and part of the mountains, and which will result in a 1.0 - 0.4 = 0.6 probability of finding Robo.
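A curve like the one in the graph can be produced by evaluating limited_dist_loss at every mile. The sketch below does that for a made-up posterior sample (an assumption, since the real s is not included here), so the best target and the probability of finding Robo will differ from the 89th mile and 0.6 reported above.

```r
set.seed(1)
# Hypothetical stand-in for the posterior sample s.
s <- c(rnorm(3000, 15, 2), rnorm(1500, 40, 5), rnorm(5500, 95, 12))
s <- pmin(pmax(s, 0), 120)

limited_dist_loss <- function(x, s, max_dist) {
  mean(abs(x - s) > max_dist)
}

# Expected loss (probability of missing Robo) for every possible scan center.
decisions <- 0:120
expected_loss <- sapply(decisions, limited_dist_loss, s = s, max_dist = 30)
best <- decisions[which.min(expected_loss)]

c(target = best, p_found = 1 - min(expected_loss))
```

The probability of finding Robo is simply one minus the minimized expected loss, which is the same calculation as the 1.0 - 0.4 = 0.6 above.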
We got new info from management: The droid carries some space station plans critical to the empire something something. Anyway, we need to find Robo fast, within 24 hours! Again we are going to deploy a search team that will split into two groups, but this time we need to consider how long it will take for the teams to find him. It takes different amounts of time to search different types of terrain: a mile of plains takes one hour, a mile of forest takes five hours and a mile of mountains takes ten. The list cover_time encodes the time it takes to search each mile; here we have the 48th to the 54th mile:
cover_time[48:54]
## plain plain plain mountain mountain mountain mountain
## 1 1 1 10 10 10 10
The following loss function calculates the expectation of not finding Robo within max_time hours by calculating the time to Robo’s location from the starting point x and then taking the expectation of this time being longer than max_time:
limited_time_loss <- function(x, s, max_time) {
  time_to_robo <- sapply(s, function(robo_pos) { sum(cover_time[x:robo_pos]) })
  mean(time_to_robo > max_time)
}
This is how the expected loss looks for different starting points with max_time set to 24:
Our best bet is to start at the 27th mile which means we will cover the whole plains area within 24 hours. If we instead had to find Robo within 72 hours it would be better to start at the 90th mile, as we now would have time to search the forest region:
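To try this loss function yourself you need a cover_time vector and a posterior sample, neither of which is included in the post. The sketch below makes both up: the terrain layout (plains up to mile 50, mountains for miles 51 to 74, forest from mile 75) is an assumption that only agrees with the cover_time[48:54] excerpt and the map description, and s is a hypothetical sample. The best starting points it reports are therefore not expected to match the 27th and 90th mile found above.

```r
set.seed(1)
# Hypothetical posterior sample, rounded to whole miles so it can index cover_time.
s <- round(c(rnorm(300, 15, 2), rnorm(150, 40, 5), rnorm(550, 95, 12)))
s <- pmin(pmax(s, 1), 120)

# Assumed terrain layout; only miles 48 to 54 are known from the text.
terrain <- rep(c("plain", "mountain", "forest"), times = c(50, 24, 46))
cover_time <- c(plain = 1, mountain = 10, forest = 5)[terrain]

limited_time_loss <- function(x, s, max_time) {
  time_to_robo <- sapply(s, function(robo_pos) sum(cover_time[x:robo_pos]))
  mean(time_to_robo > max_time)
}

# Try every starting mile and keep the one with the smallest expected loss.
starts <- 1:120
best_start <- function(max_time) {
  expected <- sapply(starts, limited_time_loss, s = s, max_time = max_time)
  starts[which.min(expected)]
}

start_24 <- best_start(24)
start_72 <- best_start(72)
c(start_24 = start_24, start_72 = start_72)
```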
A Bayesian interval estimate is the result of a decision analysis where you have found the best interval given a posterior distribution and some loss function. To decide what interval or region to search through is perhaps a more natural decision when looking for Robo, rather than deciding where to land a search team. In part one we looked at some different types of intervals, one was the highest density interval (HDI) defined as the shortest interval that contains a given percentage of the posterior probability. The HDI can also be defined as the interval that minimizes an expected loss (the specific loss function is derived here, but is a tiny bit complicated). Say that we want to find Robo with 90% probability while having to search through the smallest region. The best decision would then be the following 90% HDI:
A strange thing with this type of interval is that we limit the probability of finding Robo. Surely we would like to find Robo with the highest probability possible. What’s limiting us from finding Robo with a 100% probability should be time, effort or cost.
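Going back to the definition of the HDI as the shortest interval containing a given share of the posterior, here is a minimal sketch of how it can be computed from a sample. It is not necessarily how the interval in the figure was produced: shortest_interval is a hypothetical helper, s is a made-up sample, and the function returns a single interval even when the posterior is multimodal.

```r
set.seed(1)
# Hypothetical stand-in for the posterior sample s.
s <- c(rnorm(3000, 15, 2), rnorm(1500, 40, 5), rnorm(5500, 95, 12))
s <- pmin(pmax(s, 0), 120)

shortest_interval <- function(s, prob = 0.9) {
  sorted <- sort(s)
  n <- length(sorted)
  n_in <- ceiling(prob * n)            # number of draws the interval must cover
  lower_idx <- seq_len(n - n_in + 1)   # every possible lower endpoint
  widths <- sorted[lower_idx + n_in - 1] - sorted[lower_idx]
  best <- which.min(widths)            # the narrowest candidate wins
  c(lower = sorted[best], upper = sorted[best + n_in - 1])
}

hdi_90 <- shortest_interval(s, prob = 0.9)
hdi_90
```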
Let’s define an interval that is limited by cost instead. The management have decided that the search operation will cost \$1000. One hour of search costs \$100 and using our knowledge about how long it takes to search a mile of each type of terrain we can calculate the corresponding search cost:
search_cost <- cover_time * 100
search_cost[48:54]
## plain plain plain mountain mountain mountain mountain
## 100 100 100 1000 1000 1000 1000
So searching through the 48th to the 54th mile would cost \$4300, a bit over budget. What we want to find is the interval with the highest expectation of finding Robo but that costs no more than \$1000 to search through. The loss function is basically just a variation of limited_time_loss but with two parameters: the lower and the upper endpoints of the interval. To find the best interval I, again, just try all combinations of upper and lower endpoints and pick out the interval with the lowest loss (highest expectation of finding Robo) which costs \$1000 or less to search through:
So, if we just have \$1000 to spend we should go for the easy option and just search through the high probability region on the plains. What if we had \$3000 to spend?
Then we should search through almost the whole plains region. What if we had \$20,000?
Then we should go for the forest (and still stay away from the mountains, they don’t make good fiscal sense).
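The "try all combinations of endpoints" search can be sketched as follows. As before, the terrain layout and the sample s are assumptions made up for illustration (only miles 48 to 54 are known from the text), and best_interval is a hypothetical helper, so the intervals it finds need not match the figures above.

```r
set.seed(1)
# Hypothetical posterior sample in whole miles, and an assumed terrain layout.
s <- round(c(rnorm(3000, 15, 2), rnorm(1500, 40, 5), rnorm(5500, 95, 12)))
s <- pmin(pmax(s, 1), 120)
terrain <- rep(c("plain", "mountain", "forest"), times = c(50, 24, 46))
cover_time <- c(plain = 1, mountain = 10, forest = 5)[terrain]
search_cost <- cover_time * 100

sum(search_cost[48:54])  # 4300, the figure quoted in the text

best_interval <- function(budget) {
  pos_prob <- tabulate(s, nbins = 120) / length(s)  # P(Robo is at each mile)
  cum_cost <- c(0, cumsum(search_cost))
  cum_prob <- c(0, cumsum(pos_prob))
  best <- c(lower = NA, upper = NA, p_found = 0)
  for (lower in 1:120) {
    for (upper in lower:120) {
      cost <- cum_cost[upper + 1] - cum_cost[lower]
      if (cost > budget) break  # extending the interval only adds cost
      p <- cum_prob[upper + 1] - cum_prob[lower]
      if (p > best["p_found"]) {
        best <- c(lower = lower, upper = upper, p_found = p)
      }
    }
  }
  best
}

budget_1000 <- best_interval(1000)
budget_1000
```

Maximizing the probability of finding Robo is the same as minimizing the expected loss of not finding him, so this is the same decision analysis with the sign flipped.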
Now, we call up management and tell them that “it’s all good and well that you want to spend \$1000 on finding Robo, but why \$1000 exactly? And wouldn’t you want to spend as little as possible?” They tell us that, yes, they would like to spend as little as possible and the reason for the \$1000 figure is because that’s what Robo is worth. Ok, so what we want to do is to mount a search that maximizes the expected profit considering that Robo is worth \$1000 and that it costs \$100 per search hour. This calls for a utility function (something you want to maximize) which is just the opposite of a loss function (something you want to minimize). While any utility function can easily be cast into a loss function, it’s sometimes more natural to think of maximizing utility (say in finance) than minimizing losses. Loss functions and utility functions can both be called objective functions.
This is going to be the most complicated objective function in this post. The search is going to happen like this: We are going to start the search at one location, search in one direction, and stop the search if (1) we find Robo or (2) we have reached the location that marks the end of the search operation. The decision we have to make is where the search starts and where it terminates (in case we don’t find Robo). What we want to calculate is the expected profit given such a decision. The following function takes a start, an end and a robo_value, calculates the profit for each sample from Robo’s posterior position s and returns the expected profit:
expected_profit <- function(start, end, robo_value, s) {
  posterior_profit <- sapply(s, function(robo_pos) {
    if (robo_pos >= start && robo_pos <= end) {
      # That is, we find Robo and terminate the search at robo_pos.
      covered_ground <- start:robo_pos
      robo_value - sum(search_cost[covered_ground])
    } else {
      # That is, we won't find Robo and terminate the search at end instead.
      covered_ground <- start:end
      - sum(search_cost[covered_ground])
    }
  })
  mean(posterior_profit)
}
If we evaluate this utility function for (a representative sample of) all possible values for start and end, and with robo_value set to 1000, the maximum expected profit decision is this interval:
Huh? That doesn’t look like an interval… But it is, you’re looking at an empty interval. The best decision is to not search for Robo at all (with an expected profit of \$0), since any search operation will result in an expected negative profit (that is, a loss). Good to know! What if Robo was worth more money? Say, \$10,000?
Then we should search a small part of the plains, starting at the 12th mile, for an expected profit of \$835. If Robo was worth \$20,000?
Then we should search most of the plains, still starting from the left, for an expected profit of \$3733. Say, if Robo was really valuable?
Then we want to search the whole strip of land for an expected profit of \$16590. Notice that we should always start at the left, this is because it is relatively cheap to search the plains and our profit will be much higher if we find Robo before we spend too much on the search operation.
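Maximizing the expected profit over a representative sample of start and end points could look roughly like the sketch below. The sample s, the terrain layout behind search_cost, the coarse grid and the helper best_search are all assumptions made for illustration, and treating "no search at all" as a profit of exactly 0 is my reading of the empty interval. The profits will not reproduce the dollar figures above.

```r
set.seed(1)
# Hypothetical posterior sample in whole miles, and an assumed terrain layout.
s <- round(c(rnorm(120, 15, 2), rnorm(60, 40, 5), rnorm(220, 95, 12)))
s <- pmin(pmax(s, 1), 120)
terrain <- rep(c("plain", "mountain", "forest"), times = c(50, 24, 46))
search_cost <- c(plain = 1, mountain = 10, forest = 5)[terrain] * 100

expected_profit <- function(start, end, robo_value, s) {
  posterior_profit <- sapply(s, function(robo_pos) {
    if (robo_pos >= start && robo_pos <= end) {
      robo_value - sum(search_cost[start:robo_pos])  # found Robo at robo_pos
    } else {
      -sum(search_cost[start:end])                   # searched it all in vain
    }
  })
  mean(posterior_profit)
}

best_search <- function(robo_value, grid = seq(1, 120, by = 4)) {
  candidates <- expand.grid(start = grid, end = grid)
  candidates <- candidates[candidates$start <= candidates$end, ]
  candidates$profit <- mapply(expected_profit, candidates$start, candidates$end,
                              MoreArgs = list(robo_value = robo_value, s = s))
  best <- candidates[which.max(candidates$profit), ]
  # Not searching at all (the empty interval) always yields a profit of 0.
  if (best$profit <= 0) data.frame(start = NA, end = NA, profit = 0) else best
}

low  <- best_search(1000)
high <- best_search(20000)
low
high
```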
So this was just a sample of possible loss/utility functions for a simple toy problem, some better and some worse. I stuck with point decisions and interval decisions, but there is no reason why you should be limited to single intervals. Perhaps you want to launch several search parties, or perhaps you want to update the decision as you get new information. Constructing reasonable loss functions for real world problems can be very challenging, but the point is that Bayesian decision analysis still works in the same way as outlined in this post: (1) Get a posterior, (2) define a loss function, (3) find the decision that minimizes the expected loss. My toy example featured only a one-dimensional posterior, but the procedure would be no different with a multi-dimensional posterior (except for the added difficulty of optimizing a high-dimensional loss function).
Update: João Pedro at Faculdade de Ciências da Universidade de Lisboa has reimplemented the Robo scenario, but in 2D! Check it out, it’s really nice!
Another point I want to get across is that loss/utility functions are models of what’s bad/good and as such they can be made arbitrarily complex and are in some sense never “true”, in the same way as a statistical model of some process is never the “true” model. Or as Hennig and Kutlukaya (2007) put it:
“There is no objectively best loss function, because the loss function defines what ‘good’ means.”
By the way, did you find Robo?
Hennig, C., & Kutlukaya, M. (2007). Some thoughts about the design of loss functions. REVSTAT–Statistical Journal, 5(1), 19-39. pdf
White, J. M. (2013). Modes, Medians and Means: A Unifying Perspective. link
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer. Amazon link
Chapter 9 in Gelman et al (2013). Bayesian data analysis. CRC press. Amazon link
Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
While not working exactly as Norvig’s version it should result in similar spelling corrections:
correct("piese")
## [1] "piece"
correct("ov")
## [1] "of"
correct("cakke")
## [1] "cake"
So let’s deobfuscate the two-liner slightly (however, the code below might not make sense if you don’t read Norvig’s essay first):
# Read in big.txt, a 6.5 mb collection of different English texts.
raw_text <- paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of different type of words.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common type of words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))
correct <- function(word) {
  # Calculate the edit distance between the word and all other words in sorted_words.
  edit_dist <- adist(word, sorted_words)
  # Calculate the minimum edit distance to find a word that exists in big.txt
  # with a limit of two edits.
  min_edit_dist <- min(edit_dist, 2)
  # Generate a vector with all words with this minimum edit distance.
  # Since sorted_words is ordered from most common to least common, the resulting
  # vector will have the most common / probable match first.
  proposals_by_prob <- sorted_words[edit_dist <= min_edit_dist]
  # In case proposals_by_prob would be empty we append the word to be corrected...
  proposals_by_prob <- c(proposals_by_prob, word)
  # ... and return the first / most probable word in the vector.
  proposals_by_prob[1]
}
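If you want to see how the function behaves without downloading big.txt, here is a small sketch with a made-up vocabulary. The six words and their ordering are an assumption for illustration: they stand in for sorted_words and are taken to be sorted from most to least common.

```r
# Hypothetical vocabulary, assumed to be ordered from most to least common.
sorted_words <- c("the", "of", "piece", "cake", "cape", "pie")

correct <- function(word) {
  edit_dist <- adist(word, sorted_words)
  min_edit_dist <- min(edit_dist, 2)
  # Closest words first by frequency, with the input itself as a fallback.
  proposals_by_prob <- c(sorted_words[edit_dist <= min_edit_dist], word)
  proposals_by_prob[1]
}

correct("piese")   # "piece": one substitution away
correct("cae")     # "cake": ties with "cape", but "cake" is listed as more common
correct("zzzzzz")  # "zzzzzz": nothing within two edits, so the input is returned
```

The second call shows why the ordering of sorted_words matters: both "cake" and "cape" are one edit away, and the more common word wins simply by coming first.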
Some thoughts:
- […] adist function. (A one line spell checker in R is indeed possible using the aspell function :)
- […] sorted_words vector would be a perfect target for some magrittr magic.
- […] NWORDS variable in order to be able to extract the most probable matching word. This is not necessary in the R code: as we already have a sorted vector, we know that the first item will always be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).
- […] HashMap<Integer, String> candidates = new HashMap<Integer, String>();.