In a previous post I used the Million Base 2.2 chess data base to calculate the predictive piece values of chess pieces. It worked out pretty well and here, just for fun, I thought I would check what happens to the predictive piece values over the course of a chess game. In the previous analysis, the data (1,000,000 chess positions) came from all parts of the chess games. Here, instead, are the predictive piece values using only positions from the first ten full moves (a full move is when White and Black have each made a move):
Compared with the predictive piece values using positions from all parts of the chess games, the values above are much closer to zero. As the values are given as log-odds (again, see the original post for a brief explanation), this means that the piece balance on the board in the first ten full moves doesn’t predict the outcome of the game very well. This makes sense, as how well a player manages the opening of a game isn’t necessarily manifested as a piece advantage until much later in the game. Also, notice that the loss of a rook actually results in a slightly higher probability of winning! This could be due to just a couple of games in the whole data set where one player sacrifices a rook for a positional advantage (as I figure it is pretty rare to lose a rook during the first ten full moves).
Most of the games in my data set have ended by move 60, as this plot shows:
Therefore, I split up the data set into bins of 10 full moves, up to 60 full moves, which resulted in the following predictive piece values:
So, the later we get into a chess game, the more strongly a piece advantage predicts a win. We can also scale the log-odds values so that they are relative to the value of a pawn, with a pawn fixed to 1.0:
I don’t have much analysis to offer here, except for pointing out the obvious that (1) as before, the later we get into a chess game, the stronger a piece advantage predicts a win, (2) in the late game (full moves 50-60) the predictive piece values almost reach the usual piece values (♟:1, ♞:3, ♝:3, ♜:5, and ♛:9), and (3) that having the advantage of playing White (☆) contributes more to the prediction early in the game, but gets closer to zero later in the game.
If you want to explore the Million Base 2.2 data base yourself, or want to replicate the analysis above, you’ll find the scripts for doing this in the original Big Data and Chess post.
Who doesn’t like chess? Me! Sure, I like the idea of chess – intellectual masterminds battling each other using nothing but pure thought – the problem is that I tend to lose, probably because I don’t really know how to play well, and because I never practice. I do know one thing: How much the different pieces are worth, their point values:
This was among the first things I learned when I was taught chess by my father. Given these point values it should be as valuable to have a knight on the board as having three pawns, for example. So where do these values come from? The point values are not actually part of the rules for chess, but are rather just used as a guideline when trading pieces, and they seem to be based on the expert judgment of chess authorities. (According to the guardian of truth there are many alternative valuations, all in the same ballpark as above.) As I recently learned that it is very important to be able to write Big Data on your CV, I decided to see if these point values could be retrieved using zero expert judgment in favor of huge amounts of chess game data.
How to allocate point values to the chess pieces using only data? One way of doing this is to calculate the predictive values of the chess pieces. That is, given that we only know the current number of pieces in a game of chess and use that information to predict the outcome of that game, how much does each type of piece contribute to the prediction? We need a model to predict the outcome of chess games where we have the following restrictions:
Now these restrictions might feel a bit restrictive, especially if we actually wanted to predict the outcome of chess games as well as possible, but they come from the fact that the original point values follow the same restrictions. As the original point values don’t change with context, neither should ours. Now, as my colleague Can Kabadayi (with an ELO well above 2000) remarked: “But context is everything in Chess!”. Absolutely, but I’m not trying to do anything profound here, this is just a fun exercise! :)
Given the restrictions there is one obvious model: Logistic regression, a vanilla statistical model that calculates the probability of a binary event, like a win or a loss. To get it going I needed data, and the biggest Big Data data set I could find was the Million Base 2.2 which contains over 2.2 million chess games. I had to do a fair deal of data munging to get it into a format that I could work with, but the final result was a table with a lot of rows that looked like this:
```
pawn_diff rook_diff knight_diff bishop_diff queen_diff white_win
        1         0           1          -1          0      TRUE
```
Here each row is from a position in a game where a positive number means White has more of that piece. For the position above, White has one more pawn and one more knight, but one less bishop, than Black. Last in each row we get to know whether White won or lost in the end; as logistic regression assumes a binary outcome, I discarded all games that ended in a draw. My résumé is unfortunately not going to look that good, as I never really solved the Big Data problem well. Two million chess games are a lot of games and it took my poor old laptop over a day to process only the first 100,000 games. Then I had the classic Big Data problem that I couldn’t fit it all into working memory, so I simply threw away data until it worked. Still, for the analysis I ended up using a sample of 1,000,000 chess positions from the first 100,000 games in the Million Base 2.2. Big enough data for me.
Using the statistical language R I first fitted the following logistic model using maximum likelihood (here described by R’s formula language):
white_win ~ 1 + pawn_diff + knight_diff + bishop_diff + rook_diff + queen_diff
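As a sketch of what such a fit looks like in code, here is a minimal version using `glm()`. The data frame below is simulated stand-in data (the real analysis used positions from the Million Base 2.2), and the “true” coefficient values are made up for illustration:

```r
# Simulate stand-in data; columns match the table shown earlier.
set.seed(42)
n <- 1000
d <- data.frame(
  pawn_diff   = sample(-3:3, n, replace = TRUE),
  knight_diff = sample(-2:2, n, replace = TRUE),
  bishop_diff = sample(-2:2, n, replace = TRUE),
  rook_diff   = sample(-2:2, n, replace = TRUE),
  queen_diff  = sample(-1:1, n, replace = TRUE)
)
# Hypothetical "true" log-odds values: intercept + five piece values.
true_coefs <- c(0.25, 0.5, 1.0, 1.1, 1.7, 2.4)
log_odds   <- as.matrix(cbind(1, d)) %*% true_coefs
d$white_win <- as.vector(runif(n) < plogis(log_odds))

# Maximum likelihood fit of the logistic regression.
fit <- glm(white_win ~ 1 + pawn_diff + knight_diff + bishop_diff +
             rook_diff + queen_diff, family = binomial, data = d)
coef(fit)  # the predictive piece values, in log-odds
```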
This resulted in the following piece values:
Three things to note: (1) In addition to the piece values, the model also included a coefficient for the advantage of going first, called White’s advantage above. (2) The predictive piece values rank the pieces in the same order as the original piece values do. (3) The piece values are given in log-odds, which can be a bit tricky to interpret but can easily be transformed into probabilities, as this graph shows:
Here White’s advantage translates to a 56% chance of White winning (everything else being equal), being two knights and one rook ahead but one pawn behind gives a 92% chance of winning, while being one queen behind gives only an 8% chance of winning. While log-odds are useful if you want to calculate probabilities, the original piece values are not given in log-odds; instead they are set relative to the value of a pawn, which is fixed at 1.0. Let’s scale our log-odds so that the pawn is given a piece value of 1.0:
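The log-odds-to-probability conversion is just the inverse logit, `plogis()` in R. The coefficient values below are hypothetical stand-ins, not the fitted values:

```r
# Hypothetical log-odds piece values, for illustration only.
white_advantage <- 0.25
pawn <- 0.5; knight <- 1.0; rook <- 1.7; queen <- 2.4

# Sum the relevant log-odds and push through the inverse logit.
plogis(white_advantage)                             # White's advantage alone
plogis(white_advantage + 2 * knight + rook - pawn)  # two knights + a rook up, a pawn down
plogis(white_advantage - queen)                     # a queen down
```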
We see now that, while the ranking is roughly the same, the scale is compressed compared to the original piece values. A queen is usually considered as nine times more valuable than a pawn, yet when it comes to predicting the outcome of a game a queen advantage counts the same as only a four pawn advantage. A second thing to notice is that bishops are valued slightly higher than knights. If you look at the Wiki page for Chess piece relative value you find that some alternative valuations value the bishop slightly higher than the knight, others add ½ point for a complete bishop pair. We can add that to the model!
white_win ~ 1 + pawn_diff + knight_diff + bishop_diff + rook_diff + queen_diff + bishop_pair_diff
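The `bishop_pair_diff` column isn’t in the table shown earlier. Assuming you have raw bishop counts for each side (a hypothetical helper, not from the original munging scripts), it could be derived like this:

```r
# +1 if only White has the bishop pair, -1 if only Black has it, 0 otherwise.
# Uses R's logical arithmetic: TRUE - FALSE == 1.
bishop_pair_diff <- function(white_bishops, black_bishops) {
  (white_bishops >= 2) - (black_bishops >= 2)
}

bishop_pair_diff(2, 1)  #  1
bishop_pair_diff(2, 2)  #  0
bishop_pair_diff(0, 2)  # -1
```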
Now with a pair of bishops getting their own value, the values of a knight and a single bishop are roughly equal. There is still the “mystery” regarding the low valuation of all the pieces compared to the pawn. (This doesn’t really have to be a mystery, as there is no reason why predictive piece values necessarily should be the same as the original piece values.) Instead of anchoring the value of a pawn to 1.0 we could anchor the value of another piece to its original piece value. Let’s anchor the knight to 3.0:
Now the value of the pieces (excluding the pawn) line up very nicely with the original piece values! So, as I don’t really play chess, I don’t know why a pawn advantage is such a strong predictor of winning (compared to the original piece values, that is). My colleague Can Kabadayi (the ELO > 2000 guy!) had the following to say:
In a high-class game the pawns can be more valuable – they are often the decisive element of the outcome – one can even say that the whole plan of the middle game is to munch a pawn somewhere and then convert it in the endgame by exchanging the pieces, thus increasing the value of the pawn. Grandmaster games tend to go to a rook endgame where both sides have a rook but one side has an extra pawn. It is not easy to convert these positions into a win, but you see the general idea. A passed pawn (a pawn that does not have any enemy pawns blocking its march to become a queen) is also an important asset in chess, as they are a constant threat to become a queen.
Can also gave me two quotes from legendary chess Grandmasters José Capablanca and Paul Keres relating to the value of pawns:
The winning of a pawn among good players of even strength often means the winning of the game. – Jose Capablanca
The older I grow, the more I value pawns. – Paul Keres
Another thing to keep in mind is that the predictive piece values might have looked very different if I had used a different data set. For example, the players in the current data set are all very skilled, having a median ELO of 2400 and with 95% of the players having an ELO between 2145 and 2660. Still, I think it is cool that the predictive piece values matched the original piece values as well as they did!
Update: See the follow-up where I look at how the predictive point values change as the game progresses.
The non-parametric bootstrap was my first love. I was lost in a muddy swamp of zs, ts and ps when I first saw her. Conceptually beautiful, simple to implement, easy to understand (I thought back then, at least). And when she whispered in my ear, “I make no assumptions regarding the underlying distribution”, I was in love. This love lasted roughly a year, but the more I learned about statistical modeling, especially the Bayesian kind, the more suspect I found the bootstrap. It is most often explained as a procedure, not a model, but what are you actually assuming when you “sample with replacement”? And what is the underlying model?
Still, the bootstrap produces something that looks very much like draws from a posterior and there are papers comparing the bootstrap to Bayesian models (for example, Alfaro et al., 2003). Some also wonder which alternative is more appropriate: Bayes or bootstrap? But these are not opposing alternatives, because the non-parametric bootstrap is a Bayesian model.
In this post I will show how the classical non-parametric bootstrap of Efron (1979) can be viewed as a Bayesian model. I will start by introducing the so-called Bayesian bootstrap and then I will show three ways the classical bootstrap can be considered a special case of the Bayesian bootstrap. So basically this post is just a rehash of Rubin’s The Bayesian Bootstrap from 1981. Some points before we start:
Just because the bootstrap is a Bayesian model doesn’t mean it’s not also a frequentist model. It’s just different points of view.
Just because it’s Bayesian doesn’t necessarily mean it’s any good. “We used a Bayesian model” is as much a quality assurance as “we used probability to calculate something”. However, writing out a statistical method as a Bayesian model can help you understand when that method could work well and how it can be made better (it sure helps me!).
Just because the bootstrap is sometimes presented as making almost no assumptions, doesn’t mean it does. Both the classical non-parametric bootstrap and the Bayesian bootstrap make very strong assumptions which can be pretty sensible and/or weird depending on the situation.
Let’s start with describing the Bayesian bootstrap of Rubin (1981), which the classical bootstrap can be seen as a special case of. Let $d = (d_1, \ldots, d_K)$ be a vector of all the possible values (categorical or numerical) that the data $x = (x_1, \ldots, x_N)$ could possibly take. It might sound strange that we should be able to enumerate all the possible values the data can take, what if the data is measured on a continuous scale? But, as Rubin writes, “[this] is no real restriction because all data as observed are discrete”. Then, each $x_i$ is modeled as being drawn from the $d$ possible values where the probability of $x_i$ receiving a certain value from $d$ depends on a vector of probabilities $\pi = (\pi_1, \ldots, \pi_K)$, where $\pi_1$ is the probability of drawing $d_1$. Using a categorical distribution, we can write it like this:
$$\left. \begin{array}{l} x_i = d_{k_i} \\ k_i \sim \text{Categorical}(\pi) \end{array} \right\} \quad \text{for $i$ in $1..N$}$$
Now we only need a prior distribution over the $\pi$s for the model to be complete. That distribution is the Dirichlet distribution which is a distribution over proportions. That is, the Dirichlet is a multivariate distribution which has support over vectors of real numbers between 0.0 and 1.0 that together sum to 1.0. A 2-dimensional Dirichlet is the same as a Beta distribution and is defined on the line where $\pi_1 + \pi_2$ is always 1, the 3-dimensional Dirichlet is defined on the triangle where $\pi_1 + \pi_2 + \pi_3$ is always 1, and so on. A $K$-dimensional Dirichlet has $K$ parameters $\alpha = (\alpha_1, \ldots, \alpha_K)$ where the expected proportion of, say, $\pi_1$ is $\alpha_1 / \sum \alpha_{1..K}$. The higher the sum of all the $\alpha$s, the more the distribution concentrates on the expected proportion. If instead $\sum \alpha_{1..K}$ approaches 0, the distribution concentrates on points with few large proportions. This behavior is illustrated below using draws from a 3-dimensional Dirichlet where $\alpha$ is set to different values and where red means higher density:
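Draws like these can be generated with the standard trick of normalizing independent Gamma draws, one per dimension (a sketch of the sampling only, not the plotting code behind the figure):

```r
# Draw n samples from a K-dimensional Dirichlet(alpha) by normalizing
# Gamma(alpha_k, 1) draws row-wise.
rdirichlet <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = alpha, rate = 1), ncol = k, byrow = TRUE)
  g / rowSums(g)
}

set.seed(1)
draws <- rdirichlet(10000, c(4, 1, 1))
colMeans(draws)  # close to the expected proportions 4/6, 1/6, 1/6
```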
When $\alpha = (1,1,1)$ the distribution is uniform, any combination of $(\pi_1,\pi_2,\pi_3)$ that forms a proportion is equally likely. But as $\alpha \rightarrow (0, 0, 0)$ the density is “pushed” towards the edges of the triangle making proportions like $(0.33, 0.33, 0.33)$ very unlikely in favor of proportions like $(0.9, 0.1, 0.0)$ and $(0.0, 0.5, 0.5)$. We want a Dirichlet distribution of this latter type, a distribution that puts most of the density over combinations of proportions where most of the $\pi$s are zero and only a few $\pi$s are large. Using this type of prior will make the model consider it very likely a priori that most of the data $x$ is produced from a small number of the possible values $d$. And in the limit when $\alpha = (0, \ldots, 0)$ the model will consider it impossible that $x$ takes on more than one value in $d$ unless there is data that shows otherwise. So using a $\text{Dirichlet}(0_1, \ldots, 0_K)$ over $\pi$ achieves the hallmark of the bootstrap, that the model only considers values already seen in the data as possible. The full model is then:
$$\begin{aligned} &\left. \begin{array}{l} x_i \leftarrow d_{k_i} \\ k_i \sim \text{Categorical}(\pi) \end{array} \right\} \quad \text{for $i$ in $1..N$} \\ &\pi \sim \text{Dirichlet}(0_1, \ldots, 0_K) \end{aligned}$$
So is this a reasonable model? Surprise, surprise: It depends. For binary data, $d = (0, 1)$, the Bayesian bootstrap is the same as assuming $x_i \sim \text{Bernoulli}(p)$ with an improper $p \sim \text{Beta(0,0)}$ prior. A completely reasonable model, if you’re fine with the non-informative prior. Similarly, it reduces to a categorical model when $d$ is a set of categories. For integer data, like count data, the Bayesian bootstrap implies treating each possible value as its own isolated category, disregarding any prior information regarding a relation between the values (such that three eggs are more than two eggs, but less than four). For continuous data the assumptions of the bootstrap feel a bit weird because we are leaving out obvious information: That the data is actually continuous and that a data point of, say, 3.1 should inform the model that values that are close (like 3.0 and 3.2) are also more likely.
If you don’t include useful prior information in a model you will have to make up for it with information from another source in order to get equally precise estimates. This source is often the data, which means you might need relatively more data when using the bootstrap. You might say that the bootstrap makes very naïve assumptions, or perhaps very conservative assumptions, but to say that the bootstrap makes no assumptions is wrong. It makes really strong assumptions: The data is discrete and values not seen in the data are impossible.
So let’s take the Bayesian bootstrap for a spin by using the cliché example of inferring a mean. I’ll compare it with using the classical non-parametric bootstrap and a Bayesian model with flat priors that assumes that the data is normally distributed. To implement the Bayesian bootstrap I’m using this handy script published at R-snippets.
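The core of the Bayesian bootstrap is only a few lines of R. Here is a minimal sketch (not the R-snippets script itself) that draws $\text{Dirichlet}(1, \ldots, 1)$ weights by normalizing Exponential(1) draws, which is equivalent, and computes a weighted mean for each draw:

```r
# Bayesian bootstrap for the mean: each replicate draws one set of
# Dirichlet(1, ..., 1) probability weights over the data points and
# returns the corresponding weighted mean.
bayes_boot_mean <- function(x, n_draws = 4000) {
  replicate(n_draws, {
    w <- rexp(length(x), rate = 1)  # Exponential(1) == Gamma(1, 1)
    w <- w / sum(w)                 # normalize -> Dirichlet(1, ..., 1)
    sum(w * x)
  })
}

set.seed(123)
x <- rnorm(30)
posterior <- bayes_boot_mean(x)
quantile(posterior, c(0.025, 0.975))  # ~95% interval for the mean
```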
Compared to the “gold standard” of the Normal Bayesian model both the classical and the Bayesian bootstrap have shorter tails, otherwise they are pretty spot on. Note also that the two bootstrap distributions are virtually identical. Here, and in the model definition, the data $x_i$ was one-dimensional, but it’s easy to generalize to bivariate data by replacing $x_i$ with $(x_{1i}, x_{2i})$ (and similarly for multivariate data).
I feel that, in the case of continuous data, the specification of the Bayesian bootstrap as given above is a bit strange. Sure, “all data as observed are discrete”, but it would be nice to have a formulation of the Bayesian bootstrap that fits more naturally with continuous data.
The Bayesian bootstrap can be characterized differently than the version given by Rubin (1981). The two versions result in the exact same inferences but I feel that the second version given below is a more natural characterization when the data is assumed continuous. It is also very similar to a Dirichlet process which means that the connection between the Bayesian bootstrap and other Bayesian non-parametric methods is made clearer.
This second characterization requires two more distributions to get going: the Dirac delta distribution and the geometric distribution. The Dirac delta distribution is so simple that it almost doesn’t feel like a distribution at all. It is written $x \sim \delta(x_0)$ and is a probability distribution with zero density except at $x_0$. Assuming, say, $x \sim \delta(5)$ is basically the same as saying that $x$ is fixed at 5. The delta distribution can be seen as a Normal distribution where the standard deviation approaches zero, as this animation from Wikipedia nicely demonstrates:
The geometric distribution is the distribution over how many “failures” there are in a number of Bernoulli trials before there is a success, where the one parameter is $p$, the relative frequency of success. Here are some geometric distributions with different $p$:
We’ll begin this second version of the Bayesian bootstrap by assuming that the data $x = (x_1, \ldots, x_N)$ is distributed as a mixture of $\delta$ distributions with $M$ components, where $\mu = (\mu_1, \ldots, \mu_M)$ are the parameters of the $\delta$s. The $\mu$s are given a flat $\text{Uniform}(-\infty,\infty)$ distribution and the mixture probabilities $\pi = (\pi_1, \ldots, \pi_M)$ are again given a $\text{Dirichlet}(0_1, \ldots,0_M)$ distribution. Finally, $M$, the number of component distributions in the mixture, is given a $\text{Geometric}(p)$ distribution where $p$ is close to 1. Here is the full model:
$$\begin{aligned} &\left. \begin{array}{l} x_i \sim \delta(\mu_{k_i}) \\ k_i \sim \text{Categorical}(\pi) \end{array} \right\} &&\text{for $i$ in $1..N$} \\ &\mu_j \sim \text{Uniform}(-\infty,\infty) &&\text{for $j$ in $1..M$} \\ &\pi \sim \text{Dirichlet}(0_1, \ldots, 0_M) \\ &M \sim \text{Geometric}(p) &&\text{with $p$ close to 1} \end{aligned}$$
There is more going on in this version of the Bayesian bootstrap, but what the model is basically assuming is this: The data comes from a limited number of values (the $\mu$s) where each value can be anywhere between $-\infty$ and $\infty$. A data point ($x_i$) comes from a specific value ($\mu_j$) with a probability ($\pi_j$), but what these probabilities $(\pi_1, \ldots, \pi_M)$ are is very uncertain (due to the Dirichlet prior). The only part that remains is how many values ($M$) the data is generated from. This is governed by the Geometric distribution where $p$ can be seen as the probability that the current number of values ($M$) is the maximum number needed. When $p \approx 1$ the number of values will be kept to a minimum unless there is overwhelming evidence that another value is needed. But since the data is distributed as a pointy Dirac $\delta$ distribution, a data set of, say, $x = (3.4, 1.2, 4.1)$ is overwhelming evidence that $M$ is at least 3, as there is no other way $x$ could take on three different values.
So, I like this characterization of the Bayesian bootstrap because it connects to Bayesian non-parametrics and it is easier for me to see how it can be extended. For example, maybe you think the Dirac $\delta$ distribution is unreasonably sharply peaked? Then just swap it for a distribution that better matches what you know about the data (a Normal distribution comes to mind). Do you want to include prior information regarding the location of the data? Then swap the $\text{Uniform}(-\infty,\infty)$ for something more informative. Is it reasonable to assume that there are between five and ten clusters / component distributions? Then replace the geometric distribution with a $\text{Discrete-Uniform}(5, 10)$. And so on. If you want to go down this path you should read up on Bayesian non-parametrics (for example, this tutorial). Actually, for those of you already into this, a third characterization of the Bayesian bootstrap is as a Dirichlet process with $\alpha \rightarrow 0$ (Clyde and Lee, 2001).
Again, the bootstrap is a very “naïve” model. A personification of the bootstrap would be a savant learning about people’s lengths, being infinitely surprised by each new length she observed. “Gosh! I knew people can be 165 cm or 167 cm, but look at you, you are 166 cm, who knew something like that was even possible?!”. However, while it will take many many examples, Betty Bootstrap will eventually get a pretty good grip on the distribution of lengths in the population. Now that I’ve written at length about the Bayesian bootstrap, what is its relation with the classical non-parametric bootstrap?
I can think of three ways the classical bootstrap of Efron (1979) can be considered a special case of the Bayesian bootstrap. Just because the classical bootstrap can be considered a special case doesn’t mean it is necessarily “better” or “worse”. But, from a Bayesian perspective, I don’t see how the classical bootstrap has any advantage over the Bayesian (except for being computationally more efficient, easier to implement and perhaps better known by the target audience of the analysis…). So in what way is the classical bootstrap a special case?
When implemented by Monte Carlo methods, both the classical and the Bayesian bootstrap produce draws that can be interpreted as probability weights over the input data. The classical bootstrap does this by “sampling with replacement”, which is another way of saying that the weights $\pi = (\pi_1, \ldots, \pi_N)$ for the $N$ data points are created by drawing counts $c = (c_1, \ldots, c_N)$ from a $\text{Multinomial}(p_1, \ldots, p_N)$ distribution with $N$ trials where all $p$s = $1/N$. Each count is then normalized to create the weights: $\pi_i = c_i / N$. For example, say we have five data points, we draw from a Multinomial and get $(0, 2, 2, 1, 0)$ which we normalize by dividing by five to get the weights $(0, 0.4, 0.4, 0.2, 0)$. With the Bayesian bootstrap, the $N$ probability weights can instead be seen as being drawn from a flat $\text{Dirichlet}(1_1, \ldots, 1_N)$ distribution. This follows directly from the model definition of the Bayesian bootstrap and an explanation for why this is the case can be found in Rubin (1981). For example, for our five data points we could get weights $(0, 0.4, 0.4, 0.2, 0)$ or $(0.26,0.1,0.41,0.01,0.22)$.
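In R the two weight-drawing schemes look like this side by side (a sketch, using five data points as in the example above):

```r
# Classical bootstrap weights: multinomial counts over N data points,
# normalized by N. Bayesian bootstrap weights: a Dirichlet(1, ..., 1)
# draw, here generated by normalizing Gamma(1) draws.
set.seed(7)
N <- 5
counts      <- as.vector(rmultinom(1, size = N, prob = rep(1 / N, N)))
classical_w <- counts / N   # always multiples of 1/N, like (0, 0.4, 0.4, 0.2, 0)
g           <- rgamma(N, shape = 1)
bayesian_w  <- g / sum(g)   # varies continuously, still sums to 1
```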
Setting aside philosophical differences, the only difference between the two methods is in how the weights are drawn, and both methods result in very similar weight distributions. The mean of either weight distribution is the same: each probability weight $\pi_j$ has a mean of $1/N$ both when using the Multinomial and the Dirichlet. The variances of the weights are almost the same: the variance of the classical bootstrap weights is $(n + 1) / n$ times that of the Bayesian bootstrap weights, and this difference grows small very quickly as $n$ gets large. These similarities are presented in Rubin’s original paper on the Bayesian bootstrap and discussed in a friendly manner by Friedman, Hastie and Tibshirani (2009) on p. 271.
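The $(n + 1)/n$ variance ratio is easy to check by simulation (a sketch, assuming $N = 10$ data points):

```r
# Compare the variance of a single probability weight under the two schemes.
set.seed(11)
N <- 10
n_draws <- 200000

# Classical: first-data-point counts from many multinomial draws, / N.
classical <- rmultinom(n_draws, size = N, prob = rep(1 / N, N))[1, ] / N

# Bayesian: first component of many Dirichlet(1, ..., 1) draws.
g <- matrix(rgamma(n_draws * N, shape = 1), ncol = N)
bayesian <- (g / rowSums(g))[, 1]

var(classical) / var(bayesian)  # should be close to (N + 1) / N = 1.1
```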
From a Bayesian perspective I find three things that are slightly strange with how the weights are drawn in the classical bootstrap (but from a sampling distribution perspective it makes total sense, of course):
Let’s try to visualize the difference between the two versions of the bootstrap! Below is a graph where each colored column is a draw of probability weights, either from a Dirichlet distribution (to the left) or using the classical resampling scheme (to the right). The first row shows the case with two data points ($N = 2$). Here the difference is large, draws from the Dirichlet vary smoothly between 0% and 100% while the resampling weights are either 0%, 50% or 100%, with 50% being roughly twice as common. However, as the number of data points increases, the resampling weights vary more smoothly and become more similar to the Dirichlet weights.
This difference in how the weights are sampled can also be seen when comparing the resulting distributions over the data. Below, the classical and the Bayesian bootstrap are used to infer a mean when applied to two, four and eight samples from a $\text{Normal(0, 1)}$ distribution. At $N = 2$ the resulting distributions look very different, but they look almost identical already at $N = 8$. (One reason for this is that we are inferring a mean; other statistics could require many more data points before the two bootstrap methods “converge”.)
It is sometimes written that “the Bayesian bootstrap can be thought of as a smoothed version of the Efron bootstrap” (Lancaster, 2003), but you could equally well think of the classical bootstrap as a rough version of the Bayesian bootstrap! Nevertheless, as $N$ gets larger the classical bootstrap quickly becomes a good approximation to the Bayesian bootstrap, and similarly the Bayesian bootstrap quickly becomes a good approximation to the classical one.
Above we saw a connection between the Bayesian bootstrap and the classical bootstrap procedure, that is, using sampling with replacement to create a distribution over some statistic. But you can also show the connection between the models underlying both methods. For the classical bootstrap the underlying model is that the distribution of the data is the distribution of the population. For the Bayesian bootstrap the values in the data define the support of the predictive distribution, but how much each value contributes to the predictive depends on the probability weights which are, again, distributed as a $\text{Dirichlet}(1, \ldots, 1)$ distribution. If we discard the uncertainty in this distribution by taking a point estimate of the probability weights, say the posterior mean, we end up with the following weights: $(1/N, \ldots, 1/N)$. That is, each data point contributes equally to the posterior predictive, which is exactly the assumption of the classical bootstrap. So if you just look at the underlying models, and skip that part where you simulate a sampling distribution, the classical bootstrap can be seen as the posterior mean of the Bayesian bootstrap.
The model of the classical bootstrap can also be put as a special case of the model for the Bayesian bootstrap, version two. In that model the probability weights $\pi = (\pi_1, \ldots, \pi_M)$ were given an uninformative $\text{Dirichlet}(\alpha_1, \ldots,\alpha_M)$ distribution with $\alpha = 0$. If we would increase $\alpha$ then combinations with more equal weights would become successively more likely:
In the limit of $\alpha \rightarrow \infty$, the only possible weight becomes $\pi = (1/M, \ldots, 1/M)$, that is, the model is “convinced” that all seen values contribute exactly equally to the predictive distribution. That is, the same assumption as in the classical bootstrap! Note that this only works if all seen data points are unique (or assumed unique) as would most often be the case with continuous data.
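A quick sketch of this limiting behavior: as $\alpha$ grows, draws from a symmetric Dirichlet concentrate on the equal-weights point $(1/M, \ldots, 1/M)$:

```r
# One draw from Dirichlet(alpha, ..., alpha) via normalized Gamma draws.
rdirichlet1 <- function(alpha, M) {
  g <- rgamma(M, shape = alpha)
  g / sum(g)
}

set.seed(3)
M <- 5
round(rdirichlet1(1, M), 2)      # uneven weights, summing to 1
round(rdirichlet1(10000, M), 2)  # every weight close to 1/M = 0.2
```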
Let’s apply the $\text{Dirichlet}(\infty, \ldots,\infty)$ version of the classical bootstrap to 30 draws from a $\text{Normal}(0, 1)$ distribution. The following animation then illustrates the uncertainty by showing draws from the posterior predictive distribution:
He he, just trolling you. Due to the $\text{Dirichlet}(\infty, \ldots,\infty)$ prior there is no uncertainty at all regarding the predictive distribution. Hence the “animation” is a still image. Let’s apply the Bayesian bootstrap to the same data. The following (actual) animation shows the uncertainty by plotting 200 draws from the posterior predictive distribution:
I like the non-parametric bootstrap, both the classical and the Bayesian version. The bootstrap is easy to explain, easy to run and often gives reasonable results (despite the somewhat weird model assumptions). From a Bayesian perspective it is also very natural to view the classical bootstrap as an approximation to the Bayesian bootstrap. Or as Friedman et al. (2009, p. 272) put it:
In this sense, the bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter. But this bootstrap distribution is obtained painlessly — without having to formally specify a prior and without having to sample from the posterior distribution. Hence we might think of the bootstrap distribution as a “poor man’s” Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.
You can also view the Bayesian bootstrap as a “poor man’s” model. A model that makes very weak assumptions (weak as in uninformative), but that can be used in case you don’t have the time and resources to come up with something better. However, it is almost always possible to come up with a model that is better than the bootstrap, or as Donald B. Rubin (1981) puts it:
[…] is it reasonable to use a model specification that effectively assumes all possible distinct values of X have been observed?
No, probably not.
Alfaro, M. E., Zoller, S., & Lutzoni, F. (2003). Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Molecular Biology and Evolution, 20(2), 255-266. pdf
Clyde, M. A., & Lee, H. K. (2001). Bagging and the Bayesian bootstrap. In Artificial Intelligence and Statistics. pdf
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The annals of Statistics, 1-26. pdf
Friedman, J., Hastie, T., & Tibshirani, R. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Freely available at http://www-stat.stanford.edu/~tibs/ElemStatLearn/ .
Gershman, S. J., & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1), 1-12. pdf
Lancaster, T. (2003). A note on bootstraps and robustness. SSRN 896764. pdf
Rubin, D. B. (1981). The Bayesian Bootstrap. The annals of statistics, 9(1), 130-134. pdf
Everybody loves speed comparisons! Is R faster than Python? Is dplyr faster than data.table? Is Stan faster than JAGS? It has been said that speed comparisons are utterly meaningless, and in general I agree, especially when you are comparing apples and oranges, which is what I’m going to do here. I’m going to compare a couple of alternatives to lm() that can be used to run linear regressions in R, but that are more general than lm(). One reason for doing this was to see how much performance you’d lose if you used one of these tools to run a linear regression (even if you could have used lm()). But as speed comparisons are utterly meaningless, my main reason for blogging about this is just to highlight a couple of tools you can use when you’ve grown out of lm(). The speed comparison was just to lure you in. Let’s run!
Below are the seven different methods that I’m going to compare by using each method to run the same linear regression. If you are just interested in the speed comparisons, just scroll to the bottom of the post. And if you are actually interested in running standard linear regressions as fast as possible in R, then Dirk Eddelbuettel has a nice post that covers just that.
lm()
This is the baseline, the “default” method for running linear regressions in R. If we have a data.frame d with the following layout:
head(d)
## y x1 x2
## 1 -64.579 -1.8088 -1.9685
## 2 -19.907 -1.3988 -0.2482
## 3 -4.971 0.8366 -0.5930
## 4 19.425 1.3621 0.4180
## 5 -1.124 -0.7355 0.4770
## 6 -12.123 -0.9050 -0.1259
Then this would run a linear regression with y as the outcome variable and x1 and x2 as the predictors:
lm(y ~ 1 + x1 + x2, data=d)
##
## Call:
## lm(formula = y ~ 1 + x1 + x2, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
glm()
This is a generalization of lm() that allows you to assume a number of different distributions for the outcome variable, not just the normal distribution you are stuck with when using lm(). However, if you don’t specify any distribution, glm() will default to a normal distribution and produce output identical to that of lm():
glm(y ~ 1 + x1 + x2, data=d)
##
## Call: glm(formula = y ~ 1 + x1 + x2, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
##
## Degrees of Freedom: 29 Total (i.e. Null); 27 Residual
## Null Deviance: 13200
## Residual Deviance: 241 AIC: 156
bayesglm()
Found in the arm
package, this is a modification of glm
that allows you to assume custom prior distributions over the coefficients (instead of the implicit flat priors of glm()
). This can be super useful, for example, when you have to deal with perfect separation in logistic regression or when you want to include prior information in the analysis. While there is bayes in the function name, note that bayesglm()
does not give you the whole posterior distribution, only point estimates. This is how to run a linear regression with flat priors, which should give similar results as when using lm()
:
library(arm)
bayesglm(y ~ 1 + x1 + x2, data = d, prior.scale=Inf, prior.df=Inf)
##
## Call: bayesglm(formula = y ~ 1 + x1 + x2, data = d, prior.scale = Inf,
## prior.df = Inf)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
##
## Degrees of Freedom: 29 Total (i.e. Null); 30 Residual
## Null Deviance: 13200
## Residual Deviance: 241 AIC: 156
nls()
While lm()
can only fit linear models, nls()
can also be used to fit non-linear models by least squares. For example, you could fit a sine curve to a data set with the following call: nls(y ~ par1 + par2 * sin(par3 + par4 * x ))
. Notice here that the syntax is a little bit different from lm()
as you have to write out both the variables and the parameters. Here is how to run the linear regression:
nls(y ~ intercept + x1 * beta1 + x2 * beta2, data = d)
## Nonlinear regression model
## model: y ~ intercept + x1 * beta1 + x2 * beta2
## data: d
## intercept beta1 beta2
## -0.293 10.364 21.225
## residual sum-of-squares: 241
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 3.05e-08
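The sine-curve fit mentioned above can be made concrete on simulated data. This is a hypothetical example of my own (the "true" curve and starting values below are invented for illustration, and are not from the post):

```r
# Simulated data where the true curve is 2 + 3 * sin(1 + 0.8 * x) plus noise.
set.seed(123)
x <- seq(0, 10, length.out = 100)
y <- 2 + 3 * sin(1 + 0.8 * x) + rnorm(100, sd = 0.3)

# Non-linear least squares is sensitive to starting values, so for a
# periodic model like this they need to be reasonably close to the truth.
sine_fit <- nls(y ~ par1 + par2 * sin(par3 + par4 * x),
                start = list(par1 = 0, par2 = 1, par3 = 1, par4 = 0.8))
round(coef(sine_fit), 2)
```

If the starting values are too far off, especially the frequency parameter par4, nls() will happily converge to a wrong local optimum or fail with a singular gradient error.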
mle2()
In the bblme
package we find mle2()
, a function for general maximum likelihood estimation. While mle2()
can be used to maximize a handcrafted likelihood function, it also has a formula interface which is simple to use, but powerful, and that plays nice with R’s built in distributions. Here is how to roll a linear regression:
library(bbmle)
inits <- list(log_sigma = rnorm(1), intercept = rnorm(1),
beta1 = rnorm(1), beta2 = rnorm(1))
mle2(y ~ dnorm(mean = intercept + x1 * beta1 + x2 * beta2, sd = exp(log_sigma)),
start = inits, data = d)
##
## Call:
## mle2(minuslogl = y ~ dnorm(mean = intercept + x1 * beta1 + x2 *
## beta2, sd = exp(log_sigma)), start = inits, data = d)
##
## Coefficients:
## log_sigma intercept beta1 beta2
## 1.0414 -0.2928 10.3641 21.2248
##
## Log-likelihood: -73.81
Note that we need to explicitly initialize the parameters before the maximization and that we now also need a parameter for the standard deviation. For an even more versatile use of the formula interface for building statistical models, check out the very cool rethinking package by Richard McElreath.
optim()
Of course, if we want to be really versatile, we can craft our own log-likelihood function to be maximized using optim(), also part of base R. This gives us all the options, but there are also more things that can go wrong: we might make mistakes in the model specification, and if the search for the optimal parameters is not initialized well, the model might not converge at all! A linear regression log-likelihood could look like this:
log_like_fn <- function(par, d) {
sigma <- exp(par[1])
intercept <- par[2]
beta1 <- par[3]
beta2 <- par[4]
mu <- intercept + d$x1 * beta1 + d$x2 * beta2
sum(dnorm(d$y, mu, sigma, log=TRUE))
}
inits <- rnorm(4)
optim(par = inits, fn = log_like_fn, control = list(fnscale = -1), d = d)
## $par
## [1] 1.0399 -0.2964 10.3637 21.2139
##
## $value
## [1] -73.81
##
## $counts
## function gradient
## 431 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
As convergence returned 0, it hopefully worked fine (a 1 indicates non-convergence). The control = list(fnscale = -1) argument is just there to make optim() do maximum likelihood estimation rather than minimum likelihood estimation (which must surely be the worst estimation method ever).
optimizing()
Stan is a stand-alone program that plays well with R and that allows you to specify a model in Stan’s language, which compiles down to very efficient C++ code. Stan was originally built for doing Hamiltonian Monte Carlo, but now also includes an optimizing() function that, like R’s optim(), allows you to do maximum likelihood estimation (or maximum a posteriori estimation, if you explicitly include priors in the model definition). Here we need to do a fair bit of work before we can fit a linear regression, but what we gain is extreme flexibility in extending this model, should we need to. We have come a long way from lm()…
library(rstan)
## Loading required package: inline
##
## Attaching package: 'inline'
##
## The following object is masked from 'package:Rcpp':
##
## registerPlugin
##
## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)
##
## Attaching package: 'rstan'
##
## The following object is masked from 'package:arm':
##
## traceplot
model_string <- "
data {
int n;
vector[n] y;
vector[n] x1;
vector[n] x2;
}
parameters {
real intercept;
real beta1;
real beta2;
real<lower=0> sigma;
}
model {
vector[n] mu;
mu <- intercept + x1 * beta1 + x2 * beta2;
y ~ normal(mu, sigma);
}
"
data_list <- list(n = nrow(d), y = d$y, x1 = d$x1, x2 = d$x2)
model <- stan_model(model_code = model_string)
fit <- optimizing(model, data_list)
fit
## $par
## intercept beta1 beta2 sigma
## -0.2929 10.3642 21.2248 2.8331
##
## $value
## [1] -46.24
So, just for fun, here is the speed comparison, first for running a linear regression with 1000 data points and 5 predictors:
This should be taken with a huge heap of salt (which is not too good for your health!). While all these methods produce a result equivalent to a linear regression, they do it in different ways, and not necessarily in equally good ways; for example, my homemade optim() routine does not converge correctly when trying to fit a model with too many predictors. As I have used the default settings, there are surely a multitude of ways in which any of these methods could be made faster. Anyway, here is what happens if we vary the number of predictors and the number of data points:
To make these speed comparisons I used the microbenchmark package; the full script replicating the plots above can be found here. This speed comparison was made on my laptop running R version 3.1.2 on 32-bit Ubuntu 12.04, with an average amount of RAM and a processor that is starting to get a bit tired.
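As a rough sketch of how such a benchmark can be set up (this is not the full script from the post; the simulated data and the choice of times = 100 are mine):

```r
library(microbenchmark)

# Simulate data with the same layout as d above: two predictors and a
# normally distributed outcome, here with 1000 data points.
set.seed(42)
d <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
d$y <- -0.3 + 10 * d$x1 + 21 * d$x2 + rnorm(1000, sd = 3)

# Run each method 100 times and collect the timing distributions.
mb <- microbenchmark(
  lm  = lm(y ~ 1 + x1 + x2, data = d),
  glm = glm(y ~ 1 + x1 + x2, data = d),
  times = 100
)
print(mb)
```

microbenchmark() reports the full distribution of timings (min, median, max, and so on) rather than a single number, which matters because the first run of each expression is often slower than the rest.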