I recently wrapped up a version of my R function for easy Bayesian bootstrappin’ into the package `bayesboot`. This package implements a function, also named `bayesboot`, which performs the Bayesian bootstrap introduced by Rubin in 1981. The Bayesian bootstrap can be seen as a smoother version of the classical non-parametric bootstrap, but I prefer seeing the classical bootstrap as an approximation to the Bayesian bootstrap. :)
The implementation in `bayesboot` can handle both summary statistics that work on a weighted version of the data (such as `weighted.mean`) and summary statistics that work on a resampled dataset (like `median`). As `bayesboot` just got accepted on CRAN you can install it in the usual way:
install.packages("bayesboot")
You’ll find the source code for `bayesboot` on GitHub.
If you want to know more about the model behind the Bayesian bootstrap you can check out my previous blog post on the subject and, of course, the original paper by Rubin (1981).
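In brief, each draw from the Bayesian bootstrap posterior reweights the data with a draw from a uniform Dirichlet distribution. A minimal base R sketch of the idea (not the actual bayesboot implementation) could look like this:

```r
# One Bayesian bootstrap draw: weight the data by a Dirichlet(1, ..., 1)
# draw, obtained by normalizing standard exponential random numbers.
bb_mean <- function(data, n_draws = 4000) {
  replicate(n_draws, {
    w <- rexp(length(data))
    weighted.mean(data, w / sum(w))
  })
}

heights <- c(183, 192, 182, 183, 177, 185, 188, 188, 182, 185)
draws <- bb_mean(heights)
c(mean = mean(draws), sd = sd(draws))
```

The mean of these draws should land close to the sample mean of the data, with the spread of the draws quantifying the uncertainty.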
bayesboot in action

As in a previous post on the Bayesian bootstrap, here is again a Bayesian bootstrap analysis of the mean height of American presidents, using the heights of the last ten presidents:
# Heights of the last ten American presidents in cm (Kennedy to Obama).
heights <- c(183, 192, 182, 183, 177, 185, 188, 188, 182, 185)
The `bayesboot` function needs, at least, a vector of data and a function implementing a summary statistic. Here we have the data `heights` and we’re going with the sample `mean` as our summary statistic:
library(bayesboot)
b1 <- bayesboot(heights, mean)
The resulting posterior distribution over probable mean heights can now be `plot`ted and `summary`ized:
summary(b1)
## Bayesian bootstrap
##
## Number of posterior draws: 4000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## V1 184.5 1.181 182.1 186.8
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## V1 182.2 183.7 184.5 185.3 186.9
##
## Call:
## bayesboot(data = heights, statistic = mean)
plot(b1)
A shout-out to Mike Meredith and John Kruschke, who implemented the great BEST and HDInterval packages that `summary` and `plot` utilize. Note here that the point mean in the summary and plot above refers to the mean of the posterior distribution, not the sample mean of the presidents’ heights.
While it is possible to use a summary statistic that works on a resample of the original data, it is more efficient to use a summary statistic that works on a reweighting of the original dataset. So instead of using `mean` as above, it would be better to use `weighted.mean`, like this:
b2 <- bayesboot(heights, weighted.mean, use.weights = TRUE)
The result will be almost the same as before, but the above will be somewhat faster to compute.
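To see why the two modes agree, here is a base R sketch (again, not bayesboot’s actual implementation) where each posterior draw uses one set of uniform Dirichlet weights, either fed directly to a weighted statistic or used as resampling probabilities:

```r
heights <- c(183, 192, 182, 183, 177, 185, 188, 188, 182, 185)
n_draws <- 2000

# Mode 1: the summary statistic accepts weights directly.
by_weights <- replicate(n_draws, {
  w <- rexp(length(heights))    # a Dirichlet(1,...,1) draw, up to normalization
  weighted.mean(heights, w)     # weighted.mean normalizes the weights itself
})

# Mode 2: the summary statistic only accepts data, so each weighted draw
# is approximated by resampling a large dataset with the weights as
# resampling probabilities.
by_resampling <- replicate(n_draws, {
  w <- rexp(length(heights))
  mean(sample(heights, size = 4000, replace = TRUE, prob = w))
})
```

Up to Monte Carlo error the two give the same posterior, but the resampling detour makes the second mode slower.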
A call to `bayesboot` will always result in a `data.frame` with one column per dimension of the summary statistic. If the summary statistic does not return a named vector, the columns will be called `V1`, `V2`, etc. The result of a `bayesboot` call can be further inspected and post-processed. For example:
# Given the model and the data, this is the probability that the mean
# height of American presidents is above the mean height of
# American males, as given by www.cdc.gov/nchs/data/series/sr_11/sr11_252.pdf
# (one TRUE and one FALSE are added to keep the estimate away from exactly 0 or 1)
mean( c(b2$V1 > 175.9, TRUE, FALSE) )
## [1] 0.9998
If we want to compare the means of two groups, we will have to call `bayesboot` twice, once with each dataset, and then use the resulting samples to calculate the posterior difference. For example, say we have the heights of the opponents that lost to the presidents in `heights` the first time those presidents were elected. We are now interested in comparing the mean height of American presidents with the mean height of the presidential candidates that lost.
# The heights of opponents of American presidents (first time they were elected).
# From Richard Nixon to John McCain
heights_opponents <- c(182, 180, 180, 183, 177, 173, 188, 185, 175)
# Running the Bayesian bootstrap for both datasets
b_presidents <- bayesboot(heights, weighted.mean, use.weights = TRUE)
b_opponents <- bayesboot(heights_opponents, weighted.mean, use.weights = TRUE)
# Calculating the posterior difference and converting back to a
# bayesboot object for pretty plotting.
b_diff <- as.bayesboot(b_presidents - b_opponents)
plot(b_diff)
So there is some evidence that winning presidents are a couple of cm taller than their losing opponents. (Though, I must add, it is quite unclear what the purpose really is of analyzing the heights of presidents and opponents…)
The README and documentation of `bayesboot` contain more examples. If you find any bugs or have suggestions for improvements, consider submitting an issue on GitHub.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9(1), 130–134. link to paper
Bayesian data analysis is cool, Markov chain Monte Carlo is the cool technique that makes Bayesian data analysis possible, and wouldn’t it be coolness if you could do all of this in the browser? That was what I thought, at least, and I’ve now made bayes.js: A small JavaScript library that implements an adaptive MCMC sampler and a couple of probability distributions, and that makes it relatively easy to implement simple Bayesian models in JavaScript.
Here is a motivating example: Say that you have the heights of the last ten American presidents…
// The heights of the last ten American presidents in cm, from Kennedy to Obama
var heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185];
… and that you would like to fit a Bayesian model assuming a Normal distribution to this data. Well, you can do that right now by clicking “Start sampling” below! This will run an MCMC sampler in your browser implemented in JavaScript.
If this doesn’t seem to work in your browser, for some reason, then try this version of the demo.
Here is the model you just sampled from…
$$\mu \sim \text{Normal}(0, 1000) \\ \sigma \sim \text{Uniform}(0, 1000) \\ \text{heights}_i \sim \text{Normal}(\mu, \sigma) ~~~ \text{for} ~ i ~ \text{in} ~ 1..n$$
… and this is how it is implemented in JavaScript:
/* The code below assumes that you have loaded the two modules of bayes.js:
* - mcmc.js which implements the sampler and creates the global
* object mcmc.
* - distributions.js which implements a number of log density functions
* for common probability distributions and that creates the global object
* ld (as in log density).
*/
// The data
var heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185];
// Parameter definitions
var params = {
mu: {type: "real"},
sigma: {type: "real", lower: 0}};
// Model definition
var log_post = function(state, heights) {
var log_post = 0;
// Priors (here sloppy and vague...)
log_post += ld.norm(state.mu, 0, 1000);
log_post += ld.unif(state.sigma, 0, 1000);
// Likelihood
for(var i = 0; i < heights.length; i++) {
log_post += ld.norm(heights[i], state.mu, state.sigma);
}
return log_post;
};
// Initializing the sampler, burning some draws to the MCMC gods,
// and generating a sample of size 1000.
var sampler = new mcmc.AmwgSampler(params, log_post, heights);
sampler.burn(1000);
var samples = sampler.sample(1000);
I’ve implemented a JavaScript MCMC procedure for fitting a Bayesian model before, but that was just for a specific model (I also implemented an MCMC procedure in BASIC, but don’t ask me why…). The idea with bayes.js is to make it easier for me (and maybe for you) to make demos of Bayesian procedures that are easy to put online. If you would like to know more about bayes.js just head over to its GitHub page, where you will find the code and a README file full of details. You can also check out a couple of interactive demos that I’ve made:
(among them a demo showing off what `AmwgSampler` can do…) These demos rely on the plotly library and I haven’t tested them extensively on different platforms/browsers. You should be able to change the data and model definition on the fly (but if you change some stuff, like adding multidimensional variables, the plotting might stop working).
The two major files in bayes.js are:

- mcmc.js: implements the adaptive Metropolis-within-Gibbs (`AmwgSampler`) algorithm presented by Roberts and Rosenthal (2009). Loading this file in the browser creates the global object `mcmc`.
- distributions.js: implements log density functions for a number of common probability distributions. They are named `ld.*` (for example, `ld.norm` and `ld.pois`) and use the same parameters as the `d*` density functions in R. Loading this file in the browser creates the global object `ld`.

In addition to this, the whole thing is wrapped in an RStudio project, as I’ve used R and JAGS to write some tests.
A performance caveat: using `ld.norm` as defined in distributions.js resulted in 10x slower sampling on Firefox 37.

Roberts, G. O., & Rosenthal, J. S. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2), 349-367. pdf
On the 21st of February, 2015, my wife had not had her period for 33 days, and as we were trying to conceive, this was good news! An average period is around a month, and if you are a couple trying to go triple, then a missing period is a good sign something is going on. But at 33 days, this was not yet a missing period, just a late one, so just how good was this news? Pretty good, really good, or just meh?
To get at this I developed a simple Bayesian model that, given the number of days since your last period and your history of period onsets, calculates the probability that you are going to be pregnant this period cycle. In this post I will describe what data I used, the priors I used, the model assumptions, and how to fit it in R using importance sampling. And finally I show you why the result of the model really didn’t matter in the end. Also I’ll give you a handy script if you want to calculate this for yourself. :)
During the last part of 2014 my wife kept a journal of her period onsets, which was good luck for me, as otherwise I would have ended up with tiny data again. In total we had the dates of eight period onsets, but the data I used was not the onsets themselves but the number of days between the onsets:
period_onset <- as.Date(c("2014-07-02", "2014-08-02", "2014-08-29", "2014-09-25",
"2014-10-24", "2014-11-20", "2014-12-22", "2015-01-19"))
days_between_periods <- as.numeric(diff(period_onset))
So the onsets occur pretty regularly, hovering around a cycle of 28 days. The last onset was on the 19th of January, so on the 21st of February there had been 33 days since the last onset.
I was constructing a model covering period cycles, pregnancies and infertility, and as such it was obviously going to make huge simplifications. Here are the specific assumptions I made:

- The number of days between periods (`days_between_periods`) is assumed to be normally distributed with unknown mean (`mean_period`) and standard deviation (`sd_period`).
- The probability of getting pregnant during a period cycle is 0.19 (more about where this number comes from below) if you are fertile as a couple (`is_fertile`). Unfortunately not all couples are fertile, and if you are not, then the probability of getting pregnant is 0. If fertility is coded as 0-1, this can be compactly written as `0.19 * is_fertile`.
- The probability of not getting pregnant over a number of period cycles (`n_non_pregnant_periods`) is then `(1 - 0.19 * is_fertile)^n_non_pregnant_periods`.
- If you are pregnant there will be no more periods, so the number of days to the next period (`next_period`) is going to be more than the current number of days since the last period (`days_since_last_period`). That is, the probability of `next_period < days_since_last_period` is zero. This sounds strange because it is so obvious, but we’re going to need it in the model.

That was basically it! But in order to fit this I was going to need a likelihood function: a function that, given fixed parameters and some data, calculates the probability of the data given those parameters or, more commonly, something proportional to a probability, that is, a likelihood. And as this likelihood can be extremely tiny, I needed to calculate it on the log scale to avoid numerical problems. When crafting a log likelihood function in R, the general pattern is this:
1. Start with a log likelihood of zero (`log_like <- 0.0`).
2. Using density functions (such as `dnorm`, `dbinom` and `dpois`) you calculate the likelihoods of the different parts of the model. You then multiply these likelihoods together, which on the log scale corresponds to adding the log likelihoods to `log_like`.
3. To make the `d*` functions return log likelihoods, just add the argument `log = TRUE`. Also remember that a likelihood of 0.0 corresponds to a log likelihood of `-Inf`.

So, a log likelihood function corresponding to the model above would then be:
calc_log_like <- function(days_since_last_period, days_between_periods,
mean_period, sd_period, next_period,
is_fertile, is_pregnant) {
n_non_pregnant_periods <- length(days_between_periods)
log_like <- 0
if(n_non_pregnant_periods > 0) {
log_like <- log_like + sum( dnorm(days_between_periods, mean_period, sd_period, log = TRUE) )
}
log_like <- log_like + log( (1 - 0.19 * is_fertile)^n_non_pregnant_periods )
if(!is_pregnant && next_period < days_since_last_period) {
log_like <- -Inf
}
log_like
}
Here the data are the scalar `days_since_last_period` and the vector `days_between_periods`, and the rest of the arguments are the parameters to be estimated. Using this function I could now get the log likelihood for any data + parameter combination. However, I still only had half a model; I also needed priors!
To complete this model I needed priors on all the parameters. That is, I had to specify what information the model has before seeing the data. Specifically, I needed priors on `mean_period`, `sd_period`, `is_fertile`, and `is_pregnant` (while `next_period` is also a parameter, I didn’t need to give it an explicit prior, as its distribution is completely specified by `mean_period` and `sd_period`). I also needed to find a value for the probability of becoming pregnant in a cycle (which I set to 0.19 above). Did I use vague, “objective” priors here? No, I went looking in the fertility literature for something more informative!
For the distribution of `days_between_periods` the parameters were `mean_period` and `sd_period`. Here I used estimates from the article The normal variabilities of the menstrual cycle (Cole et al., 2009), which measured the regularity of periods in 184 women aged 18-36 years. The grand mean number of days between periods was 27.7 days, with the SD of the per-participant means being 2.4. The group SD of the number of days between periods was 1.6. Given these estimates, I decided to put a Normal(27.7, 2.4) distribution over `mean_period` and a Half-Normal distribution with mean 1.6 over `sd_period`, corresponding to a Half-Normal with a SD of 2.05. Here they are:
For the parameters `is_fertile` and `is_pregnant` I based the priors on frequencies. The proportion of couples that are fertile is tricky to define, as there are different definitions of infertility. Van Geloven et al. (2013) made a small literature review and found that between 2% and 5% of all couples could be considered infertile. As I’ve seen numbers as high as 10%, I decided to go with the higher end of this range and put a prior probability of 100% - 5% = 95% on a couple being fertile.
`is_pregnant` is a binary parameter standing for whether the couple is going to get (or already is) pregnant the current cycle. The prior I used here was the probability of getting pregnant in a cycle. This probability is obviously 0.0 if the couple is infertile, but how large a proportion of active, fertile couples get pregnant in a period cycle? Unfortunately I didn’t find a source that explicitly stated this, but I found something close. On page 53 of Increased Infertility With Age in Men and Women, Dunson et al. (2004) give the proportion of couples trying to conceive who did not get pregnant within 12 cycles, stratified by the age of the woman:
prop_not_preg_12_cycles <- c("19-26 years" = 0.08,
"27-34 years" = 0.13,
"35-39 years" = 0.18)
Using some back-of-the-R-script calculations I calculated the probability of conceiving in a cycle: As these proportions presumably include infertile couples, I started by subtracting 0.05, the proportion of couples that I assumed are infertile. My wife was in the 27-34 years bracket, so the probability of us not conceiving within 12 cycles, given that we are fertile, was then 0.13 - 0.05. If p is the probability of not getting pregnant during one cycle, then $p^{12} = 0.13 - 0.05$ is the probability of not getting pregnant during twelve cycles and, as p is positive, we have that $p = (0.13 - 0.05)^{1/12}$. The probability of getting pregnant in one cycle is then 1 - p, and the probabilities for the three age groups are:
1 - (prop_not_preg_12_cycles - 0.05)^(1/12)
## 19-26 years 27-34 years 35-39 years
## 0.25 0.19 0.16
So that’s where the 19% probability of conceiving came from in the log likelihood function above, and 19% is what I used as a prior for `is_pregnant`. Now I had priors for all parameters, and I could construct a function that returned samples from the prior:
sample_from_prior <- function(n) {
prior <- data.frame(mean_period = rnorm(n, 27.7, 2.4),
sd_period = abs(rnorm(n, 0, 2.05)),
is_fertile = rbinom(n, 1, 0.95))
prior$is_pregnant <- rbinom(n, 1, 0.19 * prior$is_fertile)
prior$next_period <- rnorm(n, prior$mean_period, prior$sd_period)
prior$next_period[prior$is_pregnant == 1] <- NA
prior
}
It takes one argument (`n`) and returns a `data.frame` with `n` rows, each row being a sample from the prior. Let’s try it out:
sample_from_prior(n = 4)
## mean_period sd_period is_fertile is_pregnant next_period
## 1 29 1.24 1 0 30
## 2 29 3.73 1 0 28
## 3 27 1.29 1 1 NA
## 4 27 0.57 0 0 27
Notice that `is_pregnant` can only be `1` if `is_fertile` is `1`, and that there is no `next_period` if the couple `is_pregnant`.
I had now collected the triforce of Bayesian statistics: the prior, the likelihood and the data. There are many algorithms I could have used to fit this model, but here a particularly convenient method was importance sampling. I’ve written about importance sampling before, but let’s recap: importance sampling is a Monte Carlo method that is very easy to set up and that can work well if (1) the parameter space is small and (2) the priors are not too dissimilar from the posterior. As my parameter space was small, and because I used pretty informative priors, I thought importance sampling would suffice here. The three basic steps in importance sampling are:

1. Generate a large sample from the prior. (Here using `sample_from_prior`.)
2. Calculate the likelihood of each draw from the prior. (Here using `calc_log_like`.)
3. Resample the prior draws, with replacement, with each draw weighted by its likelihood. (Here using `sample`.)

(Note that there are some variations to this procedure, but when used to fit a Bayesian model this is a common version of importance sampling.)
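As a self-contained illustration of these three steps, here is a toy importance sampling run (not the period model: just a Normal mean with known SD 1, a vague Normal(0, 10) prior, and three made-up data points):

```r
set.seed(1)
y <- c(2.1, 1.9, 2.3)   # made-up toy data
n <- 20000

# 1. Generate a large sample from the prior.
prior_mu <- rnorm(n, 0, 10)

# 2. Calculate the log likelihood of each draw.
log_like <- sapply(prior_mu, function(mu) sum(dnorm(y, mu, 1, log = TRUE)))

# 3. Resample the draws weighted by likelihood (subtracting the maximum
#    before exponentiating avoids numerical underflow).
post_mu <- prior_mu[sample(n, replace = TRUE, prob = exp(log_like - max(log_like)))]

c(mean = mean(post_mu), sd = sd(post_mu))
```

The result should be close to the analytically available posterior here, a Normal with mean ≈ 2.09 and SD ≈ 0.58, which is a handy way to check that the resampling step is doing its job.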
The result of importance sampling is a new sample which, if the importance sampling worked OK, can be treated as a sample from the posterior. That is, it represents what the model knows after having seen the data. Since I had already defined `sample_from_prior` and `calc_log_like`, defining a function in R doing importance sampling was trivial:
sample_from_posterior <- function(days_since_last_period, days_between_periods, n_samples) {
prior <- sample_from_prior(n_samples)
log_like <- sapply(1:n_samples, function(i) {
calc_log_like(days_since_last_period, days_between_periods,
prior$mean_period[i], prior$sd_period[i], prior$next_period[i],
prior$is_fertile[i], prior$is_pregnant[i])
})
posterior <- prior[ sample(n_samples, replace = TRUE, prob = exp(log_like)), ]
posterior
}
So, on the 21st of February, 2015, my wife had not had her period for 33 days. Was this good news? Let’s run the model and find out!
post <- sample_from_posterior(33, days_between_periods, n_samples = 100000)
`post` is now a long data frame where the distribution of the parameter values represents the posterior information regarding those parameters.
head(post)
## mean_period sd_period is_fertile is_pregnant next_period
## 33231 28 2.8 0 0 37
## 22386 27 2.4 1 1 NA
## 47489 27 2.1 1 1 NA
## 68312 28 2.3 1 1 NA
## 37341 29 2.9 1 1 NA
## 57957 30 2.6 1 0 36
Let’s start by looking at the mean and standard deviation of the number of days between each period:
As expected, the posteriors are narrower than the priors, and, looking at the posteriors, it’s probable that the mean period cycle is around 29 days with a SD of 2-3 days. Now to the important questions: What’s the probability that we are a fertile couple, and what’s the probability that we were pregnant on the 21st of February? To calculate this we can just take `post$is_fertile` and `post$is_pregnant` and calculate the proportion of 1s in these vectors. A quick way of doing this is just to take the `mean`:
mean(post$is_fertile)
## [1] 0.97
mean(post$is_pregnant)
## [1] 0.84
So it was pretty good news: It’s very probable that we are a fertile couple and the probability that we were pregnant was 84%! Using this model I could also see how the probability of us being pregnant would change if the period onset would stay away a couple of days more:
post <- sample_from_posterior(34, days_between_periods, n_samples = 100000)
mean(post$is_pregnant)
## [1] 0.92
post <- sample_from_posterior(35, days_between_periods, n_samples = 100000)
mean(post$is_pregnant)
## [1] 0.96
Yeah, while we are at it, why not see how the probability of us being fertile and pregnant changed during the months we were trying to conceive:
So, this makes sense. As the time since the last period gets longer, the probability that we are going to be pregnant the current cycle increases, but as soon as there is a period onset that probability falls back to baseline. We see the same pattern for the probability of being fertile, but for every period cycle we didn’t get pregnant, the probability of us being fertile gets slightly lower. Both these graphs are a bit jagged, but this is just due to the variability of the importance sampling algorithm. Also note that, while the graphs above are pretty, there is no real use in looking at the probability over time; the only thing that’s informative and matters is the current probability.
But all of this doesn’t really matter now. And the probabilities I calculated do not matter either. Before-the-fact probabilities reflect uncertainty regarding an outcome or a statement, but after-the-fact there is no uncertainty left. What uncertainty the probabilities represented is gone. I’m certain that my wife and I were pregnant on the 21st of February, and I know this because one week ago, on the 29th of October, we received this little guy:
If probabilities matter to you you can find a script implementing this model here which you can run on your own data, or a friend’s. In this post I left out the code creating the plots, but all of that can be found here. And, as this post was really just an excuse to post baby photos, here are some more of me and my son checking out the statistical literature:
Doing Bayesian Data Analysis makes him a bit sleepy, but that’s OK, he’ll come around! Looking at Fisher’s Statistical Methods for Research Workers on the other hand makes him furious…
Bortot, P., Masarotto, G., & Scarpa, B. (2010). Sequential predictions of menstrual cycle lengths. Biostatistics, 11(4), 741-755. doi: 10.1093/biostatistics/kxq020
Cole, L. A., Ladner, D. G., & Byrn, F. W. (2009). The normal variabilities of the menstrual cycle. Fertility and sterility, 91(2), 522-527. doi: 10.1016/j.fertnstert.2007.11.073
Dunson, D. B., Baird, D. D., & Colombo, B. (2004). Increased infertility with age in men and women. Obstetrics & Gynecology, 103(1), 51-56. doi: 10.1097/01.AOG.0000100153.24061.45
Van Geloven, N., Van der Veen, F., Bossuyt, P. M. M., Hompes, P. G., Zwinderman, A. H., & Mol, B. W. (2013). Can we distinguish between infertility and subfertility when predicting natural conception in couples with an unfulfilled child wish?. Human Reproduction, 28(3), 658-665. doi: 10.1093/humrep/des428
Wilcox, A. J., Dunson, D., & Baird, D. D. (2000). The timing of the “fertile window” in the menstrual cycle: day specific estimates from a prospective study. Bmj, 321(7271), 1259-1262. doi: 10.1136/bmj.321.7271.1259, pdf
Romantic kissing is a cultural universal, right? Nope! At least not if you are to believe Jankowiak et al. (2015), who surveyed a large number of cultures and found that “sexual-romantic kissing” occurred in far from all of them. For some reason the paper didn’t include a world map with these kissers and non-kissers plotted out. So, with the help of my colleague Andrey Anikin, I’ve now made such a map using R and the excellent leaflet package. Click on the image below to check it out:
Jankowiak, W.R., Volsche, S.L. & Garcia, J.R., 2015. Is the Romantic-Sexual Kiss a Near Human Universal? American Anthropologist, 117(3), 535-539. doi: 10.1111/aman.12286 pdf