Not surprisingly, this year’s useR! conference was a great event with heaps of talented researchers and R developers showing off the latest and greatest R packages. (A surprise visit from Donald Knuth didn’t hurt either.) What was extra great this year was that all talks were recorded, including mine. So if you want to know more about how the non-parametric bootstrap is really a Bayesian procedure, and how you can run the Bayesian bootstrap in R using my bayesboot package, just press play. :)
If you want to know even more you can take a look at the slides, the abstract, and the package.
Since almost every presentation at useR! 2016 was filmed you can head over and start browsing right now. Here are some recommendations if you don’t know where to start:
FiveThirtyEight’s data journalism workflow with R. In this fun and easygoing presentation Andrew Flowers describes how R is used at the well-known data-driven news site FiveThirtyEight.
Notebooks with R Markdown. J.J. Allaire shows off the notebook (à la iPython) feature which will be integrated in future versions of RStudio.
jailbreakr: Get out of Excel, free. Jenny Bryan successfully explains why spreadsheets aren’t that horrible and introduces jailbreakr, an ambitious new package with the goal of intelligently importing spreadsheets into R.
One new (to me) package that made me excited was the future package by Henrik Bengtsson. A future is a programming language construct that acts as a stand-in for a value that has not been computed yet and, here is the cool part, can be computed in parallel while the program is doing something else. Using futures is therefore a way of parallelizing your code, and now R has them! The future package implements a number of functions, but the real magic is the %<-% operator, which creates a future. It’s easier to show than to explain:
library(future)
plan(multiprocess) # For cross platform parallel computation
slow_computation_1 <- function() { Sys.sleep(5); 40 }
slow_computation_2 <- function() { Sys.sleep(5); 2 }
system.time({
a %<-% slow_computation_1() # This happens at the same time as ...
b %<-% slow_computation_2() # ...this and the program will wait ...
cat("The answer is", a + b, "\n") # ... here for the functions to finish.
})
## The answer is 42
## user system elapsed
## 0.052 0.000 5.053
So (almost) all you have to do is to replace <- with %<-% and suddenly your program runs in parallel! The recording of the future package presentation unfortunately suffers from bad sound quality, but Henrik has written a detailed vignette that explains the package.
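If you prefer an explicit function-call interface over the %<-% operator, the package also exposes future() and value(). Here is a quick sketch of how I understand that API (check the package vignette for the authoritative version):

```r
library(future)
plan(multiprocess) # For cross platform parallel computation

# future() starts evaluating the expression in the background ...
f <- future({ Sys.sleep(5); 40 + 2 })

# ... you are free to do other work here, and value() then blocks
# until the result is ready.
value(f)
## [1] 42
```

Under the hood %<-% is essentially this pattern, with the call to value() happening implicitly the first time you touch the assigned variable.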
All in all, another great useR! and I’m already looking forward to the next one in Brussels! :)
Today I’m extraordinarily pleased because today I solved an actual real-world problem using R. Sure, I’ve solved many esoteric statistical problems with R, but I’m not sure if any of those solutions have escaped the digital world and made some impact ex silico.
It is now summer and in Sweden that means that many people tend to overhaul and rebuild their wooden decks as you need somewhere to sit during those precious few weeks of +20°C (70° F) weather. And so, we also decided to rebuild our algae ridden, half-rotten deck and everything went well until we got to the point where we had to construct the last steps leading into the house. As we had been slightly sloppy when buying planks we only had five left, and when naïvely measuring out the lengths we needed it seemed that the planks were not long enough. Now the problem was this: Was there some way we could saw the planks into the lengths we needed or did we have to go all the way to the lumber yard to get more planks?
These were the planks we had (in centimeters):
planks_we_have <- c(120, 137, 220, 420, 480)
And these were the planks lengths we wanted (again in cm):
planks_we_want <- c(19, 19, 19, 19, 79, 79, 79, 103, 103,
103, 135, 135, 135, 135, 160)
If you just naïvely saw the smallest planks into the smallest plank lengths you’ll end up sawing the following:
120 -> 19, 19, 19, 19
137 -> 79
220 -> 79, 79
420 -> 103, 103, 103
480 -> 135, 135, 135
But using this “algorithm” we end up lacking material for the 135 cm and the 160 cm plank! However, it could be the case that if we just saw the plank lengths in a smarter way the planks we have would suffice. At this point I could have exclaimed “Ah, but isn’t this problem just a special case of the multiple knapsack problem?! Finally my course in Algorithm theory will pay off!” (but I really didn’t).
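To double-check that the naive allocation really falls short, here is a quick sketch (not part of the original workflow) that subtracts each naive allocation from its plank:

```r
# The planks we have and the naive allocation from above.
planks_we_have <- c(120, 137, 220, 420, 480)
naive_cuts <- list(c(19, 19, 19, 19), c(79), c(79, 79),
                   c(103, 103, 103), c(135, 135, 135))

# Length left over on each plank after the naive cuts (ignoring saw width).
leftover <- planks_we_have - sapply(naive_cuts, sum)
leftover
## [1]  44  58  62 111  75
```

The longest leftover piece is 111 cm, so neither the remaining 135 cm length nor the 160 cm length fits anywhere.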
The knapsack problem is a famous problem in computer science where the objective is to find the combination of items that has the highest total value under the constraint that their total weight is less than a given weight. My mental image of this problem is that of a thief trying to stuff a knapsack with the most valuable goods under the constraint that s/he can only carry a certain weight. The multiple knapsack problem is just the generalization where there is more than one knapsack. In our case each plank can be seen as a “knapsack” where each plank length is an “item” we can allocate to a plank. If we set the value of each plank length to its actual length (so the 79 cm plank length is worth 79), and solve the multiple knapsack problem, we’ll end up getting the most sawed plank pieces possible given the material we have!
Now, we could code up a brute force solution in R but, as we wanted the deck built sooner rather than later, I first did a search on CRAN and was happy to find the adagio package which contains “Discrete and Global Optimization Routines”. Among these we find the mknapsack function which, given vectors of item values, item weights and knapsack capacities, solves the multiple knapsack problem. It returns a list containing the vector ksack that indicates which knapsack each item should go into. Here is now the R code to solve our plank sawing problem:
library(adagio)
# mknapsack calling signature is: mknapsack(values, weights, capacities)
solution <- mknapsack(planks_we_want, planks_we_want + 1, planks_we_have)
# Above I added +1 cm to each length to compensate for the loss when sawing.
solution$ksack
## [1] 1 4 4 1 1 3 4 5 5 5 5 4 2 3 4
# That is, cut plank length 1 from plank 1, plank length 2 from plank 4, etc.
# Now pretty printing what to cut so that we don't make mistakes...
assignment <- data.frame(
cut_this = planks_we_have[solution$ksack],
into_this = planks_we_want)
t(assignment[order(assignment[,1]), ])
## 1 4 5 13 6 14 2 3 7 12 15 8 9 10 11
## cut_this 120 120 120 137 220 220 420 420 420 420 420 480 480 480 480
## into_this 19 19 79 135 79 135 19 19 79 135 160 103 103 103 135
Tada! Turns out our existing planks were long enough to saw into the pieces we wanted after all. I guess we could have figured this out by ourselves, but thanks to the adagio package the R solution was pretty painless. Here are the planks post saw:
And here is the final result:
So, I guess it was a good thing that I studied that algorithms course and that I took a break from C++ to learn R. Now I’m just waiting for the R-package that does all the sawing and hammering for you…
I recently wrapped up a version of my R function for easy Bayesian bootstrappin’ into the package bayesboot. This package implements a function, also named bayesboot, which performs the Bayesian bootstrap introduced by Rubin in 1981. The Bayesian bootstrap can be seen as a smoother version of the classical non-parametric bootstrap, but I prefer seeing the classical bootstrap as an approximation to the Bayesian bootstrap :)
The implementation in bayesboot can handle both summary statistics that work on a weighted version of the data (such as weighted.mean) and summary statistics that work on a resampled data set (like median). As bayesboot just got accepted on CRAN you can install it in the usual way:
install.packages("bayesboot")
You’ll find the source code for bayesboot on GitHub.
If you want to know more about the model behind the Bayesian bootstrap you can check out my previous blog post on the subject and, of course, the original paper by Rubin (1981).
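To give a rough feel for the model: one way of sketching the Bayesian bootstrap (a hypothetical re-implementation for illustration, not the actual code in bayesboot) is to repeatedly draw uniform Dirichlet weights over the data points and apply a weighted summary statistic:

```r
# A minimal sketch of the Bayesian bootstrap, assuming a statistic that
# accepts (data, weights). This is an illustration, not the bayesboot code.
bayes_boot_sketch <- function(data, weighted_statistic, n_draws = 4000) {
  replicate(n_draws, {
    # Dirichlet(1, ..., 1) weights, generated by normalizing Exponential(1) draws
    weights <- rexp(length(data), rate = 1)
    weights <- weights / sum(weights)
    weighted_statistic(data, weights)
  })
}

heights <- c(183, 192, 182, 183, 177, 185, 188, 188, 182, 185)
posterior_means <- bayes_boot_sketch(heights, weighted.mean)
hist(posterior_means)
```

Compare this with the classical bootstrap, which in effect uses discrete Multinomial resampling weights instead of the smooth Dirichlet weights.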
bayesboot in action

As in a previous post on the Bayesian bootstrap, here is again a Bayesian bootstrap analysis of the mean height of American presidents using the heights of the last ten presidents:
# Heights of the last ten American presidents in cm (Kennedy to Obama).
heights <- c(183, 192, 182, 183, 177, 185, 188, 188, 182, 185)
The bayesboot function needs, at least, a vector of data and a function implementing a summary statistic. Here we have the data heights and we’re going with the sample mean as our summary statistic:
library(bayesboot)
b1 <- bayesboot(heights, mean)
The resulting posterior distribution over probable mean heights can now be plotted and summarized:
summary(b1)
## Bayesian bootstrap
##
## Number of posterior draws: 4000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## V1 184.5 1.181 182.1 186.8
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## V1 182.2 183.7 184.5 185.3 186.9
##
## Call:
## bayesboot(data = heights, statistic = mean)
plot(b1)
A shout-out to Mike Meredith and John Kruschke who implemented the great BEST and HDInterval packages, which summary and plot utilize. Note here that the mean in the summary and plot above refers to the mean of the posterior distribution and not the sample mean of any presidents.
While it is possible to use a summary statistic that works on a resample of the original data, it is more efficient to use a summary statistic that works on a reweighting of the original dataset. So instead of using mean as above it would be better to use weighted.mean, like this:
b2 <- bayesboot(heights, weighted.mean, use.weights = TRUE)
The result will be almost the same as before, but the above will be somewhat faster to compute.
A call to bayesboot will always result in a data.frame with one column per dimension of the summary statistic. If the summary statistic does not return a named vector the columns will be called V1, V2, etc. The result of a bayesboot call can be further inspected and post processed. For example:
# Given the model and the data, this is the probability that the mean
# heights of American presidents is above the mean heights of
# American males as given by www.cdc.gov/nchs/data/series/sr_11/sr11_252.pdf
# (One TRUE and one FALSE are appended so the estimate never hits exactly 0 or 1.)
mean( c(b2$V1 > 175.9, TRUE, FALSE) )
## [1] 0.9998
If we want to compare the means of two groups, we will have to call bayesboot twice, once with each dataset, and then use the resulting samples to calculate the posterior difference. For example, let’s say we also have the heights of the opponents that lost to the presidents the first time those presidents were elected. Now we are interested in comparing the mean height of American presidents with the mean height of the presidential candidates that lost.
# The heights of opponents of American presidents (first time they were elected).
# From Richard Nixon to John McCain
heights_opponents <- c(182, 180, 180, 183, 177, 173, 188, 185, 175)
# Running the Bayesian bootstrap for both datasets
b_presidents <- bayesboot(heights, weighted.mean, use.weights = TRUE)
b_opponents <- bayesboot(heights_opponents, weighted.mean, use.weights = TRUE)
# Calculating the posterior difference and converting back to a
# bayesboot object for pretty plotting.
b_diff <- as.bayesboot(b_presidents - b_opponents)
plot(b_diff)
So there is some evidence that winning presidents are a couple of cm taller than losing opponents. (Though, I must add that it is quite unclear what the purpose really is of analyzing the heights of presidents and opponents…)
The README and documentation of bayesboot contain more examples. If you find any bugs or have suggestions for improvements consider submitting an issue on GitHub.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9(1), 130–134. link to paper
Bayesian data analysis is cool, Markov chain Monte Carlo is the cool technique that makes Bayesian data analysis possible, and wouldn’t it be coolness if you could do all of this in the browser? That was what I thought, at least, and I’ve now made bayes.js: A small JavaScript library that implements an adaptive MCMC sampler and a couple of probability distributions, and that makes it relatively easy to implement simple Bayesian models in JavaScript.
Here is a motivating example: Say that you have the heights of the last ten American presidents…
// The heights of the last ten American presidents in cm, from Kennedy to Obama
var heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185];
… and that you would like to fit a Bayesian model assuming a Normal distribution to this data. Well, you can do that right now by clicking “Start sampling” below! This will run an MCMC sampler in your browser implemented in JavaScript.
If this doesn’t seem to work in your browser, for some reason, then try this version of the demo.
Here is the model you just sampled from…
$$\mu \sim \text{Normal}(0, 1000) \\ \sigma \sim \text{Uniform}(0, 1000) \\ \text{heights}_i \sim \text{Normal}(\mu, \sigma) ~~~ \text{for} ~ i ~ \text{in} ~ 1..n$$
… and this is how it is implemented in JavaScript:
/* The code below assumes that you have loaded the two modules of bayes.js:
* - mcmc.js which implements the sampler and creates the global
* object mcmc.
* - distributions.js which implements a number of log density functions
* for common probability distributions and that creates the global object
* ld (as in log density).
*/
// The data
var heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185];
// Parameter definitions
var params = {
mu: {type: "real"},
sigma: {type: "real", lower: 0}};
// Model definition
var log_post = function(state, heights) {
var log_post = 0;
// Priors (here sloppy and vague...)
log_post += ld.norm(state.mu, 0, 1000);
log_post += ld.unif(state.sigma, 0, 1000);
// Likelihood
for(var i = 0; i < heights.length; i++) {
log_post += ld.norm(heights[i], state.mu, state.sigma);
}
return log_post;
};
// Initializing the sampler, burning some draws to the MCMC gods,
// and generating a sample of size 1000.
var sampler = new mcmc.AmwgSampler(params, log_post, heights);
sampler.burn(1000);
var samples = sampler.sample(1000);
I’ve implemented a JavaScript MCMC procedure for fitting a Bayesian model before, but that was just for a specific model (I also implemented an MCMC procedure in BASIC, but don’t ask me why…). The idea with bayes.js is to make it easier for me (and maybe for you) to make demos of Bayesian procedures that are easy to put online. If you would like to know more about bayes.js just head over to its GitHub page where you will find the code and a README file full of details. You can also check out a couple of interactive demos that I’ve made:
(They show off the kind of simple models that AmwgSampler can do.) These demos rely on the plotly library and I haven’t tested them extensively on different platforms/browsers. You should be able to change the data and model definition on the fly (but if you change some stuff, like adding multidimensional variables, the plotting might stop working).
The two major files in bayes.js are:

mcmc.js: Implements the sampler, an adaptive Metropolis-within-Gibbs (AmwgSampler) algorithm presented by Roberts and Rosenthal (2009). Loading this file in the browser creates the global object mcmc.

distributions.js: Implements a number of log density functions for common probability distributions. The functions are prefixed with ld.* (for example, ld.norm and ld.pois) and use the same parameters as the d* density functions in R. Loading this file in the browser creates the global object ld.

In addition to this, the whole thing is wrapped in an RStudio project as I’ve used R and JAGS to write some tests.
One caveat: using ld.norm as defined in distributions.js resulted in 10x slower sampling on Firefox 37.

Roberts, G. O., & Rosenthal, J. S. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2), 349–367. pdf