Making a slight digression from last month’s Probable Points and Credible Intervals, here is how to summarize a 2D posterior density using a highest density ellipse. This is a straightforward extension of the highest density interval to the situation where you have a two-dimensional posterior (say, represented as a two-column matrix of samples) and you want to visualize the region that contains a given proportion of the probability while covering the most probable parameter combinations. So let’s first have a look at a fictional 2D posterior using a simple scatter plot:
plot(samples)
Whoa… that’s some serious over-plotting and it’s hard to see what’s going on. Sure, the bulk of the posterior is somewhere in that black hole, but where exactly and how much of it?
A highest posterior density ellipse shows this by covering the area that contains the most probable parameter combinations while containing p% of the posterior probability. Just as finding the highest density interval corresponds to finding the shortest interval containing p% of the probability, finding the highest density ellipse corresponds to finding the smallest ellipse containing p% of the probability, a.k.a. the minimum volume ellipse. I have spent a lot of time trying to figure out how to compute minimum volume ellipses. Wasted time, it turns out, as they can easily be computed using packages that come with R; you just have to know what you are looking for. If you just want the code, skip over the next paragraph; if you want to know the tiny bit of detective work I had to do to figure this out, read on.
To find the points in samples that are included in a minimum volume ellipse covering, say, 75% of the samples you can use cov.mve(samples, quantile.used = nrow(samples) * 0.75) from the MASS package. Here quantile.used specifies the number of points in samples that should be inside the ellipse. It uses an approximation algorithm described by Van Aelst and Rousseeuw (2009) that is not guaranteed to find the minimum volume ellipse but that will often come pretty close. One catch is that cov.mve does not return the actual ellipse; it returns a robustly estimated covariance matrix, which is not really what we are after. It does, however, return an object that contains the indices of the points covered by the minimum volume ellipse. If fit is the object returned by cov.mve, then these points can be extracted like this: points_in_ellipse <- samples[fit$best, ]. To find the ellipse we then use ellipsoidhull from the cluster package on points_in_ellipse. It returns an object representing the minimum volume ellipse, and by using its predict function we get a two-column matrix of points that lie on the hull of the ellipse, which we can finally plot.
That wasn’t too easy to figure out, but it’s pretty easy to do. The code below plots a 75% minimum volume / highest density ellipse:
library(MASS)
library(cluster)
# Finding the 75% highest density / minimum volume ellipse
fit <- cov.mve(samples, quantile.used = round(nrow(samples) * 0.75))
points_in_ellipse <- samples[fit$best, ]
ellipse_boundary <- predict(ellipsoidhull(points_in_ellipse))
# Plotting it
plot(samples, col = rgb(0, 0, 0, alpha = 0.2))
lines(ellipse_boundary, col="lightgreen", lwd=3)
legend("topleft", "75%", col = "lightgreen", lty = 1, lwd = 3)
Looking at this new plot we see that, for the bulk of the probability mass, the parameters are correlated. This correlation was not really visible in the naive scatter plot. If you rerun this code many times you will notice that the ellipse changes position slightly each time. This is due to cov.mve using a non-exact algorithm. If you have a couple of seconds to spare you can make cov.mve more exact by setting the parameter nsamp to a large number, say nsamp = 10000.
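For instance, a minimal sketch of this, where a correlated bivariate normal sample stands in for the fictional posterior used in this post (the data here are made up for illustration):

```r
library(MASS)

# Hypothetical stand-in for the posterior samples:
# 2,000 draws from a correlated bivariate normal.
set.seed(42)
samples <- mvrnorm(2000, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.7, 0.7, 1), 2, 2))

# Evaluating many more candidate subsets (nsamp) makes the resulting
# ellipse more stable between runs, at the cost of some extra time.
fit <- cov.mve(samples, quantile.used = round(nrow(samples) * 0.75),
               nsamp = 10000)
length(fit$best)  # the number of points inside the ellipse
```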
You are, of course, not limited to drawing just outlines, and if you want to draw shaded ellipses you can use the polygon function. The code below draws three shaded highest density ellipses of random color with coverages of 95%, 75% and 50%.
plot(samples, col = rgb(0, 0, 0, alpha = 0.2))
for(coverage in c(0.95, 0.75, 0.5)) {
  fit <- cov.mve(samples, quantile.used = round(nrow(samples) * coverage))
  ellipse_boundary <- predict(ellipsoidhull(samples[fit$best, ]))
  polygon(ellipse_boundary, col = sample(colors(), 1), border = NA)
}
Looks like modern aRt to me!
The function below adds a highest density ellipse to an existing plot created using base graphics:
# Adds a highest density ellipse to an existing plot
# xy: A matrix or data frame with two columns.
# If you have two variables, just cbind(x, y) them.
# coverage: The percentage of points the ellipse should cover
# border: The color of the border of the ellipse, NA = no border
# fill: The filling color of the ellipse, NA = no fill
# ... : Passed on to the polygon() function
add_hd_ellipse <- function(xy, coverage, border = "blue", fill = NA, ...) {
library(MASS)
library(cluster)
fit <- cov.mve(xy, quantile.used = round(nrow(xy) * coverage))
points_in_ellipse <- xy[fit$best, ]
ellipse_boundary <- predict(ellipsoidhull(points_in_ellipse))
polygon(ellipse_boundary, border=border, col = fill, ...)
}
So to replicate the above plot with the 75% highest density ellipse you could now write:
plot(samples)
add_hd_ellipse(samples, coverage = 0.75, border = "lightgreen", lwd=3)
Obviously, a highest density ellipse is only going to work well if the posterior is roughly elliptical. If this is not the case, an alternative is to use a 2D kernel density estimator on the samples and trace out the coverage boundaries. The function HPDregionplot in the emdbook package does exactly this:
library(emdbook)
plot(samples, col=rgb(0, 0, 0, alpha = 0.2))
HPDregionplot(samples, prob = c(0.95, 0.75, 0.5), col=c("salmon", "lightblue", "lightgreen"), lwd=3, add=TRUE)
legend("topleft", legend = c("95%", "75%", "50%"), col = c("salmon", "lightblue", "lightgreen"), lty=c(1,1,1), lwd=c(3,3,3))
You could also plot a 2D histogram of the samples, for example, using the hexagon plot in ggplot2:
library(ggplot2)
qplot(samples[,1], samples[,2], geom = "hex")
However you would have to work a bit with the color scheme if you wanted the colors to correspond to a given coverage.
Finally, if you plot a 2D density it could also be useful to add marginal density plots, as is done in the default plot for the Bayesian First Aid alternative to the correlation test. Here with completely fictional data on the number of shotguns and the number of zombie attacks per state in the U.S.:
library(BayesianFirstAid)
fit <- bayes.cor.test(no_zombie_attacks, no_shotguns_per_1000_persons)
plot(fit)
Van Aelst, S. and Rousseeuw, P. (2009). Minimum volume ellipsoid. Wiley Interdisciplinary Reviews: Computational Statistics, 1: 71–82. doi: 10.1002/wics.19, link to the paper (unfortunately behind a paywall)
Why R? Because S!
R is the open source implementation (and a pun!) of S, a language for statistical computing that was developed at Bell Labs in the late 1970s. After that, the implementation of S underwent a number of major revisions documented in a series of seminal books, often just referred to by the color of their cover: the Brown Book, the Blue Book, the White Book and the Green Book. To satisfy my techno-historical lusts I recently acquired all these books and I thought I would share some tidbits from them, highlighting how S (and thus R) developed into what we today love and cherish. But first, here are the books in chronological order from left to right:
Most of these are out of print, but all can be bought second hand on, for example, Amazon (which is where I got them and where the links above lead).
by Richard A. Becker and John M. Chambers
This book from 1984 describes not the first version of S, but the second (S2), according to the versioning used here by Chambers. It describes a language that is very similar to modern R (but also very different). We recognize friends like c…

… and plot:
But note that plot was only for scatter plots and was not a generic function producing different types of plots as in modern R. This is because S didn’t yet have objects and classes. S had, however, state-of-the-art graphing capabilities from the start, implementing the plot types described in Graphical Methods for Data Analysis (1983) (also written by John M. Chambers, and which I’ve written about here). For example, the very useful pairs function was already there:
While many things were similar to modern R, not everything was. For one thing, you could not define your own functions! Instead you would have to rely on macros:
Here ?T in the macro is another macro, producing a temporary variable name in order not to clash with any global variable name. Crazy!
We also find answers to why some of the peculiarities of modern R exist. Have you ever wondered why many function and parameter names in R are period.separated rather than underscore_separated? Well, because in S2 the underscore was an alias for <-!
On to a surprise finding… RStudio is doing great things, and for a while it has been possible to make slides using R markdown in RStudio. Is this great? Sure! Is it new? Nope… :) Slide construction was already easy to do in S anno 1984 using the vu function. This function took a string written in a special markup language…
… and produced slides on the graphic device, such as this:
Unfortunately vu didn’t make it all the way to modern R.
I don’t want to brag, but I’m gonna do it anyway: I recently got my copy of the “Brown book” signed by John Chambers himself at the UseR 2014 conference! :D
S: An Interactive Environment for Data Analysis and Graphics on Amazon
by Richard A. Becker and John M. Chambers
This book is not part of the color book canon, but I’ll include it for completeness anyway. Published the year after the Brown book, it describes how to implement new functions in S. However, as S only had support for macros, these functions would have to be written in another language (say FORTRAN) and then connected to S using a special interface language:
While not relevant to modern R, this interface language is the “ancestor” of modern day interfaces such as Rcpp and Rcpp11.
Extending The S System on Amazon
by Richard A. Becker, John M. Chambers and Allan R. Wilks
This book introduces S version three (S3) which was a major revision of S2. While S2 was primarily programmed in FORTRAN, S3 was mainly done in C. The interface language was now gone and instead C functions could be directly invoked from S functions. But what’s more, users could now easily define functions themselves!
Functions were also first-class citizens and could be passed around, thus enabling the modern apply type functions:
Computation on the language was also now possible, for example by using substitute. Some things were still different from modern-day R; take a look at the following statement:
Why lottery.number and lottery.payoff instead of lottery$number and lottery$payoff? Because data.frames didn’t yet exist! (Though it would still have been possible to stick two vectors inside a list.)
The New S Language: A Programming Environment for Data Analysis and Graphics on Amazon
edited by John M. Chambers and Trevor J. Hastie
This book “completes” the specification of S3 with three biggies: (1) data frames, (2) formulas…
… and (3) object orientation:
While the earlier books are more focused on graphics and programming, this book is all about statistical models (the title of the book might be a hint). Here we get introduced to workhorses like glm, gam, nls, tree and, not to forget, lm:
There is, however, no mention of the classical *.test functions such as t.test, binom.test and cor.test (does anybody know when they appeared in S/R?). The focus is also more on prediction and estimation than on testing; for example, p-values were not reported as part of summary.lm (which they are in modern R):
Other things that are new are ?, which can now be used to look up help pages, and a new datatype called factor. And already from the start read.table converted all strings to factors by default. :) All in all, this book was interesting to read and is still, I believe, a very good introduction to the formula interface and the lm/glm/gam type functions.
Statistical Models in S on Amazon
by John M. Chambers
This book describes S version four and focuses almost exclusively on programming and not so much on stats and graphics. A big change from S3 was the introduction of a new, more formal, system for object oriented programming:
Other than that there weren’t any eye-catching differences from S version 3. One small thing to note is that = could now be used for assignment instead of <-, and it is actually used consistently throughout the book:
Programming with Data: A Guide to the S Language on Amazon
That was all I had. If you are further interested in the history of S and R I also recommend A brief history of S (Becker, 1994), Stages in the Evolution of S (Chambers, 2000) and R: Past and future history (Ihaka, 1998).
All images and quotes included in this review are copyrighted by their respective copyright holders; however, I believe that the inclusion of these quotes and images in this review constitutes fair use.
After having broken the Bayesian eggs and prepared your model in your statistical kitchen, the main dish is the posterior. The posterior is the posterior is the posterior; given the model and the data it contains all the information you need, and anything else will be a little bit less nourishing. However, taking in the posterior in one gulp can be a bit difficult. In all but the most simple cases it will be multidimensional and difficult to plot. But even if it is one-dimensional and you could plot it (as, say, a density plot), that does not necessarily mean that it is easy to see what’s going on.
One way of getting around this is to take a bite at a time and look at summaries of the marginal posteriors of the variables of interest, the two most common types of summaries being point estimates and credible intervals (an interval that covers a certain percentage of the probability distribution). Here one is faced with a choice: which of the many ways of constructing point estimates and credible intervals to choose? This is a perfectly good question that can be given an unhelpful answer (with a predictable follow-up question):
- That depends on your loss function.
- So which loss function should I use?
The reason for this exchange is that most summaries of the posterior can be seen as minimizing a loss given one or another loss function. This way of viewing posterior summaries is part of statistical decision theory, and is useful, coherent and the topic of many books (and the possible forthcoming Part 2 of this blog post).
One can, however, also view posterior summaries as just graphical summaries. That is, as compact, convenient ways of looking at the posterior, knowing full well that these summaries are not the whole picture, just a convenient graphical representation. This post will go through the following six common point estimates and credible intervals from this perspective: the posterior mode, median and mean, and the standard deviation of the posterior, the quantile interval and the highest density interval. I will use the following hypothetical posteriors to showcase these summaries:
These distributions are more or less symmetric, skewed, bi-modal and short-tailed. They are also more or less commonly encountered, with the top-left symmetric heap shaped distribution being the archetypal well behaved posterior and the lower-right distribution being extremely badass bi-modal.
If you would represent a density plot by a point and an interval a reasonable choice would be to use the mode (the highest point) and the interval that contains the highest part of the density. Below is the mode and the highest density interval covering 95% of the probability density for the six example posteriors:
It is, of course, up to you whether you think these points and intervals represent the underlying probability densities well or not. It obviously does not work well in the bi-modal case and it might be a bit strange that the point estimate for the upper-middle distribution is pushed all the way to the left, but otherwise I think it works pretty well. For most of the densities I get a pretty good idea regarding what the underlying posterior looks like.
There is no function in base R to directly calculate these measures given a sample s representing the posterior. A quick-n-dirty function for estimating the mode can be defined by taking the maximum of a density estimate, like this:
estimate_mode <- function(s) {
d <- density(s)
d$x[which.max(d$y)]
}
estimate_mode(s)
There are also many functions for estimating the mode in the modeest package, for example, the half sample mode:
library(modeest)
mlv(s, method = "HSM")
A function for estimating the highest density interval is available as part of the coda package:
library(coda)
HPDinterval(mcmc(s), 0.95)
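If you would rather not depend on coda, a highest density interval can also be estimated directly from the samples: slide a window over the sorted sample and keep the narrowest interval that contains 95% of the points. Here is a minimal sketch, with an exponential sample standing in for a hypothetical skewed posterior:

```r
# Minimal HDI from samples: among all intervals containing 95% of the
# sorted sample, keep the narrowest one.
hdi_from_samples <- function(s, coverage = 0.95) {
  s <- sort(s)
  n_in <- ceiling(length(s) * coverage)
  # Width of every candidate interval covering n_in consecutive points
  widths <- s[n_in:length(s)] - s[1:(length(s) - n_in + 1)]
  i <- which.min(widths)
  c(lower = s[i], upper = s[i + n_in - 1])
}

set.seed(42)
s <- rexp(100000, rate = 1)  # hypothetical skewed posterior
hdi_from_samples(s)          # close to c(0, 3.0) for this distribution
```

For a skewed sample like this the interval hugs the high-density end near zero, and it should agree closely with what HPDinterval gives on the same sample.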
It is often pointed out that the mode and the highest density interval are not invariant to transformations of the x-axis, however, neither is a density plot. When viewing these summaries as graphical summaries I would consider this non-invariance a feature rather than a bug.
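To see this non-invariance concretely, here is a small sketch using a hypothetical lognormal posterior: estimating the mode on the original scale, and estimating it on the log scale and transforming back, give clearly different answers, just as a density plot looks different on the two scales.

```r
# Quick-n-dirty mode estimate, as defined earlier in the post
estimate_mode <- function(s) {
  d <- density(s)
  d$x[which.max(d$y)]
}

set.seed(42)
s <- rlnorm(100000, meanlog = 0, sdlog = 1)  # hypothetical skewed posterior

mode_direct  <- estimate_mode(s)            # near exp(-1), about 0.37
mode_via_log <- exp(estimate_mode(log(s)))  # near exp(0) = 1
c(mode_direct, mode_via_log)
```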
The median and the quantile interval have, to me, the advantage of being really easy to interpret. Both sit smack-in-the-middle of the distribution, with the median having 50% of the probability to its left and 50% to its right and the quantile interval leaving, say, 2.5% probability on either side. It’s also hard not to love medians after having read John Tukey’s Exploratory Data Analysis. Here is what these two summaries look like when applied to our six example distributions:
The median and quantile interval can easily be calculated using functions in base R:
median(s)
quantile(s, c(0.025, 0.975)) # for a 95% interval.
The mean and the standard deviation (SD) seem to be everybody’s favorite measures of central tendency and spread. They are, however, not my favorite choice for graphical summaries of posteriors. When the distribution is skewed and fat-tailed the mean can be far from the center of the density. The SD, when used to plot a symmetric interval around the mean, can misrepresent the posterior by extending out into low probability land. Below are the posterior means and SD intervals for the six example distributions. The SD intervals extend 1.96 * sd(s) out from the mean, which would result in a 95% credible interval if the posterior had been normally distributed.
This approach doesn’t work that well in, for example, the middle-lower case as the mean is far from the center of the density plot and the SD interval extends too far to the left.
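The computation itself is a one-liner. Sketched here with a hypothetical skewed sample, it also shows the interval escaping into impossible territory:

```r
set.seed(42)
s <- rexp(100000, rate = 1)  # skewed stand-in for a posterior sample

mean(s)
c(lower = mean(s) - 1.96 * sd(s),
  upper = mean(s) + 1.96 * sd(s))
# For this all-positive distribution the lower limit is negative,
# i.e. the interval extends into a region with zero probability.
```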
Let’s compare the three approaches. Below we have the mode with a highest density interval, the median with a quantile interval and the mean with an interval showing 1.96 × the SD of the posterior:
Some observations:
Other than that it is up to you to choose which graphical summary you feel most comfortable with.
There is no specific reason for why the point estimates and credible intervals have to be paired as above, you can mix ‘n’ match. Let’s plot the medians together with the highest density interval:
There is also nothing stopping you from plotting two, or more, intervals at the same time. The plot below shows the mode with both 95% and 50% highest density intervals.
Functions to create plots like these, with two credible intervals with different coverage, are implemented in the mcmcplots package and in the ggmcmc package.
Another option, if you want to visualize many posteriors at the same time, is to use a more compact density plot such as the violin plot (here using the implementation available in ggplot2):
Big data is all the rage, but sometimes you don’t have big data. Sometimes you don’t even have average size data. Sometimes you only have eleven unique socks:
Karl Broman is here putting forward a very interesting problem. Interesting, not only because it involves socks, but because it involves what I would like to call Tiny Data™. The problem is this: Given the Tiny dataset of eleven unique socks, how many socks does Karl Broman have in his laundry in total?
If we had Big Data we might have been able to use some clever machine learning algorithm to solve this problem such as bootstrap aggregated neural networks. But we don’t have Big Data, we have Tiny Data. We can’t pull ourselves up by our bootstraps because we only have socks (eleven to be precise). Instead we will have to build a statistical model that includes a lot more problem specific information. Let’s do that!
We are going to start by building a generative model, a simulation of the I’m-picking-out-socks-from-my-laundry process. First we have a couple of parameters that, just for now, I’m going to give arbitrary values:
n_socks <- 18 # The total number of socks in the laundry
n_picked <- 11 # The number of socks we are going to pick
In an ideal world all socks would come in pairs, but we’re not living in an ideal world and some socks are odd (a.k.a. singletons). So out of the n_socks let’s say we have:
n_pairs <- 7 # for a total of 7*2=14 paired socks.
n_odd <- 4
We are now going to create a vector of socks, represented as integers, where each pair/singleton is given a unique number.
socks <- rep( seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)) )
socks
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 9 10 11
Finally we are going to simulate picking out n_picked socks (or at most n_socks, if n_picked > n_socks) and counting the number of sock pairs and unique socks.
picked_socks <- sample(socks, size = min(n_picked, n_socks))
sock_counts <- table(picked_socks)
sock_counts
## picked_socks
## 1 3 4 5 7 8 9 10 11
## 1 2 2 1 1 1 1 1 1
c(unique = sum(sock_counts == 1), pairs = sum(sock_counts == 2))
## unique pairs
## 7 2
So for this particular run of the sock picking simulation we ended up with two pairs and seven unique socks. So far so good, but what about the initial problem? How to estimate the actual number of socks in Karl’s laundry? Oh, but what you might not realize is that we are almost done! :)
Approximate Bayesian Computation (ABC) is a super cool method for fitting models with the benefits of (1) being pretty intuitive and (2) only requiring the specification of a generative model, and with the disadvantages of (1) being extremely computationally inefficient if implemented naïvely and (2) requiring quite a bit of tweaking to work correctly when working with even quite small datasets. But we are not working with Quite Small Data, we are working with Tiny Data! Therefore we can afford a naïve and inefficient (but straightforward) implementation. Fiercely hand-waving, the simple ABC rejection algorithm goes like this:
1. Draw a tentative parameter value from the prior distribution.
2. Simulate a dataset from the generative model using this tentative parameter value.
3. If the simulated dataset matches the actual data, keep the tentative parameter value, otherwise throw it away.
4. Repeat steps 1–3 a large number of times; the retained parameter values are now a sample from the posterior distribution.
For a less brief introduction to ABC see the tutorial on Darren Wilkinson’s blog. The paper by Rubin (1984) is also a good read, even if it doesn’t explicitly mention ABC.
So what’s left until we can estimate the number of socks in Karl Broman’s laundry? Well, we have a reasonable generative model; however, we haven’t specified any prior distributions over the parameters we are interested in: n_socks, n_pairs and n_odd. Here we can’t afford to use non-informative priors (that’s a luxury reserved for the Big Data crowd), we need to use all the information we have. Also, the trade-off isn’t so much about how informative/biased we should be but rather about how much time/resources we can spend on developing reasonable priors. The following is what I whipped up in half an hour, and it could surely be improved upon:
What can be said about n_socks, the number of socks in Karl Broman’s laundry, before seeing any data? Well, we know that n_socks must be positive (no anti-socks) and discrete (socks are not continuous). A reasonable choice would perhaps be to use a Poisson distribution as a prior; however, the Poisson is problematic in that both its mean and its variance are set by the same parameter. Instead we could use the more flexible cousin of the Poisson, the negative binomial. In R the rnbinom function is parameterized by the mean mu and size. While size is not the variance, there is a direct correspondence between size and the variance s^2:
size = -mu^2 / (mu - s^2)
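This correspondence is easy to sanity-check numerically: with mu = 30 and a target SD of 15 the formula gives size ≈ 4.6, and a large sample of rnbinom draws should then have roughly that mean and SD.

```r
mu <- 30
s  <- 15  # the target standard deviation
size <- -mu^2 / (mu - s^2)  # = -900 / (30 - 225), about 4.6

set.seed(42)
draws <- rnbinom(100000, mu = mu, size = size)
mean(draws)  # should be close to 30
sd(draws)    # should be close to 15
```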
If you are a family of 3-4 persons and you change socks around 5 times a week, then a guesstimate would be that you have something like 15 pairs of socks in the laundry. It is reasonable that you have at least some socks in your laundry, but it is also possible that you have many more than 15 * 2 = 30 socks. So as a prior for n_socks I’m going to use a negative binomial with mean prior_mu = 30 and standard deviation prior_sd = 15.
prior_mu <- 30
prior_sd <- 15
prior_size_param <- -prior_mu^2 / (prior_mu - prior_sd^2)
n_socks <- rnbinom(1, mu = prior_mu, size = prior_size_param)
Instead of putting a prior distribution directly over n_pairs and n_odd I’m going to put it on the proportion of socks that come in pairs, prop_pairs. I know some people keep all their socks neatly paired, but only 3/4 of my socks are in a happy relationship. So on prop_pairs I’m going to put a Beta prior distribution that places most of the probability over the range 0.75 to 1.0. Since socks are discrete entities we’ll also have to do some rounding to go from prop_pairs to n_pairs and n_odd.
prop_pairs <- rbeta(1, shape1 = 15, shape2 = 2)
n_pairs <- round(floor(n_socks / 2) * prop_pairs)
n_odd <- n_socks - n_pairs * 2
Now we have a generative model with reasonable priors, and what’s left is to use the ABC rejection algorithm to generate a posterior distribution of the number of socks in Karl Broman’s laundry. The following code brings all the earlier steps together and generates 100,000 samples from the generative model, which are saved, together with the corresponding parameter values, in sock_sim:
n_picked <- 11 # The number of socks to pick out of the laundry
sock_sim <- replicate(100000, {
# Generating a sample of the parameters from the priors
prior_mu <- 30
prior_sd <- 15
prior_size <- -prior_mu^2 / (prior_mu - prior_sd^2)
n_socks <- rnbinom(1, mu = prior_mu, size = prior_size)
prop_pairs <- rbeta(1, shape1 = 15, shape2 = 2)
n_pairs <- round(floor(n_socks / 2) * prop_pairs)
n_odd <- n_socks - n_pairs * 2
# Simulating picking out n_picked socks
socks <- rep(seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)))
picked_socks <- sample(socks, size = min(n_picked, n_socks))
sock_counts <- table(picked_socks)
# Returning the parameters and counts of the number of matched
# and unique socks among those that were picked out.
c(unique = sum(sock_counts == 1), pairs = sum(sock_counts == 2),
n_socks = n_socks, n_pairs = n_pairs, n_odd = n_odd, prop_pairs = prop_pairs)
})
# just transposing sock_sim to get one variable per column
sock_sim <- t(sock_sim)
head(sock_sim)
## unique pairs n_socks n_pairs n_odd prop_pairs
## [1,] 7 2 32 15 2 0.9665
## [2,] 7 2 21 9 3 0.9314
## [3,] 3 4 20 8 4 0.8426
## [4,] 11 0 47 23 1 0.9812
## [5,] 9 1 36 15 6 0.8283
## [6,] 7 2 16 5 6 0.6434
We have used quite a lot of prior knowledge, but so far we have not used the actual data. In order to turn our simulated samples in sock_sim into posterior samples, informed by the data, we need to throw away those simulations that resulted in simulated data that doesn’t match the actual data. The data we have is that out of eleven picked socks, eleven were unique and zero were matched, so let’s remove all simulated samples that do not match this.
post_samples <- sock_sim[sock_sim[, "unique"] == 11 &
sock_sim[, "pairs" ] == 0 , ]
And we are done! Given the model and the data, the 11,506 remaining samples in post_samples now represent the information we have about the number of socks in Karl Broman’s laundry. What remains is just to explore what post_samples says about the number of socks. The following plot shows the prior sock distributions in green and the posterior sock distributions (post_samples) in blue:
Here the vertical red lines show the median posterior, a “best guess” for the respective parameter. There is a lot of uncertainty in the estimates, but our best guess (the median posterior) would be that there are 19 pairs of socks and 6 odd socks in Karl Broman’s laundry, for a total of 19 × 2 + 6 = 44 socks.
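To make those median estimates reproducible, here is how they can be read off the posterior samples. The simulation from earlier is re-run here so the block is self-contained (with 20,000 draws rather than 100,000, to keep it quick), so the medians will wobble slightly from run to run:

```r
set.seed(42)
n_picked <- 11
sock_sim <- t(replicate(20000, {
  # Draw parameters from the priors
  prior_size <- -30^2 / (30 - 15^2)
  n_socks <- rnbinom(1, mu = 30, size = prior_size)
  prop_pairs <- rbeta(1, shape1 = 15, shape2 = 2)
  n_pairs <- round(floor(n_socks / 2) * prop_pairs)
  n_odd <- n_socks - n_pairs * 2
  # Simulate picking out the socks and count pairs/singles
  socks <- rep(seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)))
  picked <- sample(socks, size = min(n_picked, n_socks))
  counts <- table(picked)
  c(unique = sum(counts == 1), pairs = sum(counts == 2),
    n_socks = n_socks, n_pairs = n_pairs, n_odd = n_odd)
}))

# Keep only the simulations matching the data: 11 unique, 0 pairs
post_samples <- sock_sim[sock_sim[, "unique"] == 11 &
                         sock_sim[, "pairs"] == 0, ]

median(post_samples[, "n_socks"])  # around 44
median(post_samples[, "n_pairs"])  # around 19
median(post_samples[, "n_odd"])    # around 6
```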
How well did we do? Fortunately Karl Broman later tweeted the actual number of pairs and odd socks:
Which totals 21 × 2 + 3 = 45 socks. We were only off by one sock! Our estimate of the number of odd socks was a little bit high (Karl Broman is obviously much better at organizing his socks than me) but otherwise we are actually amazingly spot on! All thanks to carefully selected priors, approximate Bayesian computation and Tiny Data.
For full disclosure I must mention that Karl made this tweet before I had finished the model, however, I tried hard not to be influenced by that when choosing the priors of the model.
If you look at the plot of the priors and posteriors above you will notice that they are quite similar. “Did we even use the data?”, you might ask, “it didn’t really seem to make any difference anyway…”. Well, since we are working with Tiny Data it is only natural that the data does not make a huge difference. It does however make a difference. Without any data (also called No Data™) the median posterior number of socks would be 28, a much worse estimate than the Tiny Data estimate of 44. We could also see what would happen if we would fit the model to a different dataset. Say we got four pairs of socks and three unique socks out of the eleven picked socks. We would then calculate the posterior like this…
post_samples <- sock_sim[sock_sim[, "unique"] == 3 &
sock_sim[, "pairs" ] == 4 , ]
The posterior would look very different and our estimate of the number of total socks would be as low as 15:
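The No Data figure mentioned above is also easy to check: with no data there is nothing to condition on, so the posterior over n_socks is just its prior, and the median can be taken directly from prior draws.

```r
set.seed(42)
prior_mu <- 30
prior_sd <- 15
prior_size <- -prior_mu^2 / (prior_mu - prior_sd^2)

# With no data the "posterior" over n_socks is just the prior
median(rnbinom(100000, mu = prior_mu, size = prior_size))  # around 28
```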
So our Tiny Data does matter! There are, of course, other ways to criticize my model:
As a final note, don’t forget that you can follow Karl Broman on Twitter, he is a cool guy (even his tweets about laundry are thought provoking!) and if you aren’t already tired of clothes related analyses and Tiny Data you can follow me too :)
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151-1172. full text