“Behind every great point estimate stands a minimized loss function.” – Me, just now
This is a continuation of Probable Points and Credible Intervals, a series of posts on Bayesian point and interval estimates. In Part 1 we looked at these estimates as graphical summaries, useful when it’s difficult to plot the whole posterior in good way. Here I’ll instead look at points and intervals from a decision theoretical perspective, in my opinion the conceptually cleanest way of characterizing what these constructs are.
If you don’t know that much about Bayesian decision theory, just chillax. When doing Bayesian data analysis you get it “pretty much for free” as esteemed statistician Andrew Gelman puts it. He then adds that it’s “not quite right because it can take effort to define a reasonable utility function.” Well, perhaps not free, but it is still relatively straight forward! I will use a toy problem to illustrate how Bayesian decision theory can be used to produce point estimates and intervals. The problem is this: Our favorite droid has gone missing and we desperately want to find him!
Robo went missing 23:00 yesterday and haven’t been seen since. We know he disappeared somewhere within a 120 miles long strip of land and we are going to mount a search operation. Our top scientists have been up all night analyzing the available data and we just received the result: the probability of Robo being in different locations.
So, this is (in Bayesian lingo) a posterior distribution, the probability of different “states” after having analyzed the available data. Here the “state” is the location of Robo and looking at the posterior above it seems like he could be in a lot of places. Most likely he is in the forest, somewhere between 75 and 120 miles from the reference point (arbitrarily set to the left most position on the map). He might also be hiding in the plains, either around the 15th or the 40th mile. It’s not that likely that he’s in the mountains, but we can’t dismiss it altogether.
- So, where should we start looking for Robo?
- Well, that depends…
- Depends on what?
- Your loss function.
A loss function is some method of calculating how bad a decision would be if the world is in a certain state. In our case the state is the location of Robo, the decision could be where to start looking for him, and badness could be the time it will take to find him (we want to find him fast!). If we knew the state of the world, we could find the best decision: the decision that minimizes the loss. Now, we don’t actually know that state but, if we have a Bayesian model that we believe does a good job, we can use the resulting posterior to represent our knowledge about that state. That is, we are going to plug in a possible decision, and a posterior distribution, to our loss function and the result will be a probability distribution over how large the loss might be. Doing this is really easy, especially if the posterior is represented as a sample of values (which is almost always the case when doing Bayesian data analysis anyway). Of course, we could skip a formal decision analysis and just look at the posterior and make a non-formalized decision. In many case that might be the preferred course, but it’s not why we are here today.
So we call our science team up and ask them to send over the posterior represented as a large sample of positions, let’s call that list s
. Here are the first 16 samples in s
:
head(s, n = 16)
## [1] 15 101 41 89 14 41 83 112 33 33 94 104 88 82 77 18
As expected, these values are mostly clustered around the 15th, 40th and 90th mile. Say that our loss function is the distance from where we start searching to the location of Robo, and our decision is to start the search at the 60th mile. To get a probability distribution for the loss we simply apply the loss function to each sample in s
. For the first sample the loss is abs(15 - 60)
= 45, for the second sample it’s abs(101 - 60)
= 41, and so on. Below is the resulting posterior loss, given the decision to start searching at the 60th mile:
The plot above also show the expected loss (here the mean of the posterior distance from the 60th mile). This is a common measure of how good a decision is and the final step in a decision analysis would be to find the decision that minimizes the expected loss. And that’s it! As Gelman mentioned, the hard part is defining a reasonable loss function, but once you have done that, it’s straight forward to find the decision that minimizes the expected loss.
The rest of the post is dedicated to showing how one can define different loss functions for the “Where’s Robo?” scenario. I will start out with some simple loss functions that result in point estimates and end with some more complicated loss function that result in interval estimates.
A Bayesian point estimate is the result of a decision analysis where you (or perhaps your computer) have found the best point/location/value given a posterior distribution and some loss function. Meanwhile, the management have decided that the search party will be deployed by helicopter and that, once on the ground, it will split into two groups, one searching to the right and one searching to the left. Now we only need to decide where to start the search for Robo. Thus, we desperately need a Bayesian point estimate and to get that we need a loss function!
When it comes to loss functions there are three usual suspects: squared loss, absolute loss and 0-1 loss (also known as L2, L1 and L0 loss). We’ve already seen absolute loss (L1), it’s when the loss is the distance between the decision and the state of the world, that is, the absolute value of the difference between the point x
and the state s
. In R code:
absolute_loss <- function(x, s) {
mean(abs(x - s))
}
Due to taking the mean
, this function will also return the expected absolute loss when s
is a posterior sample. For our present purpose this is a pretty decent loss function. Assuming that the two search groups walk at constant speed, this function will minimize the expected time/cost it takes to find Robo.
Squared loss (L2) is another common loss function:
squared_loss <- function(x, s) {
mean((x - s)^2)
}
Using this loss function would mean that we consider it four times as bad if it takes twice the time to find Robo (again assuming the two search groups walk at constant speed). So squared loss might not make that much sense for the present problem.
The last loss function is 0-1 loss (L0) which assigns zero loss to a decision that is correct and one loss to an incorrect decision. Given this loss function the best decision is to choose the most probable state. This loss function make sense if you are, say, defusing a bomb and need to choose between the green, blue and red wire (if you make the right decision = no loss of limbs, cut the wrong wire = Boom!). When searching for Robo it doesn’t really make sense to say that starting the search 1 mile from Robo’s location is as bad as starting it on the moon. As Robo’s position is a real number, the posterior probability of him being in any specific position is practically zero. In this continuous case we can instead use the posterior probability density. If you have a sample from the posterior (s
) then the density can be approximated using density(s)
. As the resulting density is given at discrete points we have to use approx
to interpolate the density at the decision x
, and we have to negate the resulting density estimate to turn this into a loss. Here is the whole function:
zero_one_loss <- function(x, s) {
# This below is proportional to 0-1 loss.
- approx(density(s), xout = x)$y
}
So, where should we start the search operation according to these three loss functions? To figure this out we just need to determine what decision minimizes the expected loss. This can be done in more or less intelligent ways, but I went brute force and just tried all positions from the 0th to the 120th mile. Here are the resulting point estimates (with the loss functions below):
According to the absolute loss criteria (L1) we should start looking in the forest, according to the quadratic loss (L2) we should start in the mountains and 0-1 loss (L0) goes for the single most probable location at the 15th mile. The way with which I found these point estimates is very general, evaluate the loss function all possible decisions (or a representative sample) and pick the decision with the smallest expected loss. However, for these specific loss functions there is a much easier way: the minimum expected absolute loss corresponds to the median of the posterior, the quadratic loss corresponds to the mean, and 0-1 loss corresponds to the mode. That is, the same three point estimates we looked at in Part 1 of Probable Points and Credible Intervals! Why exactly this is the case is beautifully explained by John Myles White on his blog.
Note, that using the three loss functions above result in widely different decisions, it’s a big difference between landing the search team in the forest and in the windy mountains, and it’s a bit strange that the loss functions don’t consider aspects of the problem such as the terrain. Going forward I will explore a couple of different loss functions more suited to the “Where’s Robot?” scenario. This is not because these are loss functions that are especially useful and widely applicable, but rather because I want to show how easy it is to define new loss functions when doing Bayesian decision analysis.
We got a call from from management and they have decided that instead of sending a search team, we are going to do a satellite scan that is guaranteed to find any robot within a radius of 30 miles, and now they want to know where to target it. This calls for another loss function! As with 0-1 loss we want to minimize the expectation of not finding Robo, but now it is within a certain radius around the decision point x
. In R:
limited_dist_loss <- function(x, s, max_dist) {
mean(abs(x - s) > max_dist)
}
This code calculates the expectation of Robo being outside the max_dist
radius around x
, where max_dist
should be set to 30 in our case. Using this loss function with our posterior s
gives us the following graph:
So, we should center the scan on the 89th mile, which will scan the forest and part of the mountains, and which will result in a 1.0 - 0.4 = 0.6 probability of finding Robo.
We got new info from management: The droid carries some space station plans critical to the empire something something. Anyway, we need to find Robo fast, within 24 hours! Again we are going to deploy a search team that will split into two groups, but this time we need to consider how long time it will take for the teams to find him. It takes different amounts of time to search different types of terrain: a mile of plains takes one hour, a mile of forest takes five hours and a mile of mountains takes ten. The list cover_time
encodes the time it takes to search each mile, here we have the 48th to the 54th mile:
cover_time[48:54]
## plain plain plain mountain mountain mountain mountain
## 1 1 1 10 10 10 10
The following loss function calculates the expectation of not finding Robo within max_time
hours by calculating the time to Robo’s location from the starting point x
and then taking the expectation of this time being longer than max_time
:
limited_time_loss <- function(x, s, max_time) {
time_to_robo <- sapply(s, function(robo_pos) { sum(cover_time[x:robo_pos]) })
mean(time_to_robo > max_time )
}
This is how the expected loss looks for different starting points with max_time
set to 24:
Our best bet is to start at the 27th mile which means we will cover the whole plains area within 24 hours. If we instead had to find Robo within 72 hours it would be better to start at the 90th mile, as we now would have time to search the forest region:
A Bayesian interval estimate is the result of a decision analysis where you have found the best interval given a posterior distribution and some loss function. To decide what interval or region to search through is perhaps a more natural decision when looking for Robo, rather than deciding where to land a search team. In part one we looked at some different type of intervals, one was the highest density interval (HDI) defined as the shortest interval that contains a given percentage of the posterior probability. The HDI can also be defined as the interval that minimizes an expected loss (the specific loss function is derived here, but is a tiny bit complicated). Say that we want to find Robo with 90% probability while having to search through the smallest region. The best decision would then be the following 90% HDI:
A strange thing with this type of interval is that we limit the probability of finding Robo. Surely we would like to find Robo with the highest probability possible. What’s limiting us from finding Robo with a 100% probability should be time, effort or cost.
Let’s define an interval that is limited by cost instead. The management have decided that the search operation will cost \$1000. One hour of search costs \$100 and using our knowledge about how long time it takes to search a mile of each type of terrain we can calculate the corresponding search cost:
search_cost <- cover_time * 100
search_cost[48:54]
## plain plain plain mountain mountain mountain mountain
## 100 100 100 1000 1000 1000 1000
So searching through the 48th to the 54th mile would cost \$4300, a bit over budget. What we want to find is the interval with the highest expectation of finding Robo but that costs no more than \$1000 to search through. The loss function is basically just a variation of limited_time_loss
but with two parameters: the lower and the upper endpoints of the interval. To find the best interval I, again, just try all combinations of upper and lower endpoints and pick out the interval with the lowest loss (highest expectation of finding Robo) which costs \$1000 or less to search through:
So, if we just have \$1000 to spend we should go for the easy option and just search through the high probability region on the plains. What if we had \$3000 to spend?
Then we should search through almost the whole plains region. What if we had \$20,000?
Then we should go for the forest (and still stay away from the mountains, they don’t make good fiscal sense).
Now, we call up management and tell them that “it’s all good and well that you want to spend \$1000 on finding Robo, but why \$1000 exactly? And wouldn’t you want to spend as little as possible?” They tell us that, yes, they would like to spend as little as possible and the reason for the \$1000 figure is because that’s what Robo is worth. Ok, so what we want to do is to mount a search that maximizes the expected profit considering that Robo is worth \$1000 and that it costs \$100 per search hour. This calls for a utility function (something you want to maximize) which is just the opposite of a loss function (something you want to minimize). While any utility function can easily be cast into a loss function, it’s sometimes more natural to think of maximizing utility (say in finance) than minimizing losses. Loss functions and utility functions can both be called objective functions.
This is going to be the most complicated objective function in this post. The search is going to happen like this: We are going to start the search at one location, search in one direction, and stop the search if (1) we find Robo or (2) we have reached the location that marks the end of the search operation. The decision we have to make is where the search starts and where it terminates (in case we don’t find Robo). What we want to calculate is the expected profit given such a decision. The following function takes a start
, an end
and a robo_value
, calculates the profit for each sample from Robo’s posterior position s
and returns the expected profit:
expected_profit <- function(start, end, robo_value, s) {
posterior_profit <- sapply(s, function(robo_pos) {
if(robo_pos >= start & robo_pos <= end) {
#that is, we find Robo and terminate the search at robo_pos
covered_ground <- start:robo_pos
robo_value - sum(search_cost[covered_ground])
} else {
# that is, we won't find Robo and terminate the search at end instead,
covered_ground <- start:end
- sum(search_cost[covered_ground])
}
})
mean(posterior_profit)
}
If we evaluate this utility function for (a representative sample of) all possible values for start
and end
, and with robo_value
set to 1000, the maximum expected profit decision is this interval:
Huh? That doesn’t look like an interval… But it is, you’re looking at an empty interval. The best decision is to not search for Robo at all (with an expected profit of \$0) any search operation will result in an expected negative profit (that is, a loss). Good to know! What if Robo was worth more money? Say, \$10,000?
Then we should search a small part of the plains, starting at the 12th mile, for an expected profit of \$835. If Robo was worth \$20,000?
Then we should search most of the plains, still starting from the left, for an expected profit of \$3733. Say, if Robo was really valuable?
Then we want to search the whole strip of land for an expected profit of \$16590. Notice that we should always start at the left, this is because it is relatively cheap to search the plains and our profit will be much higher if we find Robo before we spend to much on the search operation.
So this was just a sample of possible loss/utility functions for a simple toy problem, some better and some worse. I stuck with point decisions and interval decisions, but there is no reason for why you should be limited to single intervals. Perhaps you want launch several search parties, or perhaps you want to update the decision as you get new information. Constructing reasonable loss functions for real world problems can be very challenging, but the point is that Bayesian decision analysis still works in the same way as outlined in this post: (1) Get a posterior, (2) define a loss function, (3) find the decision that minimizes the expected loss. My toy example featured only a one-dimensional posterior, but the procedure would be no different with a multi-dimensional posterior (except for the added difficulty of optimizing a high-dimensional loss function).
Update: João Pedro at Faculdade de Ciências da Universidade de Lisboa has reimplemented the Robo scenario, but in 2D! Check it out, it’s really nice!
Another point I want to get across is that loss/utility functions are models of what’s bad/good and as such they can be made arbitrarily complex and are in some sense never “true”, in the same way as a statistical model of some process is never the “true” model. Or as Hennig and Kutlukaya (2007) puts it:
“There is no objectively best loss function, because the loss function defines what ‘good’ means.”
By the way, did you find Robo?
Hennig, C., & Kutlukaya, M. (2007). Some thoughts about the design of loss functions. REVSTAT–Statistical Journal, 5(1), 19-39. pdf
White, J. M. (2013). Modes, Medians and Means: A Unifying Perspective. link
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer. Amazon link
Chapter 9 in Gelman et al (2013). Bayesian data analysis. CRC press. Amazon link
]]>Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
While not working exactly as Norvig’s version it should result in similar spelling corrections:
correct("piese")
## [1] "piece"
correct("ov")
## [1] "of"
correct("cakke")
## [1] "cake"
So let’s deobfuscate the two-liner slightly (however, the code below might not make sense if you don’t read Norvig’s essay first):
# Read in big.txt, a 6.5 mb collection of different English texts.
raw_text <- paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of different type of words.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common type of words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))
correct <- function(word) {
# Calculate the edit distance between the word and all other words in sorted_words.
edit_dist <- adist(word, sorted_words)
# Calculate the minimum edit distance to find a word that exists in big.txt
# with a limit of two edits.
min_edit_dist <- min(edit_dist, 2)
# Generate a vector with all words with this minimum edit distance.
# Since sorted_words is ordered from most common to least common, the resulting
# vector will have the most common / probable match first.
proposals_by_prob <- c(sorted_words[ edit_dist <= min(edit_dist, 2)])
# In case proposals_by_prob would be empty we append the word to be corrected...
proposals_by_prob <- c(proposals_by_prob, word)
# ... and return the first / most probable word in the vector.
proposals_by_prob[1]
}
Some thoughts:
adist
function. (A one line spell checker in R is indeed possible using the aspell
function :)sorted_words
vector would be a perfect target for some magrittr magic.NWORDS
variable in order to be able to extract the most probable matching word. This is not necessary in the R code, as we already have a sorted vector we know that the first item always will be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).HashMap<Integer, String> candidates = new HashMap<Integer, String>();
.Christmas is soon upon us and here are some gift ideas for your statistically inclined friends (or perhaps for you to put on your own wish list). If you have other suggestions please leave a comment! :)
A recently released game where probability takes the main role is Pairs, an easy going press-your-luck game that can be played in 10 minutes. It uses a custom “triangular” deck of cards (1x1, 2x2, 3x3, …, 10x10) and is a lot of fun to play, highly recommended!
Another good gift would be a pound of assorted dice together with the seminal Dice Games Properly Explained by Reiner Knizia. While perhaps not a game, a cool gift to someone that already has a pound of dice would be a set of Non transitive Grime dice.
A search for statistics and mugs or statistics and t-shirts results in a lot of good gifts, for example this t-test mug:
You could also support your favorite MCMC software by buying a STAN themed mug from their shop or why not come up with a custom layout yourself? (I’ve used Vistaprint before and those mugs turned out decent and cheap.)
R, Python and Julia are great tools that are perhaps becoming a bit too mainstream for the self-conscious data science hipster. Why not then give the joy of some retro calculation? Slide rulers are amazingly cool and while I don’t know if new are made they can be gotten cheap on ebay. The same goes for vintage pocket calculators (make sure to get one where the digits are in bright green or red). There is also the 50s book with the self describing title A Million Random Digits. (Don’t miss the hilarious reviews on Amazon!)
The XKCD web comic by Randall Munroe often touches upon statistical issues and while his recent book What If?: Serious Scientific Answers to Absurd Hypothetical Questions is not statistical per see, it contains heaps of amusing back-of-the-envelope calculations. You can also get signed prints of some of the comics, for example of #231 “Cat Proximity”:
I love comic books teaching statistics (which I’ve written about earlier) and my two favorites are The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith and The Manga Guide to Statistics by Shin Takahashi. Both are great in their own ways and are enjoyable both if you are a statistics padawan or already a master of the dark arts.
NausicaaDistribution sells cool stuff such as a Standard Normal Distribution Plushie, an Evil Cauchy Distribution Plushie and a lot more distributions of different shapes and alignments.
You can also buy a My First Number Sets Wood Puzzle for the budding number theoretician or Famous Statistician Embroidered Coasters to save the sofa table from the eggnog.
Here are some good popular science books that deals with different aspects of statistics and that anybody can enjoy:
Dataclysm: Who We Are (When We Think No One’s Looking) by Christian Rudder. I have not actually ready this (hint hint siblings…) but it’s bound to be good as it is written by the guy behind the OKcupid blog.
The Theory That Would Not Die by Sharon Bertsch McGrayne. The history of Bayes theorem and Bayesian statistics, contains almost no math but is fun and engaging anyway.
The Lady Tasting Tea by David Salsburg gives a more “classical” perspective on the history of statistics.
The Signal and the Noise by Nate Silver. If you know someone who, against all odds, haven’t already read this book then it is a great way to get that someone interested in statistics and data analysis.
Here are some slightly more serious books that I have enjoyed:
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan by John Kruschke. The book that got me started with Bayesian data analysis, a pedagogical masterpiece that recently received a second edition.
Advanced R by Hadley Wickham. A great guide to serious R programming which is also freely available online (but is then slightly more difficult to gift-wrap…).
The Visual Display of Quantitative Information by Edward R. Tufte. This classic makes a great gift, not least because of its almost coffee table book like properties.
]]>
Making a slight digression from last month’s Probable Points and Credible Intervals here is how to summarize a 2D posterior density using a highest density ellipse. This is a straight forward extension of the highest density interval to the situation where you have a two-dimensional posterior (say, represented as a two column matrix of samples
) and you want to visualize what region, containing a given proportion of the probability, that has the most probable parameter combinations. So let’s first have a look at a fictional 2D posterior by using a simple scatter plot:
plot(samples)
Whoa… that’s some serious over-plotting and it’s hard to see what’s going on. Sure, the bulk of the posterior is somewhere in that black hole, but where exactly and how much of it?
A highest posterior density ellipse shows this by covering the area that contains the most probable parameter combinations while containing p% of the posterior probability. Like finding the highest density interval corresponds to finding the shortest interval containing p% of the probability, finding the highest density ellipse corresponds to finding the smallest ellipse containing p% of the probability a.k.a. the minimum volume ellipse. I have spent a lot of time trying to figure out how compute minimum volume ellipses. Wasted time, it turns out, as it can be easily computed using packages that come with R, you just have to know what you are looking for. If you just want the code skip over the next paragraph, if you want to know the tiny bit of detective work I had to do to figure this out, read on.
To find the points in sample
that are included in a minimum volume ellipse covering, say, 75% of the samples you can use cov.mve(samples, quantile.used = nrow(samples) * 0.75)
from the MASS package, here quantile.used
specifies the number of points in samples
that should be inside the ellipse. It uses an approximation algorithm described by Van Aelst, S. and Rousseeuw, P. (2009) that is not guaranteed to find the minimum volume ellipse but that will often be pretty close. A problem is that cov.mve
does not return the actual ellipse, it returns a robustly measured covariance matrix, but that’s not really what we are after. It does return an object that contains the indices of the points that are covered by the minimum volume ellipse, if fit
is the object returned by cov.mve
then these points can be extracted like this: points_in_ellipse <- samples[fit$best, ]
. To find the ellipse we are going to use ellipsoidhull
from the cluster package on the points_in_ellipse
. It returns an object which represents the minimum volume ellipse and by using its predict
function we get a two column matrix with points that lie on the hull of the ellipse and that we can finally plot.
That wasn’t too easy to figure out, but it’s pretty easy to do. The code below plots a 75% minimum volume / highest density ellipse:
library(MASS)
library(cluster)
# Finding the 75% highest density / minimum volume ellipse
fit <- cov.mve(samples, quantile.used = nrow(samples) * 0.75)
points_in_ellipse <- samples[fit$best, ]
ellipse_boundary <- predict(ellipsoidhull(points_in_ellipse))
# Plotting it
plot(samples, col = rgb(0, 0, 0, alpha = 0.2))
lines(ellipse_boundary, col="lightgreen", lwd=3)
legend("topleft", "50%", col = "lightgreen", lty = 1, lwd = 3)
Looking at this new plot we see that for the bulk of the probability mass the parameters are correlated. This correlation was not really visible in the naive scatter plot. If you rerun this code many times you will notice that the ellipse changes position slightly each time. This is due to cov.mve
using an non-exact algorithm. If you have a couple of seconds to spare you can make cov.mve
more exact by setting the parameter nsamp
to a large number, say nsamp = 10000
.
You are, of course, not limited to drawing just outlines and if you want to draw shaded ellipses you can use the polygon
function. The code below draws three shaded highest density ellipses of random color with coverages of 95%, 75% and 50%.
plot(samples, col = rgb(0, 0, 0, alpha = 0.2))
for(coverage in c(0.95, 0.75, 0.5)) {
fit <- cov.mve(samples, quantile.used = nrow(samples) * coverage)
ellipse_boundary <- predict(ellipsoidhull(samples[fit$best, ]))
polygon(ellipse_boundary, col = sample(colors(), 1), border = NA)
}
Looks like modern aRt to me!
The function bellow adds a highest density ellipse to an existing plot created using base graphics:
# Adds a highest density ellipse to an existing plot
# xy: A matrix or data frame with two columns.
# If you have to variables just cbind(x, y) them.
# coverage: The percentage of points the ellipse should cover
# border: The color of the border of the ellipse, NA = no border
# fill: The filling color of the ellipse, NA = no fill
# ... : Passed on to the polygon() function
add_hd_ellipse <- function(xy, coverage, border = "blue", fill = NA, ...) {
library(MASS)
library(cluster)
fit <- cov.mve(xy, quantile.used = round(nrow(xy) * coverage))
points_in_ellipse <- xy[fit$best, ]
ellipse_boundary <- predict(ellipsoidhull(points_in_ellipse))
polygon(ellipse_boundary, border=border, col = fill, ...)
}
So to replicate the above plot with the 75% highest density ellipse you could now write:
plot(samples)
add_hd_ellipse(samples, coverage = 0.75, border = "lightgreen", lwd=3)
Obviously, a highest density ellipse is only going to work well if the posterior is roughly elliptical. If this is not the case, an alternative is to use a 2D kernel density estimator on the samples
and trace out the coverage boundaries. The function HPDregionplot
in the emdbook package does exactly this:
library(emdbook)
plot(samples, col=rgb(0, 0, 0, alpha = 0.2))
HPDregionplot(samples, prob = c(0.95, 0.75, 0.5), col=c("salmon", "lightblue", "lightgreen"), lwd=3, add=TRUE)
legend("topleft", legend = c("95%", "75%", "50%"), col = c("salmon", "lightblue", "lightgreen"), lty=c(1,1,1), lwd=c(3,3,3))
You could also plot a 2d histogram of the samples
, for example, using the hexagon plot in ggplot2:
qplot(samples[,1], samples[,2], geom=c("hex"))
However you would have to work a bit with the color scheme if you wanted the colors to correspond to a given coverage.
Finally, if you plot a 2D density it could also be useful to add marginal density plots, as is done in the default plot for the Bayesian First Aid alternative to the correlation test. Here with completely fictional data on the number of shotguns and the number of zombie attacks per state in the U.S:
library(BayesianFirstAid)
fit <- bayes.cor.test(no_zombie_attacks, no_shotguns_per_1000_persons)
plot(fit)
Van Aelst, S. and Rousseeuw, P. (2009), Minimum volume ellipsoid. Wiley Interdisciplinary Reviews: Computational Statistics, 1: 71–82. Doi: 10.1002/wics.19, link to the paper (unfortunately behind paywall)
]]>