So in the last post I showed how to run the Bayesian counterpart of Pearson’s correlation test by estimating the parameters of a bivariate normal distribution. A problem with assuming normality is that the normal distribution isn’t robust against outliers. Let’s see what happens if we take the data from the last post with the finishing times and weights of the runners in the men’s 100 m semi-finals in the 2013 World Championships in Athletics and introduce an outlier. This is how the original data looks:
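To keep the code below self-contained, here is a stand-in data frame of the same shape as the original data. The numbers are hypothetical placeholders, not the actual semi-final times and weights:

```r
# Hypothetical stand-in data (NOT the actual 2013 semi-final results):
# finishing time in seconds and weight in kg, one row per runner.
d <- data.frame(
  time   = c(10.01, 9.94, 10.12, 9.99, 10.05, 9.92, 10.08, 10.15),
  weight = c(75, 80, 72, 78, 82, 86, 74, 70)
)

# The outlier version: a mutant-cheeseburger Bolt, very heavy and very fast.
d_outlier <- rbind(d, data.frame(time = 9.5, weight = 115))
```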
What if Usain Bolt had eaten mutant cheeseburgers, gaining both weight and the ability to run faster?
Usain Bolt now looks like a real outlier, and he is one: he's not representative of the majority of runners, who do not eat mutant cheeseburgers and who we would like to make inferences about. How does this single outlier influence the estimated correlation between finishing time and weight? Let's rerun the correlation analysis from the last post and compare:
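The last post's analysis modeled the data as bivariate normal. A sketch of what that JAGS model could look like (variable names and the exact priors are my assumptions, not necessarily the post's original code):

```r
# A sketch of the bivariate normal correlation model from the last post.
# Variable names and prior bounds are assumptions, not the original code.
model_string <- "
model {
  for(i in 1:n) {
    x[i, 1:2] ~ dmnorm(mu[], prec[ , ])
  }
  # JAGS parameterizes dmnorm with a precision matrix,
  # so build the covariance matrix and invert it.
  prec[1:2, 1:2] <- inverse(cov[ , ])
  cov[1, 1] <- sigma[1] * sigma[1]
  cov[1, 2] <- sigma[1] * sigma[2] * rho
  cov[2, 1] <- sigma[1] * sigma[2] * rho
  cov[2, 2] <- sigma[2] * sigma[2]
  # Priors on the correlation, scales, and means.
  rho ~ dunif(-1, 1)
  sigma[1] ~ dunif(0, 100)
  sigma[2] ~ dunif(0, 100)
  mu[1] ~ dnorm(0, 0.001)
  mu[2] ~ dnorm(0, 0.001)
}"
```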
The estimated distribution of the data indicates that there is now a strong negative correlation between finishing time and weight! Looking at the posterior distribution of the correlation shows that it is estimated to be -0.52 (95% CI: [-0.78, -0.11]), very different from the estimated correlation for the sans-outlier data, which was -0.10 (95% CI: [-0.51, 0.35]). I think it's easy to argue that the influence of this one single outlier is not reasonable.
What you actually often want to assume is that the bulk of the data is normally distributed, while still allowing for the occasional outlier. One of the great benefits of Bayesian statistics is that it is easy to assume any parametric distribution for the data, and a good distribution when we want robustness is the t-distribution. This distribution is similar to the normal distribution except that it has an extra parameter $\nu$ (also called the degrees of freedom) that governs how heavy the tails of the distribution are. The smaller $\nu$ is, the heavier the tails and the less the estimated mean and standard deviation are influenced by outliers. When $\nu$ is large (say $\nu > 30$) the t-distribution is very similar to the normal distribution. So in order to make our Bayesian correlation analysis more robust we'll replace the multivariate normal distribution (dmnorm) from the last post with a multivariate t-distribution (dmt). To that we'll add a prior on the $\nu$ parameter (nu in the code) that allows both for completely normally distributed data and for normally distributed data with the occasional outlier (I'm using the same prior as in Kruschke's Bayesian estimation supersedes the t test).
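The robust model might then look something like the following sketch. The structure follows the description above (a dmt likelihood plus Kruschke's shifted-exponential prior on nu), but the variable names and the priors on mu and sigma are assumptions, not necessarily the post's exact code:

```r
# A sketch of the robust model: same structure as the normal model,
# but with a multivariate t likelihood and a prior on nu.
robust_model_string <- "
model {
  for(i in 1:n) {
    x[i, 1:2] ~ dmt(mu[], prec[ , ], nu)
  }
  prec[1:2, 1:2] <- inverse(cov[ , ])
  cov[1, 1] <- sigma[1] * sigma[1]
  cov[1, 2] <- sigma[1] * sigma[2] * rho
  cov[2, 1] <- sigma[1] * sigma[2] * rho
  cov[2, 2] <- sigma[2] * sigma[2]
  rho ~ dunif(-1, 1)
  sigma[1] ~ dunif(0, 100)
  sigma[2] ~ dunif(0, 100)
  mu[1] ~ dnorm(0, 0.001)
  mu[2] ~ dnorm(0, 0.001)
  # Kruschke-style prior on the degrees of freedom: a shifted
  # exponential with mean 30 that puts mass both on heavy tails
  # (small nu) and on effectively normal tails (large nu).
  nuMinusOne ~ dexp(1/29)
  nu <- nuMinusOne + 1
}"
```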
Update: Note that while the priors defined in robust_model_string are relatively uninformative for my data, they are not necessarily uninformative in general. For example, the priors above consider a value of mu as large as 10000 extremely improbable and a value of sigma larger than 1000 impossible. More vague priors would for example be:
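A sketch of what such vaguer priors could look like inside the JAGS model string (the exact bounds are assumptions and should be adapted to the scale of your data):

```r
# Vaguer priors (hypothetical bounds, not the post's original code):
# mu[1] ~ dnorm(0, 0.000000001)   # i.e. a standard deviation of about 30000
# mu[2] ~ dnorm(0, 0.000000001)
# sigma[1] ~ dunif(0, 100000)
# sigma[2] ~ dunif(0, 100000)
```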
If you decide to use the model above, it is advisable that you think about what could be reasonable priors. End of update
Now let’s run the model using JAGS…
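A sketch of running the model, assuming the rjags package, a two-column matrix x holding finishing time and weight, and the robust_model_string defined above:

```r
library(rjags)

# x: a matrix with one row per runner, columns finishing time and weight.
m <- jags.model(textConnection(robust_model_string),
                data = list(x = x, n = nrow(x)),
                n.chains = 3, n.adapt = 1000)
update(m, 1000)  # burn-in
samples <- coda.samples(m,
                        variable.names = c("mu", "rho", "sigma", "nu"),
                        n.iter = 5000)

plot(samples)     # trace plots and density plots
summary(samples)  # parameter estimates
```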
Here are the resulting trace plots and density plots…
… and here is a summary of the parameter estimates:
Using our robust model we get a correlation of -0.28 (95% CI: [-0.69, 0.24]), which is much closer to our initial -0.10 (95% CI: [-0.51, 0.35]). Looking at the estimate for the nu parameter we see that there is evidence for non-normality, as the median nu is quite small: 7.5 (95% CI: [2.4, 61.5]).
What about data with no outliers?
So now we have two Bayesian versions of Pearson's correlation test, one normal and one robust. Do we always have to choose which of these two models to use? No! I'd go for the robust version any day: it also estimates the heaviness of the tails of the bivariate t-distribution, and if there is sufficient evidence in the data for normality, the estimated t-distribution will be very close to a normal distribution. We can have our cake and eat it too!
To show that this works I will now apply the robust model to the same data that I used in the last post which are 30 random draws from a bivariate normal distribution.
Running both the standard normal model and the robust t-distribution model on this data results in very similar estimates of the correlation:
Looking at the estimate of nu we see that it is quite high, which is what we would expect since the data is normal.