The Bayesian Counterpart of Pearson's Correlation Test

Except for maybe the t test, a contender for the title “most used and abused statistical test” is Pearson’s correlation test. Whenever someone wants to check if two variables relate somehow it is a safe bet (at least in psychology) that the first thing to be tested is the strength of a Pearson’s correlation. Only if that doesn’t work a more sophisticated analysis is attempted (“That p-value is still to big, maybe a mixed models logistic regression will make it smaller…”). One reason for this is that the Pearson’s correlation test is conceptually quite simple and have assumption that makes it applicable in many situations (but it is definitely also used in many situations where underlying assumption are violated).

Since I’ve converted to “Bayesianism” I’ve been trying to figure out what Bayesian analyses correspond to the classical analyses. For t tests, chi-square tests and Anovas I’ve found Bayesian versions that, at least conceptually, are testing the same thing. Here are links to Bayesian versions of the t test, a chi-square test and an Anova, if you’re interested. But for some reason I’ve never encountered a discussion of what a Pearson’s correlation test would correspond to in a Bayesian context. Maybe this is because regression modeling often can fill the same roll as correlation testing (quantifying relations between continuous variables) or perhaps I’ve been looking in the wrong places.

The aim of this post is to explain how one can run the Bayesian counterpart of Pearson’s correlation test using R and JAGS. The model that a classical Pearson’s correlation test assumes is that the data follows a bivariate normal distribution. That is, if we have a list $x$ of pairs of data points $[[x_{1,1},x_{1,2}],[x_{2,1},x_{2,2}],[x_{3,1},x_{3,2}],...]$ then the $x_{i,1} \text{s}$ and the $x_{i,2} \text{s}$ are each assumed to be normally distributed with a possible linear dependency between them. This dependency is quantified by the correlation parameter $\rho$ which is what we want to estimate in a correlation analysis. A good visualization of a bivariate normal distribution with $\rho = 0.3$ can be found on the the wikipedia page on the multivariate normal distribution :

We will assume the same model and our Bayesian correlation analysis then reduces to estimating the parameters of a bivariate normal distribution given some data. One problem is that the bivariate normal distribution and, more general, the multivariate normal distribution isn’t parameterized using $\rho$, that is, we cannot estimate $\rho$ directly. The bivariate normal is parameterized using $\mu_1$ and $\mu_2$ the means of the two marginal distributions (the red and blue normal distribution in the graph above) and a covariance matrix $\Sigma$ which defines $\sigma_1^2$ and $\sigma_2^2$, the variances of the two marginal distributions, and the covariance, how much the marginal distributions vary together. So the covariance is another measure of how much two variables vary together and the covariance corresponding to a correlation of $\rho$ can be calculated as $\rho \cdot \sigma_1 \cdot \sigma_2$. So here is the model we want to estimate:

Implementation

So, how to implement this model? I’m going to do it with R and the JAGS sampler interfaced with R using the rjags package. First I’m going to simulate some data with a correlation of 0.7 to test the model with.

The simulated data looks quite correlated and a classical Pearson’s correlation test confirms this:

The JAGS model below implements the bivariate normal model described above. One difference is that JAGS parameterizes the normal and multivariate normal distributions with precision instead of standard deviation or variance. The precision is the inverse of the variance so in order to use a covariance matrix as a parameter to dmnorm we have to inverse it first using the inverse function.

Update: Note that while the priors defined in model_string are relatively uninformative for my data, they are not necessary uninformative in general. For example, the priors above considers a value of mu[1] as large as 10000 extremely improbable and a value of sigma[1] above 1000 impossible. An example of more vague priors would for example be:

If you decide to use the model above, it is advisable that you think about what could be reasonable priors. End of update

An extra feature is that the model above generates random samples (x_rand) from the estimated bivariate normal distribution. These samples can be compared to the actual data in order to get a sense of how well the model fits the data. Now let’s use JAGS to sample from this model. I’m using the textConnection trick (described here) in order to run the model without having to first save the model string to a file.

Now let’s plot trace plots and density plots of the MCMC parameter estimates:

The trace plots look sufficiently furry and stationary and looking at the density plots it looks like the model have captured the “true” parameters well (those used when we created the data). Looking at point estimates and credible intervals confirms this:

The median of the rho parameter is -0.76 (95% CI: [-0.88, -0.55]) close to the true parameter value -0.7. Even if this is not that strange, we did use a bivariate normal distribution both to generate and model the data, it is nice when things work out :) Now let’s calculate the probability that there actually is a negative correlation.

Given the model and the data it seems like the probability that there is a negative correlation is 100%. Accounting for the MCMC error I think it is fair to downgrade this probability to at least 99.9%. We can also use the random samples from the bivariate normal distribution estimated by the model to compare how well the model fits the data. The following code plots the original data with two superimposed ellipses where the inner ellipse covers 50% of the density of the distribution and the outer ellipse covers 95%.

Quite a good fit!

Update: If you are more of a Python person, check out Philipp Singer’s Bayesian Correlation with PyMC, which implements the model above in Python.

Analysis of Some “Real” Data

So let’s use this model on some real data. The data.frame below contains the names, weights in kg and finishing times for all participants of the men’s 100 m semi-finals in the 2013 World Championships in Athletics. Well, those I could find the weights of anyway…

So, I know nothing about running (and I’m not sure this is a very representative data set…) but my hypothesis is that there should be a positive correlation between weight and finishing time. That is, the more you weigh the slower you run. Sounds like logic, right? Let’s look at the data…

At a first glance it seems like my hypothesis is not supported by the data. I wonder what our model has to say about that?

Seems like there is no support for my hypothesis, the posterior distribution of rho is centered around zero and, if anything, there might be a tiny negative correlation. So your weight doesn’t seem to influence how fast you run (if you are a runner in the 100 m semi finals, at least).