Publishable Stuff

Rasmus Bååth's Research Blog

Bayesian Estimation of Correlation - Now Robust!

So in the last post I showed how to run the Bayesian counterpart of Pearson’s correlation test by estimating the parameters of a bivariate normal distribution. A problem with assuming normality is that the normal distribution isn’t robust against outliers. Let’s see what happens if we take the data from the last post with the finishing times and weights of the runners in the men’s 100 m semi-finals in the 2013 World Championships in Athletics and introduce an outlier. This is how the original data looks:

d
##                    runner  time weight
## 1              Usain Bolt  9.92     94
## 2           Justin Gatlin  9.94     79
## 3            Nesta Carter  9.97     78
## 4       Kemar Bailey-Cole  9.93     83
## 5         Nickel Ashmeade  9.90     77
## 6            Mike Rodgers  9.93     76
## 7     Christophe Lemaitre 10.00     74
## 8           James Dasaolu  9.97     87
## 9           Zhang Peimeng 10.00     86
## 10           Jimmy Vicaut 10.01     83
## 11         Keston Bledman 10.08     75
## 12       Churandy Martina 10.09     74
## 13         Dwain Chambers 10.15     92
## 14           Jason Rogers 10.15     69
## 15          Antoine Adams 10.17     79
## 16        Anaso Jobodwana 10.17     71
## 17       Richard Thompson 10.19     80
## 18          Gavin Smellie 10.30     80
## 19          Ramon Gittens 10.31     77
## 20 Harry Aikines-Aryeetey 10.34     87

What if Usain Bolt had eaten mutant cheeseburgers, gaining both weight and the ability to run faster?

d[d$runner == "Usain Bolt", c("time", "weight")] <- c(9.5, 115)
plot(d$time, d$weight)

Scatter plot of the runners’ times and weights, now with an outlier.
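The rest of this post swaps the bivariate normal for a bivariate t-distribution, which has thicker tails and is therefore less influenced by points like this one. Here is a minimal JAGS sketch of that idea (my reconstruction rather than the post’s verbatim code; the dexp(1/29) prior on the degrees of freedom follows Kruschke’s BEST model):

library(rjags)

robust_model_string <- "model {
  for(i in 1:n) {
    # A bivariate t instead of a bivariate normal: nu is estimated from the data.
    xy[i, 1:2] ~ dmt(mu[], prec[ , ], nu)
  }
  # Build the precision matrix from the standard deviations and rho.
  prec[1:2, 1:2] <- inverse(cov[ , ])
  cov[1, 1] <- sigma[1] * sigma[1]
  cov[1, 2] <- sigma[1] * sigma[2] * rho
  cov[2, 1] <- sigma[1] * sigma[2] * rho
  cov[2, 2] <- sigma[2] * sigma[2]
  # Wide priors on the correlation, the means and the standard deviations.
  rho ~ dunif(-1, 1)
  mu[1] ~ dnorm(0, 0.0001)
  mu[2] ~ dnorm(0, 0.0001)
  sigma[1] ~ dunif(0, 1000)
  sigma[2] ~ dunif(0, 1000)
  nu <- nu_minus_one + 1    # shifted so that nu >= 1
  nu_minus_one ~ dexp(1/29) # Kruschke's prior on the degrees of freedom
}"

m <- jags.model(textConnection(robust_model_string),
                data = list(xy = cbind(d$time, d$weight), n = nrow(d)),
                n.chains = 3, n.adapt = 1000)
update(m, 1000)
samples <- coda.samples(m, c("rho", "mu", "sigma", "nu"), n.iter = 5000)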

More Bayesian Football Models

I recently had a lot of fun coding up a Bayesian Poisson model of football scores in the Spanish La Liga. I describe the model, implemented using R and JAGS, in these posts: part 1, part 2, part 3. Turns out that Gianluca Baio and Marta Blangiardo had already developed a similar Bayesian model with some extra bells and whistles. For example, they modeled both the defense skill and the offense skill of each team, while I used a single overall skill for each team. They have a paper on their model, which I really enjoyed reading; it can be found here. The model is further discussed on Gianluca Baio’s blog (as I found out after reading about Baio’s blog on statblogs.com). Interesting stuff!

The abstract of Baio and Blangiardo’s paper.

The Bayesian Counterpart of Pearson’s Correlation Test

Except for maybe the t test, a contender for the title “most used and abused statistical test” is Pearson’s correlation test. Whenever someone wants to check if two variables relate somehow it is a safe bet (at least in psychology) that the first thing to be tested is the strength of a Pearson’s correlation. Only if that doesn’t work is a more sophisticated analysis attempted (“That p-value is still too big, maybe a mixed models logistic regression will make it smaller…”). One reason for this is that Pearson’s correlation test is conceptually quite simple and has assumptions that make it applicable in many situations (but it is definitely also used in many situations where those assumptions are violated).

Since I’ve converted to “Bayesianism” I’ve been trying to figure out what Bayesian analyses correspond to the classical analyses. For t tests, chi-square tests and ANOVAs I’ve found Bayesian versions that, at least conceptually, test the same thing. Here are links to Bayesian versions of the t test, a chi-square test and an ANOVA, if you’re interested. But for some reason I’ve never encountered a discussion of what a Pearson’s correlation test would correspond to in a Bayesian context. Maybe this is because regression modeling can often fill the same role as correlation testing (quantifying relations between continuous variables), or perhaps I’ve been looking in the wrong places.

The aim of this post is to explain how one can run the Bayesian counterpart of Pearson’s correlation test using R and JAGS. The model that a classical Pearson’s correlation test assumes is that the data follow a bivariate normal distribution. That is, if we have a list of pairs of data points $(x_i, y_i)$, then the $x_i$ and the $y_i$ are each assumed to be normally distributed with a possible linear dependency between them. This dependency is quantified by the correlation parameter $\rho$, which is what we want to estimate in a correlation analysis. A good visualization of a bivariate normal distribution with $\rho = 0.3$ can be found on the Wikipedia page on the multivariate normal distribution:

A bivariate normal distribution
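Written out, the model is (my notation, with $\mu_x, \mu_y$ and $\sigma_x, \sigma_y$ denoting the means and standard deviations of the two variables):

$$\begin{pmatrix} x_i \\ y_i \end{pmatrix} \sim \text{Normal}\!\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},\ \begin{pmatrix} \sigma_x^2 & \rho\,\sigma_x\sigma_y \\ \rho\,\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right)$$

The off-diagonal element $\rho\,\sigma_x\sigma_y$ is the covariance, so $\rho$ is what remains of the linear dependency once the scales of the two variables are divided out.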

Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part three.

In part one and part two of Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model I developed a model for the number of goals in football matches from five seasons of La Liga, the premier Spanish football league. I’m now reasonably happy with the model and want to use it to rank the teams in La Liga and to predict the outcome of future matches!

Ranking the Teams of La Liga

We’ll start by ranking the teams of La Liga using the estimated skill parameters from the 2012/2013 season. The values of the skill parameters are difficult to interpret as they are relative to the skill of the team that had its skill parameter “anchored” at zero. To put them on a more interpretable scale I’ll first zero-center the skill parameters by subtracting the mean skill of all teams; I then add the home baseline and exponentiate the resulting values. These rescaled skill parameters are now on the scale of the expected number of goals when playing as the home team. Below is a caterpillar plot of the median of the rescaled skill parameters together with the 68% and 95% credible intervals. The plot is ordered according to the median skill and thus also gives the ranking of the teams.

# The ranking of the teams for the 2012/13 season.
team_skill <- ms3[, str_detect(string = colnames(ms3), "skill\\[5,")]
team_skill <- (team_skill - rowMeans(team_skill)) + ms3[, "home_baseline[5]"]
team_skill <- exp(team_skill)
colnames(team_skill) <- teams
team_skill <- team_skill[, order(colMeans(team_skill), decreasing = T)]
par(mar = c(2, 0.7, 0.7, 0.7), xaxs = "i")
caterplot(team_skill, labels.loc = "above", val.lim = c(0.7, 3.8))

Caterpillar plot of the rescaled team skill parameters, ordered by median skill.

Two teams are clearly ahead of the rest, FC Barcelona and Real Madrid CF. Let’s look at the credible difference between the two teams.
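One way to do that, sketched here assuming the two team names appear as columns of team_skill (plotPost is the Kruschke function loaded earlier):

plotPost(team_skill[, "FC Barcelona"] - team_skill[, "Real Madrid CF"],
         compVal = 0, xlab = "FC Barcelona skill - Real Madrid CF skill")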

CogSci 2013 in Berlin was Hot!

One month ago I went to UseR 2013 in Albacete, Spain, which was a great conference featuring 33° C and many interesting presentations. Surprisingly, CogSci 2013 in Berlin, which I attended this week, was even warmer; outside it is currently 36° C! There were also many interesting presentations at CogSci 2013, with some of the highlights being:

  • The symposia on Music Cognition, including a talk by Edward Large on how to model pitch and harmony perception using dynamical systems.
  • A tutorial on Dynamic Field Theory by John Spencer, Gregor Schoener, and Yulia Sandamirskaya. They are working on an introductory book on Dynamic Field Theory which I’m very much looking forward to.
  • Two presentations on how to approximate Bayesian computation using artificial neural networks and sparse distributed memory.
  • Philip Pärnamets presented fascinating evidence that what we look at influences choices in moral decision tasks, and that by manipulating what participants are looking at it is possible to manipulate their moral decisions.

A very diverse list of highlights, but that’s cognitive science… I also presented a poster (+ conference article) together with Erik Lagerstedt and Peter Gärdenfors on an oscillator model of categorical rhythm perception that we’ve developed. Here’s the abstract:

Categorical perception is a well studied phenomenon in, for example, colour perception, phonetics and music. In this article we implement a dynamical systems model of categorical rhythm perception based on the resonance theory of rhythm perception developed by Large (2010). This model is used to simulate the categorical choices of participants in two experiments of Desain and Honing (2003). The model is able to accurately replicate the experimental data. Our results support that resonance theory is a viable model of rhythm perception and they show that by viewing rhythm perception as a dynamical system it is possible to model properties of categorical perception.

Here’s a link to the full conference article and here is the reference to the article:

Bååth, R., Lagerstedt, E. and Gärdenfors, P. (2013) An Oscillator Model of Categorical Rhythm Perception. In Knauff, M., Pauen, M., Sebanz, N., and Wachsmuth, I. (Eds.) Proceedings of the 35th Annual Conference of the Cognitive Science Society (1803-1808). Austin, TX: Cognitive Science Society.

Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part two.

In the last blog post I showed my initial attempt at modeling football results in La Liga using a Bayesian Poisson model, but there was one glaring problem with the model: it did not consider the advantage of being the home team. In this post I will show how to fix this! I will also show a way to deal with the fact that the dataset covers many La Liga seasons and that teams might differ in skill between seasons.

Modeling Match Results: Iteration 2

The only change to the model needed to account for the home advantage is to split the baseline into two components, a home baseline and an away baseline. The following JAGS model implements this change by splitting baseline into home_baseline and away_baseline.

# model 2
m2_string <- "model {
for(i in 1:n_games) {
  HomeGoals[i] ~ dpois(lambda_home[HomeTeam[i],AwayTeam[i]])
  AwayGoals[i] ~ dpois(lambda_away[HomeTeam[i],AwayTeam[i]])
}

for(home_i in 1:n_teams) {
  for(away_i in 1:n_teams) {
    lambda_home[home_i, away_i] <- exp( home_baseline + skill[home_i] - skill[away_i])
    lambda_away[home_i, away_i] <- exp( away_baseline + skill[away_i] - skill[home_i])
  }
}

skill[1] <- 0 # the first team's skill is anchored at zero; other skills are relative to it
for(j in 2:n_teams) {
  skill[j] ~ dnorm(group_skill, group_tau)
}

group_skill ~ dnorm(0, 0.0625)
group_tau <- 1/pow(group_sigma, 2)
group_sigma ~ dunif(0, 3)

home_baseline ~ dnorm(0, 0.0625)
away_baseline ~ dnorm(0, 0.0625)
}"

Another cup of coffee while we run the MCMC simulation…
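For completeness, here is a sketch of how the model can be compiled and sampled with rjags (the contents of the data list are assumptions based on the variables the model string above uses):

m2 <- jags.model(textConnection(m2_string),
                 data = list(HomeGoals = d$HomeGoals, AwayGoals = d$AwayGoals,
                             HomeTeam = d$HomeTeam, AwayTeam = d$AwayTeam,
                             n_teams = length(unique(d$HomeTeam)),
                             n_games = nrow(d)),
                 n.chains = 3, n.adapt = 5000)
update(m2, 5000)  # burn-in
ms2 <- coda.samples(m2, c("home_baseline", "away_baseline", "skill",
                          "group_skill", "group_sigma"), n.iter = 10000)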

Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model: Part one.

Update: This series of posts is now also available as a technical report:
Bååth, R. (2015), Modeling Match Results in Soccer using a Hierarchical Bayesian Poisson Model. LUCS Minor, 18. (ISSN 1104-1609) (pdf)

This is a slightly modified version of my submission to the UseR 2013 Data Analysis Contest, which I had the fortune of winning :) The purpose of the contest was to do something interesting with a dataset consisting of the match results from the last five seasons of La Liga, the premier Spanish football (aka soccer) league. In total there were 1900 rows in the dataset, each with information on which teams played at home and away, what these teams scored, and what season it was. I decided to develop a Bayesian model of the distribution of the end scores. Here we go…

Ok, first I should come clean and admit that I know nothing about football. Sure, I’ve watched Sweden lose to Germany in the World Cup a couple of times, but that’s it. Nevertheless, here I will show an attempt to model the goal outcomes in the La Liga data set provided as part of the UseR 2013 data analysis contest. My goal is not only to model the outcomes of matches in the data set but also to be able to (a) calculate the odds for possible goal outcomes of future matches and (b) produce a credible ranking of the teams. The model I will be developing is a Bayesian hierarchical model where the goal outcomes are assumed to be distributed according to a Poisson distribution. I will focus more on showing all the cool things you can easily calculate in R when you have a fully specified Bayesian model, and less on model comparison and trying to find the model with the highest predictive accuracy (even though I believe my model is pretty good). I really would like to see somebody try to do a similar analysis in SPSS (which most people use in my field, psychology). It would be a pain!

This post assumes some familiarity with Bayesian modeling and Markov chain Monte Carlo. If you’re not into Bayesian statistics you’re missing out on something really great, and a good way to get started is by reading the excellent Doing Bayesian Data Analysis by John Kruschke. The tools I will be using are R (of course) with the model implemented in JAGS, called from R using the rjags package. For plotting the results of the MCMC samples generated by JAGS I’ll use the coda package, the mcmcplots package, and the plotPost function courtesy of John Kruschke. For data manipulation I used the plyr and stringr packages and for general plotting I used ggplot2. This report was written in RStudio using knitr and xtable.

I start by loading libraries, reading in the data and preprocessing it for JAGS. The last 50 matches have unknown outcomes, so I create a new data frame d holding only the matches with known outcomes. I will come back to the unknown outcomes later when it is time to use the model for prediction.
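Here is a sketch of what that preprocessing can look like (the column names are assumptions on my part; the contest dataset may label things differently):

library(rjags)
library(plyr)
library(stringr)

laliga <- read.csv("laliga.csv")
# The last 50 matches have no recorded score; keep only known outcomes.
d <- laliga[!is.na(laliga$HomeGoals), ]
teams <- unique(c(d$HomeTeam, d$AwayTeam))
seasons <- unique(d$Season)
# JAGS wants teams and seasons coded as integer indices, not strings.
data_list <- list(
  HomeGoals = d$HomeGoals,
  AwayGoals = d$AwayGoals,
  HomeTeam  = as.numeric(factor(d$HomeTeam, levels = teams)),
  AwayTeam  = as.numeric(factor(d$AwayTeam, levels = teams)),
  Season    = as.numeric(factor(d$Season, levels = seasons)),
  n_teams   = length(teams),
  n_games   = nrow(d),
  n_seasons = length(seasons))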

useR 2013 was a blast!

I had a great time at useR 2013 in Albacete, Spain. The food was great, the people were fun and the weather was hot. A pleasant surprise was that I won the useR data analysis contest with my submission “Modeling Match Results in La Liga Using a Hierarchical Bayesian Poisson Model.” It was a fun exercise modeling football scores and I might write a post about it later; for now you can download my original submission here if you’re interested.

I also got great feedback on my talk “The State of Naming Conventions in R”. The message I was trying to get across was basically:

  • Naming conventions are important.
  • You’re not totally off if you use lowerCamelCase for function names.
  • You’re totally off if you use UpperCamelCase for function names.

The talk was based on a short article I got into the RJournal and here are the slides:

UseR2013 Presentation Slides

Some more awesomeness from UseR 2013:

  • Rcpp allows for seamless R and C++ Integration and is ready (talk by Romain Francois).
  • Markus Gesmann and Diego de Castillo held a tutorial on the googleVis package. Hans Rosling-esque motion charts FTW!
  • Kimberly Speerschneider gave a presentation on how to visualize Likert scale data using the likert package.
  • Richie Cotton held an interesting talk with the equally interesting title the Secrets of inverse brogramming. Like me he touched on naming conventions, but he also gave a much broader view on how to write beautiful code in R. The essence of his talk can be found in his very sensible R code style guide.
  • The upcoming version of RStudio boasts a graphical debugger and allows you to write presentations using markdown.

Three Ways to Run Bayesian Models in R

There are different ways of specifying and running Bayesian models from within R. Here I will compare three different methods: two that rely on an external program and one that relies only on R. I won’t go into much detail about the differences in syntax; the idea is more to give a gist of how the different modeling languages look and feel. The model I will be implementing assumes a normal distribution with fairly wide priors:
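In notation (my reconstruction from the JAGS code in Method 1 below, where dnorm and dlnorm are parameterized by precision, so 0.0001 corresponds to an SD of 100 and 0.0625 to an SD of 4 on the log scale):

$$y_i \sim \text{Normal}(\mu, \sigma)$$
$$\mu \sim \text{Normal}(0, 100)$$
$$\sigma \sim \text{LogNormal}(0, 4)$$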

Let’s start by generating some normally distributed data to use as example data in the models.

set.seed(1337)
y <- rnorm(n = 20, mean = 10, sd = 5)
mean(y)
## [1] 12.72
sd(y)
## [1] 5.762

Method 1: JAGS

JAGS (Just Another Gibbs Sampler) is a program that accepts a model string written in an R-like syntax and that compiles and generates MCMC samples from this model using Gibbs sampling. As in BUGS, the program that inspired JAGS, the exact sampling procedure is chosen by an expert system depending on how your model looks. JAGS is conveniently called from R using the rjags package, and John Myles White has written a nice introduction to using JAGS and rjags. I’ve seen it recommended to put the JAGS model specification into a separate file, but I find it more convenient to keep everything together in one .R file, using the textConnection function to avoid having to save the model string separately. Now, the following R code specifies and runs the model above for 20000 MCMC samples:

library(rjags)

# The model specification
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 0.0001)
  sigma ~ dlnorm(0, 0.0625)
  tau <- 1 / pow(sigma, 2)
}"

# Running the model
model <- jags.model(textConnection(model_string), data = list(y = y), n.chains = 3, n.adapt= 10000)
update(model, 10000) # Burn-in for 10000 samples
mcmc_samples <- coda.samples(model, variable.names=c("mu", "sigma"), n.iter=20000)

Bayesian estimation supersedes the t-test - Online

The t-test is probably the most used and abused statistical test in research [Citation Needed]. The t-test is used when you have two groups of measures and you want to look at the difference between the groups. The original t-test assumes that the measures are approximately normally distributed and that the two groups have equal variance. While this last assumption is relaxed by Welch’s t-test, a large problem remains: The t-test isn’t Bayesian :)

John Kruschke has devised a Bayesian model that can be used as a drop-in replacement for the classical t-test and that he somewhat boastfully dubbed BEST (Bayesian EStimation Supersedes the T-test). Instead of uninformative p-values it gives you the actual probability that the two groups are different (given the model), isn’t that nice? BEST even adds some more bells and whistles: the data (not the sampling distribution as in the original t-test) is assumed to come from a t-distribution with unknown degrees of freedom, and as the t-distribution has thicker tails than the normal distribution this helps reduce the influence of outliers in the data. BEST also estimates many more parameters than just the difference between the groups, for example, the difference in variance and the effect size. The full model used by BEST is given in this diagram:

A diagram of the BEST model.
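To give a feel for the core of the model, here is a minimal JAGS sketch of the BEST likelihood (a simplification of Kruschke’s full model, not his exact code; mean_y and sd_y stand for the pooled sample mean and SD passed in as data):

best_core_string <- "model {
  # Both groups are t-distributed with a shared, estimated degrees of freedom nu.
  for(i in 1:length(y1)) { y1[i] ~ dt(mu[1], tau[1], nu) }
  for(i in 1:length(y2)) { y2[i] ~ dt(mu[2], tau[2], nu) }
  nu <- nu_minus_one + 1    # shifted so that nu >= 1
  nu_minus_one ~ dexp(1/29) # Kruschke's prior on the degrees of freedom
  for(j in 1:2) {
    mu[j] ~ dnorm(mean_y, 0.000001 / pow(sd_y, 2)) # very wide prior on the means
    tau[j] <- 1 / pow(sigma[j], 2)
    sigma[j] ~ dunif(sd_y / 1000, sd_y * 1000)     # broad prior on the SDs
  }
  mu_diff <- mu[1] - mu[2]  # the group difference of main interest
}"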

So what is my contribution to all this? Well, I’ve made an online version of BEST that lets you do Bayesian statistics in your browser! I was mainly interested in how well JavaScript would handle the relatively heavy Markov chain Monte Carlo simulations necessary to run many Bayesian models, and BEST seemed like a good target. Turns out it works really well, JavaScript is pretty fast and with the help of the Flot library it was easy to create nice looking plots of the analysis. Try it out at http://sumsar.net/best_online/!