# An Animation of the t Distribution as a Mixture of Normals

You’ve probably heard about the t distribution. One good use for this distribution is as an alternative to the normal distribution that is more robust against outliers. But where does the t distribution come from? One intuitive characterization of the t is as a mixture of normal distributions. More specifically, as a mixture of an infinite number of normal distributions with a common mean $\mu$ but with precisions (the inverse of the variance) that are randomly distributed according to a gamma distribution. If you have a hard time picturing an infinite number of normal distributions you could also think of a t distribution as a normal distribution with a standard deviation that “jumps around”.

Using this characterization of the t distribution we could generate random samples $y$ from a t distribution with a mean $\mu$, a scale $s$ and a degrees of freedom $\nu$ as:

This brings me to the actual purpose of this post, to show off a nifty visualization of how the t can be seen as a mixture of normals. The animation below was created by drawing 6000 samples of $1/\sigma^2$ from a $\text{Gamma}(\text{shape}= 2 / 2, \text{rate} = 3^2 \cdot 2 / 2)$ distribution and using these to construct 6000 normal distribution with $\mu=0$. Drawing a sample from each of these distributions should then be the same as sampling from a $\text{t}(\mu=0,s=3,\nu=2)$ distribution. But is it? Look for yourself:

# Shaping up Laplace Approximation using Importance Sampling

In the last post I showed how to use Laplace approximation to quickly (but dirtily) approximate the posterior distribution of a Bayesian model coded in R. This is just a short follow up where I show how to use importance sampling as an easy method to shape up the Laplace approximation in order to approximate the true posterior much better. But first, what is importance sampling?

## Importance Sampling

Importance sampling is a method that can be used to estimate the integral of a function even though we only have access to the density of that function at any given point. Using a further resampling step, this method can be used to produce samples that are (approximately) distributed according to the density of the given function. In Bayesian statistics this is what we often want to do with a function that describes the density of the posterior distribution. In the case when there is more than one free parameter the posterior will be multidimensional and there is nothing stopping importance sampling from working in many dimensions. Below I will only visualize the one dimensional case, however.

The first step in the importance sampling scheme is to define a proposal distribution that we believe covers most of our target distribution, the distribution that we actually want to estimate. Ideally we would want to choose a proposal distribution that is the same as the target distribution, but if actually could do this we would already be able to sample from the target distribution and importance sampling would be unnecessary. Instead we can make an educated guess. A sloppy guess is to use a uniform proposal distribution that we believe covers most of the density of the target distribution. For example, assuming the target distribution has most of its density in $[-4,4]$ we could setup the importance sampling as:

Here, the target distribution is actually a $\text{Normal}(\mu=1,\sigma=1)$ distribution, but most often we do not know its shape.

# Easy Laplace Approximation of Bayesian Models in R

Thank you for tuning in! In this post, a continuation of Three Ways to Run Bayesian Models in R, I will:

1. Handwave an explanation of the Laplace Approximation, a fast and (hopefully not too) dirty method to approximate the posterior of a Bayesian model.
2. Show that it is super easy to do Laplace approximation in R, basically four lines of code.
3. Put together a small function that makes it even easier, if you just want this, scroll down to the bottom of the post.

But first a picture of the man himself:

# DIY Kruschke Style Diagrams

I argued recently that a good way of communicating statistical models graphically was by using the convention devised by John K. Kruschke in his book Doing Bayesian Data Analysis. John Kruschke describes these diagrams in more detail on his blog: here, here and here. While I believe these kinds of diagrams are great in many ways there is a problem in that they are quite tricky to make. That is, until now!

I have put together a template for the free and open source LibreOffice Draw which makes it simple to construct Kruschke style diagrams such as the one below:

# How Do You Write Your Model Definitions?

I’m often irritated by that when a statistical method is explained, such as linear regression, it is often characterized by how it can be calculated rather than by what model is assumed and fitted. A typical example of this is that linear regression is often described as a method that uses ordinary least squares to calculate the best fitting line to some data (for example, see here, here and here).

From my perspective ordinary least squares is better seen as the computational method used to fit a linear model, assuming normally distributed residuals, rather than being what you actually do when fitting a linear regression. That is, ordinary least squares is one of many computational methods, such as gradient descent or simulated annealing, used to find the maximum likelihood estimates of linear models with normal residuals. By characterizing linear regression as a calculation to minimize the squared errors you hide the fact that linear regression actually assumes a model that is simple and easy to understand. I wish it was more common to write out the full probability model that is assumed when describing a statistical procedure as it would make it easier to quickly grasp the assumptions of the procedure and facilitate comparison between different procedures.

But how to write out the full probability model? In this post I’m going to show two different textual conventions and two different graphical conventions for communicating probability models. (Warning, personal opinions ahead!)

# A Bayesian Twist on Tukey’s Flogs

In the last post I described flogs, a useful transform on proportions data introduced by John Tukey in his Exploratory Data Analysis. Flogging a proportion (such as, two out of three computers were Macs) consisted of two steps: first we “started” the proportion by adding 1/6 to each of the counts and then we “folded” it using what was basically a rescaled log odds transform. So for the proportion two out of three computers were Macs we would first add 1/6 to both the Mac and non-Mac counts resulting in the “started” proportion (2 + 1/6) / (2 + 1/6 + 1 + 1/6) = 0.65. Then we would take this proportion and transform it to the log odds scale.

The last log odds transform is fine, I’ll buy that, but what are we really doing when we are “starting” the proportions? And why start them by the “magic” number 1/3? Maybe John Tukey has the answer? From page 496 in Tukey’s EDA:

The desirability of treating “none seen below” as something other than zero is less clear, but also important. Here practice has varied, and a number of different usages exist, some with excuses for being and others without. The one we recommend does have an excuse but since this excuse (i) is indirect and (ii) involves more sophisticated considerations, we shall say no more about it. What we recommend is adding 1/6 to all […] counts, thus “starting” them.

Not particularly helpful… I don’t know what John Tukey’s original reason was (If you know, please tell me in the comments bellow!) but yesterday I figured out a reason that I’m happy with. Turns out that starting proportions by adding 1/6 to the counts is an approximation to the median of the posterior probability of the $\theta$ parameter of a Bernouli distribution when using Jeffrey’s prior on $\theta$. I’ll show you in a minute why this is the case but first I just want to point out that intuitively this is roughly what we would want to get when “starting” proportions.

# Going to Plot Some Proportions? Why not Flog ‘em First?

Fractions and proportions can be difficult to plot nicely for a number of reasons:

• If the proportions are based on small counts (e.g., two of his three computing devices were Apple products) then the calculated proportions will only take on a number of discrete values.
• Depending on what you have measured there might be many proportions close to the 0 % and 100 % edges of the scale (Five of his five computing devices were non-Apple products).
• There is no difference made between a proportion that is the result of few counts (33 %, one out of three, were Apple products) and large counts (33 %, three out of nine, were Apple products).

Especially irritating is that the first two of these problems results in overplotting that makes it hard to see what is going on in the data. So how to plot proportions without running into these problems? John Tukey to the rescue!

# RPPW 2013 was Groovy!

Just came home from the 14th Rhythm Perception and Production Workshop, this time taking place at the University of Birmingham, UK. Like the last time in Leipzig there were lots of great talks and much food for thought. Some highlights, from my personal perspective:

• Bayesian perception of isochronous rhythms presented by Darren Rhodes and Massimiliano Di Luca. Interesting idea and nice to see a fellow Bayesian.
• Slowness in music performance and perception: An analysis of timing in Feldman’s “Last Pieces” presented by Dirk Moelants. Fun talk that dealt with how “slow” was interpred by performers of Feldman’s “Last Pieces”. Here is a not too slow rendition of some “Last Pieces” on youtube.
• Parameter estimation of sensorimotor synchronization models presented by Nori Jacoby, Peter Keller and Bruno Repp. Technical and interesting talk and I believe they have some papers out on this already.
• Two interesting talk on resonance models and EEG: Scrutinizing subjective rhythmization: A combined ERP/oscillatory approach by Christian Obermeier and Sonja A. Kotz, and From Sounds to Movement and Back: How Movement Shapes Neural Representation of Beat and Meter presented by Sylvie Nozaradan, Isabelle Peretz, André Mouraux

# SPSS looked great! 20 years ago…

For some reason someone dropped a pamphlet advertising SPSS for Windows 3.0 in my mail box at work. This means that the pamphlet, and the advertised version of SPSS, should be at least 20 years old! These days I’m happily using R for everything but if I was going to estimate any models 20 years ago SPSS actually looked quite OK. In the early 90s my interests in computing were more related to making Italian plumbers rescue princesses than estimating regression models.

The pamphlet is quite fun to read, there are many things in it that feels really out of date and which shows how much have happened in computing in the last 20 years. Let me show you some of the highlights!

Here is a picture from the pamphlet of an SPSS session in action:

Actually it still looks very similar to the SPSS I saw my colleague use yesterday, so maybe not that much have changed in 20 years… Well let’s look at the hardware requirements:

# Bayesian Estimation of Correlation - Now Robust!

So in the last post I showed how to run the Bayesian counterpart of Pearson’s correlation test by estimating the parameters of a bivariate normal distribution. A problem with assuming normality is that the normal distribution isn’t robust against outliers. Let’s see what happens if we take the data from the last post with the finishing times and weights of the runners in the men’s 100 m semi-finals in the 2013 World Championships in Athletics and introduce an outlier. This is how the original data looks:

What if Usain Bolt had eaten mutant cheese burgers both gaining in weight and ability to run faster?