Hello stranger, and welcome! 👋😊
I'm Rasmus Bååth, data scientist, engineering manager, father, husband, tinkerer,
tweaker, coffee brewer, tea steeper, and, occasionally, publisher of stuff I find
interesting down below👇
Why R? Because S!
R is the open source implementation (and a pun!) of S, a language for statistical computing that was developed at Bell Labs in the late 1970s. After that, the implementation of S underwent a number of major revisions documented in a series of seminal books, often just referred to by the color of their covers: the Brown Book, the Blue Book, the White Book, and the Green Book. To satisfy my techno-historical lusts I recently acquired all these books and I thought I would share some tidbits from them, highlighting how S (and thus R) developed into what we today love and cherish. But first, here are the books in chronological order from left to right:
After having broken the Bayesian eggs and prepared your model in your statistical kitchen, the main dish is the posterior. The posterior is the posterior is the posterior: given the model and the data, it contains all the information you need, and anything else will be a little bit less nourishing. However, taking in the posterior in one gulp can be difficult; in all but the simplest cases it will be multidimensional and hard to plot. And even if it is one-dimensional and you could plot it (as, say, a density plot), that does not necessarily mean that it is easy to see what's going on.
One way of getting around this is to take one bite at a time and look at summaries of the marginal posteriors of the variables of interest, the two most common types of summary being point estimates and credible intervals (an interval that covers a certain percentage of the probability distribution). Here one is faced with a choice: which of the many ways of constructing point estimates and credible intervals should you choose? This is a perfectly good question that can be given an unhelpful answer (with a predictable follow-up question):
- That depends on your loss function.
- So which loss function should I use?
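To make this concrete, here is a quick sketch in R of how the choice of loss function plays out, assuming you already have a vector of samples from a marginal posterior (the rgamma draws below are just a stand-in for your actual posterior samples):

# Stand-in for samples from a marginal posterior
posterior <- rgamma(10000, shape = 4, rate = 2)
mean(posterior)    # the point estimate minimizing expected squared loss
median(posterior)  # the point estimate minimizing expected absolute loss
# The mode (minimizing 0-1 loss) can be read off a density estimate:
d <- density(posterior)
d$x[which.max(d$y)]
# And a 95% quantile credible interval:
quantile(posterior, c(0.025, 0.975))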
Big data is all the rage, but sometimes you don’t have big data. Sometimes you don’t even have average-sized data. Sometimes you only have eleven unique socks:
Karl Broman is here putting forward a very interesting problem. Interesting, not only because it involves socks, but because it involves what I would like to call Tiny Data™. The problem is this: Given the Tiny dataset of eleven unique socks, how many socks does Karl Broman have in his laundry in total?
If we had Big Data we might have been able to use some clever machine learning algorithm to solve this problem such as bootstrap aggregated neural networks. But we don’t have Big Data, we have Tiny Data. We can’t pull ourselves up by our bootstraps because we only have socks (eleven to be precise). Instead we will have to build a statistical model that includes a lot more problem specific information. Let’s do that!
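As a teaser, here is a minimal simulation sketch of the general approach: put priors on the unknowns, simulate laundries, and keep the simulations that match the data. The priors below are made up for illustration, not necessarily the ones the final model will use:

set.seed(42)
n_sims <- 100000
sock_sim <- replicate(n_sims, {
  # Made-up priors over the total number of socks and the
  # proportion of socks that come in pairs
  n_socks <- rnbinom(1, mu = 30, size = 4)
  prop_pairs <- rbeta(1, 15, 2)
  n_pairs <- round(floor(n_socks / 2) * prop_pairs)
  n_odd <- n_socks - n_pairs * 2
  # Simulate the laundry and pick eleven socks (or as many as there are)
  socks <- rep(seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)))
  picked <- sample(socks, min(11, n_socks))
  c(unique_picked = length(unique(picked)), n_socks = n_socks)
})
# Keep only the simulations that match the data: eleven picked, all unique
post_n_socks <- sock_sim["n_socks", sock_sim["unique_picked", ] == 11]
quantile(post_n_socks, c(0.025, 0.5, 0.975))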
As the normal distribution is sort of the default choice when modeling continuous data (but not necessarily the best choice), the Poisson distribution is the default when modeling counts of events. Indeed, when all you know is the number of events during a certain period, it is hard to think of any other distribution, whether you are modeling the number of deaths in the Prussian army due to horse kicks or the number of goals scored in a football game. Like the t.test in R, there is also a poisson.test that takes one or two samples of counts and spits out a p-value. But what if you have some counts, but don’t significantly feel like testing a null hypothesis? Stay tuned!
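For reference, the classical test looks something like this (the counts below are made up for illustration):

# One sample: 196 events over 280 time periods, testing a rate of 0.5
poisson.test(x = 196, T = 280, r = 0.5)
# Two samples: compare two counts observed over different exposures
poisson.test(x = c(11, 6), T = c(800, 1083))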
I was lucky enough to be presenting at the 13th International Conference on Music Perception and Cognition in Seoul, South Korea, last week. It was a very inspiring conference and I really like South Korea (especially the amazing food…). My presentation was on the topic of subjective rhythmization, a fascinating auditory illusion. See below for my presentation slides and for a short conference paper that was published in the proceedings of the conference:
Bååth, R., & Ingvarsdóttir, K. O. (2014). Subjective Rhythmization: A Replication and an Extension. Proceedings of the 13th International Conference on Music Perception and Cognition, Seoul, South Korea.
pdf of full paper
Inspired by events that took place at UseR 2014 last month, I decided to implement an app that estimates one’s blood alcohol concentration (BAC). Today I present to you drinkR, implemented using R and Shiny, RStudio’s framework for building web apps using R. So, say that I had a good dinner, drinking a couple of glasses of wine, followed by an evening at a divey karaoke bar, drinking a couple of red needles and a couple of beers. By entering my sex, height, and weight, and the times when I drank the drinks, in the drinkR app I end up with this estimated BAC curve:
(Now, I might be totally off with what drinks I had and when, but Romain Francois, Karl Broman, Sandy Griffith, Karthik Ram, and Hilary Parker can probably fill in the details.) If you want to estimate your current BAC (or a friend’s…) then head over to the drinkR app hosted at ShinyApps.io. If you want to know how the app estimates BAC, read on below.
The code for drinkR is available on GitHub; any suggestions on how it can be improved are greatly appreciated.
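If you are curious about the gist of it, a common way to estimate BAC is Widmark’s formula, and a rough sketch of that calculation in R could look like the function below. The name estimate_bac and the exact constants are my own stand-ins here, not necessarily exactly what drinkR does:

# A rough Widmark-style BAC estimate, as a percentage by mass
estimate_bac <- function(alcohol_grams, weight_kg, sex = c("male", "female"),
                         hours_since_drink) {
  sex <- match.arg(sex)
  r <- if (sex == "male") 0.68 else 0.55  # Widmark body water constant
  beta <- 0.015                           # assumed elimination rate, % per hour
  bac <- alcohol_grams / (weight_kg * 1000 * r) * 100 - beta * hours_since_drink
  max(bac, 0)                             # BAC can't drop below zero
}
# Two standard drinks (about 2 * 14 g of alcohol) three hours ago, 80 kg male:
estimate_bac(28, 80, "male", 3)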
This year’s UseR! conference was held at the University of California, Los Angeles. Despite the great weather and a nearby beach, most of the conference was spent in front of projector screens in 18 °C (64 °F) rooms, because there were so many interesting presentations and tutorials going on. I was lucky to present my R package Bayesian First Aid, and the slides can be found here:
Even though I said it would never happen, my silly package with the sole purpose of playing notification sounds is now on CRAN. Big thanks to the CRAN maintainers for their patience! For instant gratification, run the following in R to install beepr and make R produce a notification sound:
install.packages("beepr")
library(beepr)
beep()
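If you want something other than the default ping, beep() also takes a sound argument, given as a number or a name, for example:

beep("coin")   # a coin pickup sound
beep("mario")  # a short tune instead of a plain ping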
Does pill A or pill B save the most lives? Which web design results in the most clicks? Which in vitro fertilization technique results in the largest number of happy babies? A lot of questions out there involve estimating the proportion or relative frequency of success of two or more groups (where a success could be a saved life, a click on a link, or a happy baby), and there exists a little-known R function that does just that: prop.test. Here I’ll present the Bayesian First Aid version of this procedure. A word of caution: the example data I’ll use is mostly from the Journal of Human Reproduction and as such it might be slightly NSFW :)
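As a baseline, the classical version looks something like this (the counts are made up for illustration):

# Compare the proportion of successes in two groups:
# 75 successes out of 100 trials versus 57 out of 100
prop.test(x = c(75, 57), n = c(100, 100))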
As I’m more or less an autodidact when it comes to statistics, I have a weak spot for books that try to introduce statistics in an accessible and pedagogical way. I have therefore collected what I believe are all the books that introduce statistics using comics (at least those written in English). What follows are highly subjective reviews of those four books. If you know of any other comic book on statistics, please do tell me!
I’ll start with a tl;dr version of the reviews, but first here are the four books: