Big data is all the rage, but sometimes you don’t have big data. Sometimes you don’t even have average size data. Sometimes you only have eleven unique socks:
Karl Broman is here putting forward a very interesting problem. Interesting, not only because it involves socks, but because it involves what I would like to call Tiny Data™. The problem is this: Given the Tiny dataset of eleven unique socks, how many socks does Karl Broman have in his laundry in total?
If we had Big Data we might have been able to use some clever machine learning algorithm to solve this problem such as bootstrap aggregated neural networks. But we don’t have Big Data, we have Tiny Data. We can’t pull ourselves up by our bootstraps because we only have socks (eleven to be precise). Instead we will have to build a statistical model that includes a lot more problem specific information. Let’s do that!
We are going to start by building a generative model, a simulation of the I’m-picking-out-socks-from-my-laundry process. First we have a couple of parameters to which, just for now, I am going to give arbitrary values:
n_socks <- 18 # The total number of socks in the laundry
n_picked <- 11 # The number of socks we are going to pick
In an ideal world all socks would come in pairs, but we’re not living in an ideal world and some socks are odd (aka singletons). So out of the n_socks, let’s say we have:
n_pairs <- 7 # for a total of 7*2 = 14 paired socks
n_odd <- 4   # the remaining 4 socks are singletons
We are now going to create a vector of socks, represented as integers, where each pair/singleton is given a unique number.
socks <- rep( seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)) )
socks
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 9 10 11
Finally we are going to simulate picking out n_picked socks (or all n_socks socks, if n_picked > n_socks) and counting the number of sock pairs and unique socks.
picked_socks <- sample(socks, size = min(n_picked, n_socks))
sock_counts <- table(picked_socks)
sock_counts
## picked_socks
## 1 3 4 5 7 8 9 10 11
## 1 2 2 1 1 1 1 1 1
c(unique = sum(sock_counts == 1), pairs = sum(sock_counts == 2))
## unique pairs
## 7 2
So for this particular run of the sock picking simulation we ended up with two pairs and seven unique socks. So far so good, but what about the initial problem: how to estimate the actual number of socks in Karl’s laundry? Oh, but what you might not realize is that we are almost done! :)
Approximate Bayesian Computation (ABC) is a super cool method for fitting models with the benefits of (1) being pretty intuitive and (2) only requiring the specification of a generative model, and with the disadvantages of (1) being extremely computationally inefficient if implemented naïvely and (2) requiring quite a bit of tweaking to work correctly when working with even quite small datasets. But we are not working with Quite Small Data, we are working with Tiny Data! Therefore we can afford a naïve and inefficient (but straightforward) implementation. Fiercely hand-waving, the simple ABC rejection algorithm goes like this:

1. Draw a candidate set of parameter values from the prior.
2. Simulate a dataset from the generative model using the candidate parameter values.
3. If the simulated dataset matches the actual data, keep the candidate parameter values, otherwise throw them away.
4. Repeat steps 1-3 a large number of times; the retained parameter values now form a sample from the posterior distribution.
For a less brief introduction to ABC see the tutorial on Darren Wilkinson’s blog. The paper by Rubin (1984) is also a good read, even if it doesn’t explicitly mention ABC.
So what’s left until we can estimate the number of socks in Karl Broman’s laundry? Well, we have a reasonable generative model, however, we haven’t specified any prior distributions over the parameters we are interested in: n_socks, n_pairs and n_odd. Here we can’t afford to use non-informative priors, that’s a luxury reserved for the Big Data crowd, we need to use all the information we have. Also, the trade-off isn’t so much about how informative/biased we should be but rather about how much time/resources we can spend on developing reasonable priors. The following is what I whipped up in half an hour and which could surely be improved upon:
What can be said about n_socks, the number of socks in Karl Broman’s laundry, before seeing any data? Well, we know that n_socks must be positive (no anti-socks) and discrete (socks are not continuous). A reasonable choice would perhaps be to use a Poisson distribution as a prior, however the Poisson is problematic in that both its mean and its variance are set by the same parameter. Instead we could use the more flexible cousin of the Poisson, the negative binomial. In R the rnbinom function is parameterized with the mean mu and size. While size is not the variance, there is a direct correspondence between size and the variance s^2: since for the negative binomial s^2 = mu + mu^2 / size, solving for size gives
size = -mu^2 / (mu - s^2)
If you are a family of 3-4 persons and you change socks around 5 times a week then a guesstimate would be that you have something like 15 pairs of socks in the laundry. It is reasonable that you at least have some socks in your laundry, but it is also possible that you have many more than 15 * 2 = 30 socks. So as a prior for n_socks I’m going to use a negative binomial with mean prior_mu = 30 and standard deviation prior_sd = 15.
prior_mu <- 30
prior_sd <- 15
prior_size_param <- -prior_mu^2 / (prior_mu - prior_sd^2)
n_socks <- rnbinom(1, mu = prior_mu, size = prior_size_param)
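As a quick sanity check of this parameterization (my addition, not part of the original derivation), we can draw a large sample and verify that its mean and standard deviation land close to the intended prior_mu and prior_sd:

```r
prior_mu <- 30
prior_sd <- 15
prior_size_param <- -prior_mu^2 / (prior_mu - prior_sd^2)
set.seed(123)
draws <- rnbinom(100000, mu = prior_mu, size = prior_size_param)
mean(draws)  # close to 30
sd(draws)    # close to 15
```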
Instead of putting a prior distribution directly over n_pairs and n_odd I’m going to put it over the proportion of socks that come in pairs, prop_pairs. I know some people keep all their socks neatly paired, but only 3/4 of my socks are in a happy relationship. So on prop_pairs I’m going to put a Beta prior distribution that puts most of the probability over the range 0.75 to 1.0. Since socks are discrete entities we’ll also have to do some rounding to go from prop_pairs to n_pairs and n_odd.
prop_pairs <- rbeta(1, shape1 = 15, shape2 = 2)
n_pairs <- round(floor(n_socks / 2) * prop_pairs)
n_odd <- n_socks - n_pairs * 2
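To check that the Beta(15, 2) distribution really does what we want (again my addition), we can compute how much prior probability mass it actually puts on the range 0.75 to 1.0:

```r
pbeta(1.0, shape1 = 15, shape2 = 2) - pbeta(0.75, shape1 = 15, shape2 = 2)
## about 0.94, so most of the prior mass is where we want it
```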
Now we have a generative model, with reasonable priors, and what’s left is to use the ABC rejection algorithm to generate a posterior distribution of the number of socks in Karl Broman’s laundry. The following code brings all the earlier steps together and generates 100,000 samples from the generative model, which are saved, together with the corresponding parameter values, in sock_sim:
n_picked <- 11 # The number of socks to pick out of the laundry
sock_sim <- replicate(100000, {
# Generating a sample of the parameters from the priors
prior_mu <- 30
prior_sd <- 15
prior_size <- -prior_mu^2 / (prior_mu - prior_sd^2)
n_socks <- rnbinom(1, mu = prior_mu, size = prior_size)
prop_pairs <- rbeta(1, shape1 = 15, shape2 = 2)
n_pairs <- round(floor(n_socks / 2) * prop_pairs)
n_odd <- n_socks - n_pairs * 2
# Simulating picking out n_picked socks
socks <- rep(seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)))
picked_socks <- sample(socks, size = min(n_picked, n_socks))
sock_counts <- table(picked_socks)
# Returning the parameters and counts of the number of matched
# and unique socks among those that were picked out.
c(unique = sum(sock_counts == 1), pairs = sum(sock_counts == 2),
n_socks = n_socks, n_pairs = n_pairs, n_odd = n_odd, prop_pairs = prop_pairs)
})
# just transposing sock_sim to get one variable per column
sock_sim <- t(sock_sim)
head(sock_sim)
## unique pairs n_socks n_pairs n_odd prop_pairs
## [1,] 7 2 32 15 2 0.9665
## [2,] 7 2 21 9 3 0.9314
## [3,] 3 4 20 8 4 0.8426
## [4,] 11 0 47 23 1 0.9812
## [5,] 9 1 36 15 6 0.8283
## [6,] 7 2 16 5 6 0.6434
We have used quite a lot of prior knowledge, but so far we have not used the actual data. In order to turn our simulated samples in sock_sim into posterior samples, informed by the data, we need to throw away those simulations that resulted in simulated data that doesn’t match the actual data. The data we have is that out of eleven picked socks, eleven were unique and zero were matched, so let’s remove all simulated samples that do not match this.
post_samples <- sock_sim[sock_sim[, "unique"] == 11 &
sock_sim[, "pairs" ] == 0 , ]
And we are done! Given the model and the data, the 11,506 remaining samples in post_samples now represent the information we have about the number of socks in Karl Broman’s laundry. What remains is just to explore what post_samples says about the number of socks. The following plot shows the prior sock distributions in green, and the posterior sock distributions (post_samples) in blue:
Here the vertical red lines show the median posterior, a “best guess” for the respective parameter. There is a lot of uncertainty in the estimates, but our best guess (the median posterior) would be that Karl Broman’s laundry contains 19 pairs of socks and 6 odd socks, for a total of 19 × 2 + 6 = 44 socks.
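For completeness, here is a compact, self-contained version of the whole pipeline that also computes these median point estimates (I use 20,000 simulations instead of 100,000 to keep it quick, so the exact numbers will wobble a little from run to run):

```r
set.seed(42)
n_picked <- 11
sock_sim <- t(replicate(20000, {
  # Sample parameters from the priors
  n_socks <- rnbinom(1, mu = 30, size = -30^2 / (30 - 15^2))
  prop_pairs <- rbeta(1, shape1 = 15, shape2 = 2)
  n_pairs <- round(floor(n_socks / 2) * prop_pairs)
  n_odd <- n_socks - n_pairs * 2
  # Simulate picking out n_picked socks and count pairs and uniques
  socks <- rep(seq_len(n_pairs + n_odd), rep(c(2, 1), c(n_pairs, n_odd)))
  sock_counts <- table(sample(socks, size = min(n_picked, n_socks)))
  c(unique = sum(sock_counts == 1), pairs = sum(sock_counts == 2),
    n_socks = n_socks, n_pairs = n_pairs, n_odd = n_odd)
}))
# Keep only the simulations that match the data: 11 unique socks, 0 pairs
post_samples <- sock_sim[sock_sim[, "unique"] == 11 & sock_sim[, "pairs"] == 0, ]
# Median posterior point estimates, roughly 44 socks in total
apply(post_samples[, c("n_socks", "n_pairs", "n_odd")], 2, median)
```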
How well did we do? Fortunately Karl Broman later tweeted the actual number of pairs and odd socks:
Which totals 21 × 2 + 3 = 45 socks. We were only off by one sock! Our estimate of the number of odd socks was a little bit high (Karl Broman is obviously much better at organizing his socks than me) but otherwise we are actually amazingly spot on! All thanks to carefully selected priors, approximate Bayesian computation and Tiny Data.
For full disclosure I must mention that Karl made this tweet before I had finished the model, however, I tried hard not to be influenced by that when choosing the priors of the model.
If you look at the plot of the priors and posteriors above you will notice that they are quite similar. “Did we even use the data?”, you might ask, “it didn’t really seem to make any difference anyway…”. Well, since we are working with Tiny Data it is only natural that the data does not make a huge difference. It does however make a difference. Without any data (also called No Data™) the median posterior number of socks would be 28, a much worse estimate than the Tiny Data estimate of 44. We could also see what would happen if we fit the model to a different dataset. Say we got four pairs of socks and three unique socks out of the eleven picked socks. We would then calculate the posterior like this…
post_samples <- sock_sim[sock_sim[, "unique"] == 3 &
sock_sim[, "pairs" ] == 4 , ]
The posterior would look very different and our estimate of the number of total socks would be as low as 15:
So our Tiny Data does matter! There are, of course, other ways to criticize my model.
As a final note, don’t forget that you can follow Karl Broman on Twitter, he is a cool guy (even his tweets about laundry are thought provoking!) and if you aren’t already tired of clothes related analyses and Tiny Data you can follow me too :)
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151-1172. fulltext
As the normal distribution is sort of the default choice when modeling continuous data (but not necessarily the best choice), the Poisson distribution is the default when modeling counts of events. Indeed, when all you know is the number of events during a certain period it is hard to think of any other distribution, whether you are modeling the number of deaths in the Prussian army due to horse kicks or the number of goals scored in a football game. Like the t.test in R there is also a poisson.test that takes one or two samples of counts and spits out a p-value. But what if you have some counts, but don’t significantly feel like testing a null hypothesis? Stay tuned!
Bayesian First Aid is an attempt at implementing reasonable Bayesian alternatives to the classical hypothesis tests in R. For the rationale behind Bayesian First Aid see the original announcement. The development of Bayesian First Aid can be followed on GitHub. Bayesian First Aid is a work in progress and I’m grateful for any suggestion on how to improve it!
The original poisson.test function that comes with R is rather limited, and that makes it fairly simple to construct the Bayesian alternative. However, at first sight poisson.test may look more limited than it actually is. The one sample version just takes one count of events $x$ and the number of periods $T$ during which the events were counted. If your ice cream truck sold 14 ice creams during one day you would call the function like poisson.test(x = 14, T = 1). This seems limited; what if you have a number of counts, say you sell ice cream during a whole week, what to do then? The trick is that you can add up the counts and the number of time periods and this will be perfectly fine. The code below will still give you an estimate for the underlying rate of ice cream sales per day:
ice_cream_sales = c(14, 16, 9, 18, 10, 6, 13)
poisson.test(x = sum(ice_cream_sales), T = length(ice_cream_sales))
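To see what the aggregation does (a check I’m adding here), note that the rate estimate behind that call is simply the total count divided by the total number of periods:

```r
ice_cream_sales <- c(14, 16, 9, 18, 10, 6, 13)
sum(ice_cream_sales) / length(ice_cream_sales)  # about 12.3 ice creams per day

# poisson.test reports the same point estimate, plus a confidence interval
fit <- poisson.test(x = sum(ice_cream_sales), T = length(ice_cream_sales))
fit$estimate  # same value; the interval is available in fit$conf.int
```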
Note that this only works if the counts are well modeled by the same Poisson distribution. If the ice cream sales are much higher on the weekends, adding up the counts might not be a good idea. poisson.test is also limited in that it can only handle two counts; you can compare the performance of your ice cream truck with just one competitor’s, no more. As the Bayesian alternative accepts the same input as poisson.test it inherits some of its limitations (but it can easily be extended, read on!). The model for the Bayesian First Aid alternative to the one sample poisson.test is:

$$x \sim \text{Poisson}(\lambda T)$$

$$\lambda \sim \text{Gamma}(0.5, 0.00001)$$
Here $x$ is again the count of events, $T$ is the number of periods, and $\lambda$ is the parameter of interest, the underlying rate at which the events occur. In the two sample case the one sample model is just separately fitted to each sample.
As $x$ is assumed to be Poisson distributed, all that is required to turn this into a fully Bayesian model is a prior on $\lambda$. In the literature there are two common recommendations for an objective prior for the rate of a Poisson distribution. The first one is $p(\lambda) \propto 1 / \lambda$ which is the same as $p(\log(\lambda)) \propto \text{const}$ and is proposed, for example, by Villegas (1977). While it can be argued that this prior is as non-informative as possible, it is problematic in that it will result in an improper posterior when the number of events is zero ($x = 0$). I feel that seeing zero events should tell the model something and, at least, not cause it to blow up. The second proposal is Jeffreys prior $p(\lambda) \propto 1 / \sqrt{\lambda}$ (as proposed by the great BUGS Book), which has a slight positive bias compared to the former prior but handles counts of zero just fine. The difference between these two priors is very small and will only matter when you have very few counts. Therefore the Bayesian First Aid alternative to poisson.test uses Jeffreys prior.
So if the model uses Jeffreys prior, what is the $\lambda \sim \text{Gamma}(0.5, 0.00001)$ doing in the model definition? Well, the computational framework underlying Bayesian First Aid is JAGS, and in JAGS you build your model using proper probability distributions. The Jeffreys prior is not a proper probability distribution, but it turns out that it can be reasonably well approximated by $\text{Gamma}(0.5, \epsilon)$ with $\epsilon \rightarrow 0$ (in the same way as $p(\lambda) \propto 1 / \lambda$ can be approximated by $\text{Gamma}(\epsilon, \epsilon)$ with $\epsilon \rightarrow 0$).
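To make the approximation concrete (my own illustration): the Gamma(0.5, ε) density is proportional to λ^(-1/2) e^(-ελ), so for a tiny ε the exponential factor is essentially 1 over any plausible range of λ, leaving a density proportional to 1/√λ, just like Jeffreys prior:

```r
lambda <- c(0.1, 1, 10, 100)
ratio <- dgamma(lambda, shape = 0.5, rate = 1e-5) / lambda^(-0.5)
ratio / ratio[1]  # essentially constant, so the density is ~ 1 / sqrt(lambda)
```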
The bayes.poisson.test Function

The bayes.poisson.test function accepts the same arguments as the original poisson.test function: you can give it one or two counts of events. If you just ran poisson.test(x = 14, T = 1), prepending bayes. runs the Bayesian First Aid alternative and prints out a summary of the model result (like bayes.poisson.test(x = 14, T = 1)). By saving the output, for example like fit <- bayes.poisson.test(x = 14, T = 1), you can inspect it further using plot(fit), summary(fit) and diagnostics(fit).
To demonstrate the use of bayes.poisson.test I will use data from Boice and Monson (1977) on the number of women diagnosed with breast cancer in one group of 1,047 tuberculosis patients that had received on average 102 X-ray exams and one group of 717 tuberculosis patients whose treatment had not required a large number of X-ray exams. Here is the full data set:
Here WY stands for woman-years (as if woman-years would be different from man-years, or person-years…). While the data is from a relatively old article, we are going to replicate a more recent reanalysis of that data from the article Testing the Ratio of Two Poisson Rates by Gu et al. (2008). They tested the alternative hypothesis that the rate of breast cancer per person-year would be 1.5 times greater in the group that was X-rayed than in the control group. They tested it like this:
no_cancer_cases <- c(41, 15)
# person-millennia rather than person-years to get the estimated rate
# on a more interpretable scale.
person_millennia <- c(28.011, 19.025)
poisson.test(no_cancer_cases, person_millennia, r = 1.5, alternative = "greater")
##
## Comparison of Poisson rates
##
## data: no_cancer_cases time base: person_millennia
## count1 = 41, expected count1 = 41.61, p-value = 0.291
## alternative hypothesis: true rate ratio is greater than 1.5
## 95 percent confidence interval:
## 1.098 Inf
## sample estimates:
## rate ratio
## 1.856
and concluded that “There is not enough evidence that the incidence rate of breast cancer in the X-ray fluoroscopy group is 1.5 times to the incidence rate of breast cancer in control group”. It is oh-so-easy to interpret this as saying that there is no evidence that the incidence rate is more than 1.5 times higher, but this is wrong, and the Bayesian First Aid alternative makes this clear:
library(BayesianFirstAid)
bayes.poisson.test(no_cancer_cases, person_millennia, r = 1.5, alternative = "greater")
## Warning: The argument 'alternative' is ignored by bayes.poisson.test
##
## Bayesian First Aid poisson test - two sample
##
## number of events: 41 and 15, time periods: 28.011 and 19.025
##
## Estimates [95% credible interval]
## Group 1 rate: 1.5 [1.1, 1.9]
## Group 2 rate: 0.80 [0.43, 1.2]
## Rate ratio (Group 1 rate / Group 2 rate):
## 1.8 [1.1, 3.4]
##
## The event rate of group 1 is more than 1.5 times that of group 2 by a probability
## of 0.754 and less than 1.5 times that of group 2 by a probability of 0.246 .
The warning here is nothing to worry about; there is no need to specify which alternative is tested, and bayes.poisson.test just tells you that. So sure, the evidence is far from conclusive, but given the data and the model there is a 75% probability that the incidence rate is more than 1.5 times higher in the X-rayed group. That is, rather than just saying that there is not enough evidence, we have quantified how much evidence there is, and the evidence actually slightly favors the alternative hypothesis. This is also easily seen in the default plot of bayes.poisson.test:
plot( bayes.poisson.test(no_cancer_cases, person_millennia, r = 1.5) )
Back to the ice cream truck, say that you sold 14 ice creams in one day and your competitors Karl and Anna sold 22 and 7 ice creams, respectively. How would you estimate and compare the underlying rates of sold ice creams of these three trucks when bayes.poisson.test only accepts counts from two groups? When you want to go off the beaten path the model.code function is your friend, as it takes the result from a Bayesian First Aid method and returns R and JAGS code that replicates the analysis you just ran. In this case, start by running the model with two counts and then print out the model code:
fit <- bayes.poisson.test(x = c(14, 22), T = c(1, 1))
model.code(fit)
### Model code for the Bayesian First Aid two sample Poisson test ###
require(rjags)
# Setting up the data
x <- c(14, 22)
t <- c(1, 1)
# The model string written in the JAGS language
model_string <- "model {
for(group_i in 1:2) {
x[group_i] ~ dpois(lambda[group_i] * t[group_i])
lambda[group_i] ~ dgamma(0.5, 0.00001)
x_pred[group_i] ~ dpois(lambda[group_i] * t[group_i])
}
rate_diff <- lambda[1] - lambda[2]
rate_ratio <- lambda[1] / lambda[2]
}"
# Running the model
model <- jags.model(textConnection(model_string), data = list(x = x, t = t), n.chains = 3)
samples <- coda.samples(model, c("lambda", "x_pred", "rate_diff", "rate_ratio"), n.iter=5000)
# Inspecting the posterior
plot(samples)
summary(samples)
Just copy-n-paste this code directly into an R script and make the following changes:

x <- c(14, 22) → x <- c(14, 22, 7)
t <- c(1, 1) → t <- c(1, 1, 1)
for(group_i in 1:2) { → for(group_i in 1:3) {
And that’s it! Now we can run the model script and take a look at the estimated rates of ice cream sales for the three trucks.
plot(samples)
If you want to compare many groups you should perhaps consider using a hierarchical Poisson model. (Pro tip: John K. Kruschke’s Doing Bayesian Data Analysis has a great chapter on hierarchical Poisson models.)
Boice, J. D., & Monson, R. R. (1977). Breast cancer in women after repeated fluoroscopic examinations of the chest. Journal of the National Cancer Institute, 59(3), 823-832. link to article (unfortunately behind paywall)
Gu, K., Ng, H. K. T., Tang, M. L., & Schucany, W. R. (2008). Testing the ratio of two poisson rates. Biometrical Journal, 50(2), 283-298. doi: 10.1002/bimj.200710403, pdf
Villegas, C. (1977). On the representation of ignorance. Journal of the American Statistical Association, 72(359), 651-654. doi: 10.2307/2286233
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: a practical introduction to Bayesian analysis. CRC Press. pdf of chapter 5 on Prior distributions
Inspired by events that took place at UseR 2014 last month I decided to implement an app that estimates one’s blood alcohol concentration (BAC). Today I present to you drinkR, implemented using R and Shiny, Rstudio’s framework for building web apps using R. So, say that I had a good dinner, drinking a couple of glasses of wine, followed by an evening at a divy karaoke bar, drinking a couple of red needles and a couple of beers. By entering my sex, height and weight and the times when I drank the drinks in the drinkR app I end up with this estimated BAC curve:
(Now I might be totally off with what drinks I had and when but Romain Francois, Karl Broman, Sandy Griffith, Karthik Ram and Hilary Parker can probably fill in the details.) If you want to estimate your current BAC (or a friend’s…) then head over to the drinkr app hosted at ShinyApps.io. If you want to know how the app estimates BAC read on below. The code for drinkR is available on GitHub, any suggestion on how it can be improved is greatly appreciated.
drinkR estimates the BAC according to the formulas given in The estimation of blood alcohol concentration by Posey and Mozayani (2007). I was also helped by reading through Computer simulation analysis of blood alcohol and the Widmark factor (explained below) was calculated according to The calculation of blood ethanol concentrations in males and females. Unfortunately all these articles are behind paywalls, that is how most publicly funded research works these days…
The BAC estimates you get out of drinkR will be as good as the formulas in Posey and Mozayani (2007). I don’t know how good they are and I don’t know how well they’ll fit you. Estimating BAC is of course a prediction problem, and what you really would want is data, so that you could build a predictive model and get an idea of how well it predicts BAC. Unfortunately I haven’t found any data on this, so the Posey and Mozayani formulas are as good as I can do.
Estimating the BAC (according to Posey and Mozayani, 2007) after you have drunk, say, a beer requires “simulating” three processes:
Alcohol absorption. Just because you drank a beer doesn’t mean it goes directly into your blood stream, it has to be absorbed by your digestive system first and this takes some time.
Alcohol distribution. Your BAC depends on how much of you the absorbed alcohol will be “diluted” by. This depends on, among other things, your weight, height and sex.
Alcohol elimination. How drunk you get (and how soon you will be sober again) depends on how fast your body eliminates the absorbed alcohol.
Alcohol absorption can be approximated by assuming it is first order, that is, assuming there is an alcohol halflife, a time it takes for half of a drink to be absorbed. When measured, this halflife tends to be between 6 and 18 minutes, depending on how much you have recently eaten. If you haven’t eaten for a while your halflife might be closer to 6 minutes, while if you just had a big döner kebab it might be closer to 18 minutes.
Alcohol distribution depends on the amount of water that the alcohol in your body will be diluted in. It can be estimated by the following equation:
$$ C = {A \over rW}$$
where $C$ is the alcohol concentration, $A$ is the mass of the alcohol, $W$ is your body weight and $r$ is the Widmark factor. This factor can be seen as an adjustment that is necessary because your whole body is not made of water, thus the alcohol is not “diluted by” your whole weight. There are many different formulas for estimating $r$ and drinkR uses the one given by Seidl et al. (2000) which estimates $r$ dependent on sex, height and weight:
$$r_{\text{female}} = 0.31 - 0.0064 \times \text{weight in kg} + 0.0045 \times \text{height in cm}$$

$$r_{\text{male}} = 0.32 - 0.0048 \times \text{weight in kg} + 0.0046 \times \text{height in cm}$$
These linear equations can give really strange values for $r$, for example if you weigh a lot. Therefore I also bound $r$ to be within the limits found by Seidl et al. (2000): 0.44 to 0.80 for women and 0.60 to 0.87 for men.
Finally, alcohol elimination can be reasonably approximated by a constant elimination rate of the BAC. This rate can vary from around 0.009 % per hour to 0.035 % per hour with 0.018 % per hour being a reasonable average.
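These three processes can be combined into a rough sketch along the following lines. To be clear, this is my condensed reading of the formulas above, not drinkR’s actual code; the function name, the 12 g of alcohol per standard drink, and subtracting elimination from the time of drinking are all simplifying assumptions:

```r
# Sketch: estimate BAC (in percent, roughly g per 100 ml of blood) over time
# for one standard drink taken at t = 0. All constants are assumptions.
estimate_bac <- function(t_hours, weight_kg, height_cm, sex = c("male", "female"),
                         alcohol_g = 12,        # grams of alcohol in the drink
                         halflife_h = 12 / 60,  # absorption halflife, ~6-18 min
                         elim_rate = 0.018) {   # % BAC eliminated per hour
  sex <- match.arg(sex)
  # Widmark factor from Seidl et al. (2000), bounded to the observed range
  if (sex == "male") {
    r <- min(max(0.32 - 0.0048 * weight_kg + 0.0046 * height_cm, 0.60), 0.87)
  } else {
    r <- min(max(0.31 - 0.0064 * weight_kg + 0.0045 * height_cm, 0.44), 0.80)
  }
  # First-order absorption: grams absorbed by time t
  absorbed_g <- alcohol_g * (1 - 0.5^(t_hours / halflife_h))
  # Widmark dilution: absorbed_g / (r * weight) gives g/kg; the factor 0.1
  # converts to percent (g per 100 ml, assuming blood density ~1 kg/l)
  bac <- 0.1 * absorbed_g / (r * weight_kg)
  # Constant-rate elimination, floored at zero
  pmax(bac - elim_rate * t_hours, 0)
}

estimate_bac(t_hours = c(0, 0.5, 1, 2, 4), weight_kg = 75, height_cm = 180)
```

One standard drink barely registers and is eliminated within a few hours; stacking several drinks (summing curves shifted by their time stamps) is what produces a curve like the one shown above.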
drinkR puts these three processes together and estimates your BAC over time, given a number of drinks with time stamps. Assuming that you are also interested in how drunk you are right now, drinkR shows an estimate of your current BAC by fetching your computer’s local time (see this stackoverflow question for how this is done). The estimate given by drinkR might be very misleading, so don’t use it for any serious purposes! To get a sense of the uncertainty in the BAC estimate, play around with the parameters (especially the alcohol elimination rate) and see how much your BAC curve changes.
If you want to see how different levels of BAC could affect you see the Progressive effects of alcohol chart over at Wikipedia and if you want to try out drinkR live I would recommend one of my favorite drinks: Absinthe mixed with Orange soda (say Fanta orange). It’s better than you think it is! :)
Posey, D., & Mozayani, A. (2007). The estimation of blood alcohol concentration. Forensic Science, Medicine, and Pathology, 3(1), 33-39. Link (Unfortunately behind paywall)
Rockerbie, D. W., & Rockerbie, R. A. (1995). Computer simulation analysis of blood alcohol. Journal of clinical forensic medicine, 2(3), 137-141. Link (Unfortunately behind paywall)
Seidl, S., Jensen, U., & Alt, A. (2000). The calculation of blood ethanol concentrations in males and females. International journal of legal medicine, 114(1-2), 71-77. Link (Unfortunately behind paywall)
This year’s UseR! conference was held at the University of California in Los Angeles. Despite the great weather and a nearby beach, most of the conference was spent in front of projector screens in 18 °C (64 °F) rooms because there were so many interesting presentations and tutorials going on. I was lucky to present my R package Bayesian First Aid and the slides can be found here:
There was so much great stuff going on at UseR! and here follows a random sample:
John Chambers on Interfaces, Efficiency and Big Data. One of the creators of S (the predecessor of R) talked about the history of R and exciting new developments such as Rcpp11. He was also kind enough to sign my copy of S: An Interactive Environment for Data Analysis and Graphics, the original S book from 1984 :)
Yihui Xie the Knitr Ninja. Yihui held the most amazing presentation about how to be a Knitr ninja using only an R script and sound effects. The “anime sword” sound effect used by Yihui is just now available in the development version of beepr and can be played by running beep("sword").
Romain François held both a tutorial and a presentation on the Rcpp11 package, a most convenient way of connecting R and C++.
Dirk Eddelbuettel held a keynote on the topic of R, C++ and Rcpp, another convenient way of connecting R and C++. Do we see a theme here? He also talked about Docker, which I had never heard of before; it provides sort-of lightweight virtual machines that can be easily built and distributed (this is my interpretation, which might be a bit off).
Rstudio was otherwise running the show with great presentation with Winston Chang on ggvis, Joe Cheng on Shiny, J.J. Allaire and Kevin Ushey on Packrat - A Dependency Management System for R, Jeff Allen on The Next Generation of R Markdown and, of course, Hadley Wickham on dplyr: a grammar of data manipulation.
Dieter De Mesmaeker presented a poster on Rdocumentation.org a really nice web-interface to the documentation of R.
All in all, a great conference! I’m already looking forward to next year’s UseR! conference, which will be held at Aalborg University, not too far from where I live (at least compared to LA).