Does pill A or pill B save the most lives? Which web design results in the most clicks? Which in vitro fertilization technique results in the largest number of happy babies? A lot of questions out there involve estimating the proportion or relative frequency of success of two or more groups (where success could be a saved life, a click on a link, or a happy baby), and there exists a little-known R function that does just that: `prop.test`. Here I'll present the Bayesian First Aid version of this procedure. A word of caution: the example data I'll use is mostly from the journal Human Reproduction, and as such it might be slightly NSFW :)
Bayesian First Aid is an attempt at implementing reasonable Bayesian alternatives to the classical hypothesis tests in R. For the rationale behind Bayesian First Aid see the original announcement. The development of Bayesian First Aid can be followed on GitHub. Bayesian First Aid is a work in progress and I'm grateful for any suggestions on how to improve it!
The Model
This is a straightforward extension of the Bayesian First Aid alternative to the binomial test, which can be used to estimate the underlying relative frequency of success given a number of trials and, out of them, a number of successes. The model for `bayes.prop.test` is just more of the same thing: we estimate the relative frequencies of success for two or more groups instead. Below is the full model, where $\theta_i$ is the relative frequency of success estimated given $x_i$ successes out of $n_i$ trials:
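Spelled out (assuming the same uniform $\mathrm{Beta}(1, 1)$ prior used by the Bayesian First Aid binomial test that this model extends), for each group $i$:

$$x_i \sim \mathrm{Binomial}(\theta_i, n_i) \qquad \theta_i \sim \mathrm{Beta}(1, 1)$$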
The `bayes.prop.test` Function
The `bayes.prop.test` function accepts the same arguments as the original `prop.test` function: you can give it two vectors, one with counts of successes and one with counts of trials, or you can supply the same data as a matrix with two columns. If you just ran `prop.test(successes, trials)`, prepending `bayes.` (like `bayes.prop.test(successes, trials)`) runs the Bayesian First Aid alternative and prints out a summary of the model result. By saving the output, for example like `fit <- bayes.prop.test(successes, trials)`, you can inspect it further using `plot(fit)`, `summary(fit)` and `diagnostics(fit)`.
To demonstrate the use of `bayes.prop.test` I will use data from the Kinsey Institute for Research in Sex, Gender and Reproduction, as described in the article Genital asymmetry in men by Bogaert (1997). The data consists of survey answers from 6544 “postpubertal males with no convictions for felonies or misdemeanours” where the respondents, among other things, were asked two questions:
- Are you right or left handed?
- If you were standing up and had an erection, if you looked down at the penis, would it be pointing straight ahead or pointed somewhat to the right or left?
I don’t know about you, but the first question I had was: are right-handed people more likely to have it lean to the right, or perhaps the opposite? Here is the raw data given by Bogaert:
Seems like we have 4794 complete cases. Just looking at those with a right- or leftward disposition leaves us 1624 cases, with 275 right-leaners out of 1454 right-handers and 43 right-leaners out of 170 non-right-handers. Bogaert uses a chi-square test to analyze this data, but since we are interested in comparing proportions we’ll start with using `prop.test` instead:
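The call looks like this, using the counts above (first vector: number of right-leaners per group; second vector: group sizes):

```r
# Right-leaners out of right-handers and non-right-handers (Bogaert, 1997)
right_leaners <- c(275, 43)
group_sizes   <- c(1454, 170)
# Sample proportions are 275/1454 ~ 0.189 and 43/170 ~ 0.253
prop.test(right_leaners, group_sizes)
```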
Not too bad: we get both a confidence interval on the difference between the groups and maximum likelihood estimates. Let’s compare this with the Bayesian First Aid version:
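Running the Bayesian First Aid version is just a matter of prepending `bayes.` (this sketch assumes the BayesianFirstAid package has been installed from its GitHub repository):

```r
# Install once with: devtools::install_github("rasmusab/bayesian_first_aid")
library(BayesianFirstAid)
fit <- bayes.prop.test(c(275, 43), c(1454, 170))
fit  # prints point estimates, credible intervals and the group comparison
```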
Pretty similar estimates (and they should be), but now we also get estimates and credible intervals for both groups and the group difference. Looking at the estimates, it seems like both left- and right-handers tend to lean to the left much more often than to the right. We also get to know that the probability that left-handers lean to the right more often than right-handers is 97.7%. Interesting! Let’s look at this relation further by plotting the posterior distribution:
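The posteriors can be inspected with the generic `plot` function (assuming `fit` holds a saved `bayes.prop.test` result, as described above):

```r
plot(fit)  # posterior histograms for each group's frequency and for the difference
```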
So, it is most likely that left-handers lean to the right around 6–7 percentage points more often than right-handers (and Bogaert discusses some reasons why this might be the case). Still, the posterior of the group difference is pretty wide (and the credible interval kisses the 0.0), and even though the analysis is leaning (he he) towards there being a difference, it would be nice to have a few more data points to get a tighter estimate.
Using `bayes.prop.test` as a Replacement for `chisq.test`
I don’t like Pearson’s chi-squared test. It is used as a catch-all analysis for any table of counts, and what you get back is utterly uninformative: a p-value relating to the null hypothesis that the row variable is completely independent of the column variable (which is anyway known to be false a priori most of the time, see here and here for some discussion). If you have counts of successes for many groups and are interested in actually estimating group differences, `bayes.prop.test` can also be used as a replacement for `chisq.test`. Let’s look at an example, again with data from the journal Human Reproduction.
When doing in vitro fertilization the egg is fertilized by sperm outside the body and later, if successfully fertilized, reinserted into the uterus. The egg and the sperm are usually left together to coincubate for more than an hour before the egg is separated from the sperm and left to incubate by itself. Bungum et al. (2006) were interested in whether an ultra-short coincubation period of 30 s would work as well as a more conventional coincubation period of 90 min. Bungum et al. used a 30 s coincubation period on 389 eggs and a 90 min period on another batch of 388 eggs and looked at a number of measures, such as the number of fertilized eggs and the number of resulting embryos graded as high quality. Their analysis of the result is summarized in the table below:
Unfortunately they use chi-square tests to analyze these counts, and they don’t even report the full p-values; for all but one of the measures all we get to know is NS. Just looking at the raw data it seems like there is little difference between the proportions of fertilized eggs in the two groups, but there seems to be a difference in embryo quality, with more embryos in the 90 min group being of high quality (defined as grade 0 and 1). But the chi-square analysis says NS, which is interpreted in the result section as: “the two groups were comparable”. Let’s analyze the data with `bayes.prop.test`:
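The call has the same shape as before. The counts below are placeholders only, not the study data; substitute the actual numbers of high-quality embryos and total embryos per group from Bungum et al.’s table:

```r
library(BayesianFirstAid)
# PLACEHOLDER counts for illustration only: replace with the actual numbers
# of high-quality embryos (successes) and total embryos (trials) per group.
high_quality <- c(50, 70)    # hypothetical: 30 s group, 90 min group
embryos      <- c(200, 200)  # hypothetical group totals
fit <- bayes.prop.test(high_quality, embryos)
plot(fit)
```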
Looking at the posterior it seems like there is actually some evidence that the 30 s procedure results in fewer embryos of good quality, as most of the posterior probability is centered around a difference of 6–12 percentage points. Sure, the credible interval kisses zero, but the evidence for “no difference”, which was hinted at in the original article, is definitely not strong.
Using the concept of a region of practical equivalence (ROPE) we can calculate the probability that the difference between the two procedures is small. First we have to decide what would count as a small enough difference to be negligible. I have no strong intuition about what would be a small difference in this particular case, so I’m arbitrarily going to go with 5 percentage points, yielding a ROPE of [−5, 5] percentage points (for more about ROPEs see Kruschke, 2011). To calculate the probability that the difference between the two groups is within the ROPE I’ll extract the MCMC samples generated when the model was fit using `as.data.frame`, and then I’ll use them to calculate the probability “by hand”:
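A sketch of the “by hand” calculation (this assumes `fit` is the fitted object from above, and that the data frame of MCMC samples names the two group frequencies `theta1` and `theta2`):

```r
samples <- as.data.frame(fit)              # one row per MCMC sample
theta_diff <- samples$theta1 - samples$theta2
# Probability that the difference lies inside the ROPE of [-5, 5] percentage points
mean(theta_diff > -0.05 & theta_diff < 0.05)
```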
The probability that the relative frequency of high quality embryos is practically equivalent between the two procedures is only 20%, thus the probability that there is a substantial difference is 80%. The evidence for “no difference” is definitely weak here, and we would need more data to be able to state the magnitude of the difference with reasonable certainty.
Caveat: I know very little about in vitro fertilization and this is definitely not a critique of the study in any way. I don’t know what would be considered a region of practical equivalence in this case, and I don’t know if embryo quality is considered an important outcome. However, I still believe that the analysis would have been more informative if they had used something better than chi-square tests and p-values!
Comparing Many Groups
Like `prop.test`, `bayes.prop.test` can be used to compare more than two groups. Here is an example with a dataset from the `prop.test` help file on the number of smokers in four groups of patients with lung cancer.
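The counts come from the examples in `?prop.test`; fitting all four groups at once looks like this (again assuming the BayesianFirstAid package is installed):

```r
library(BayesianFirstAid)
smokers  <- c(83, 90, 129, 70)   # smokers in each of the four patient groups
patients <- c(86, 93, 136, 82)   # total number of patients per group
fit <- bayes.prop.test(smokers, patients)
fit
plot(fit)
```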
As is shown both in the printout and in the plot, group 4 seems to differ slightly from the rest. While `bayes.prop.test` can be used to compare more than four groups, both the printouts and the plots start to get a bit overwhelming when there are too many groups. A remedy for this is to use the `model.code` function, which prints out JAGS and R code that replicates the model you have fitted with `bayes.prop.test`; you can then customize this code further to make the plots and comparisons you are interested in.
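Calling `model.code(fit)` prints a self-contained script; the JAGS part of it is essentially the model from the top of the post (sketched in the comments below, not the verbatim printout):

```r
model.code(fit)
# The core of the printed JAGS model looks roughly like:
# model {
#   for (i in 1:length(x)) {
#     x[i] ~ dbin(theta[i], n[i])   # binomial likelihood per group
#     theta[i] ~ dbeta(1, 1)        # uniform prior on each group frequency
#   }
# }
```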
Another reason to modify the code printed out by `model.code` is to change the assumptions of the model. The current model does not assume any dependency between the groups, and if this is an unreasonable assumption you might want to modify the model code to include such a dependency. A nice example of how to extend the model to assume a hierarchical dependency between the relative frequencies of success of each group can be found on the LingPipe blog. Hierarchical binomial models are also discussed in chapter 9 of Kruschke’s Doing Bayesian Data Analysis and in section 5.3 of Gelman et al.’s Bayesian Data Analysis.
References
Bogaert, A. F. (1997). Genital asymmetry in men. Human Reproduction, 12(1), 68–72. doi: 10.1093/humrep/12.1.68
Bungum, M., Bungum, L., & Humaidan, P. (2006). A prospective study, using sibling oocytes, examining the effect of 30 seconds versus 90 minutes gamete coincubation in IVF. Human Reproduction, 21(2), 518–523. doi: 10.1093/humrep/dei350
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312. doi: 10.1177/1745691611406925