Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
While not working exactly like Norvig’s version, it should result in similar spelling corrections:
correct("piese")
## [1] "piece"
correct("ov")
## [1] "of"
correct("cakke")
## [1] "cake"
So let’s deobfuscate the two-liner slightly (however, the code below might not make sense if you don’t read Norvig’s essay first):
# Read in big.txt, a 6.5 MB collection of different English texts.
raw_text <- paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the occurrences of each unique word.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))
correct <- function(word) {
# Calculate the edit distance between the word and all other words in sorted_words.
edit_dist <- adist(word, sorted_words)
# Calculate the minimum edit distance, capped at two, so that we only
# propose words within two edits of the word to be corrected.
min_edit_dist <- min(edit_dist, 2)
# Generate a vector with all words with this minimum edit distance.
# Since sorted_words is ordered from most common to least common, the resulting
# vector will have the most common / probable match first.
proposals_by_prob <- sorted_words[edit_dist <= min_edit_dist]
# In case proposals_by_prob is empty, append the word to be corrected...
proposals_by_prob <- c(proposals_by_prob, word)
# ... and return the first / most probable word in the vector.
proposals_by_prob[1]
}
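To see the pieces in isolation without the 6.5 MB download, the same pipeline can be run on a tiny in-memory corpus (the corpus string below is made up purely for illustration):

```r
# A toy stand-in for big.txt (invented text, for illustration only).
corpus <- "the cake was a piece of cake and the cake was good"

# Same pipeline as above: lowercase, tokenize, count, sort by frequency.
sorted_words <- names(sort(table(strsplit(tolower(corpus), "[^a-z]+")[[1]]),
                           decreasing = TRUE))

correct <- function(word) {
  edit_dist <- adist(word, sorted_words)
  min_edit_dist <- min(edit_dist, 2)
  c(sorted_words[edit_dist <= min_edit_dist], word)[1]
}

correct("cakke")  # "cake"
correct("piese")  # "piece"
correct("xyzzy")  # "xyzzy" - nothing within two edits, returned unchanged
```

Note how `c(..., word)[1]` falls back to the original word when no corpus word is within two edits.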
Some thoughts:

- Much of the brevity of the R version is thanks to the adist function. (A one-line spell checker in R is indeed possible using the aspell function :)
- The creation of the sorted_words vector would be a perfect target for some magrittr magic.
- Norvig’s version keeps the word counts in the NWORDS variable in order to be able to extract the most probable matching word. This is not necessary in the R code: as we already have a sorted vector, we know that the first item will always be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).

Christmas is soon upon us and here are some gift ideas for your statistically inclined friends (or perhaps for you to put on your own wish list). If you have other suggestions please leave a comment! :)
A recently released game where probability takes the main role is Pairs, an easy going press-your-luck game that can be played in 10 minutes. It uses a custom “triangular” deck of cards (1x1, 2x2, 3x3, …, 10x10) and is a lot of fun to play, highly recommended!
Another good gift would be a pound of assorted dice together with the seminal Dice Games Properly Explained by Reiner Knizia. While perhaps not a game, a cool gift for someone who already has a pound of dice would be a set of non-transitive Grime Dice.
A search for statistics and mugs or statistics and t-shirts results in a lot of good gifts, for example this t-test mug:
You could also support your favorite MCMC software by buying a Stan-themed mug from their shop, or why not come up with a custom layout yourself? (I’ve used Vistaprint before and those mugs turned out decent and cheap.)
R, Python and Julia are great tools that are perhaps becoming a bit too mainstream for the self-conscious data science hipster. Why not then give the joy of some retro calculation? Slide rules are amazingly cool and, while I don’t know if new ones are still made, they can be gotten cheap on eBay. The same goes for vintage pocket calculators (make sure to get one where the digits are in bright green or red). There is also the ’50s book with the self-describing title A Million Random Digits. (Don’t miss the hilarious reviews on Amazon!)
The XKCD web comic by Randall Munroe often touches upon statistical issues, and while his recent book What If?: Serious Scientific Answers to Absurd Hypothetical Questions is not statistical per se, it contains heaps of amusing back-of-the-envelope calculations. You can also get signed prints of some of the comics, for example of #231 “Cat Proximity”:
I love comic books teaching statistics (which I’ve written about earlier) and my two favorites are The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith and The Manga Guide to Statistics by Shin Takahashi. Both are great in their own ways and are enjoyable both if you are a statistics padawan or already a master of the dark arts.
NausicaaDistribution sells cool stuff such as a Standard Normal Distribution Plushie, an Evil Cauchy Distribution Plushie and a lot more distributions of different shapes and alignments.
You can also buy a My First Number Sets Wood Puzzle for the budding number theoretician or Famous Statistician Embroidered Coasters to save the sofa table from the eggnog.
Here are some good popular science books that deal with different aspects of statistics and that anybody can enjoy:
Dataclysm: Who We Are (When We Think No One’s Looking) by Christian Rudder. I have not actually read this (hint hint, siblings…) but it’s bound to be good as it is written by the guy behind the OkCupid blog.
The Theory That Would Not Die by Sharon Bertsch McGrayne. The history of Bayes’ theorem and Bayesian statistics; it contains almost no math but is fun and engaging anyway.
The Lady Tasting Tea by David Salsburg gives a more “classical” perspective on the history of statistics.
The Signal and the Noise by Nate Silver. If you know someone who, against all odds, hasn’t already read this book, then it is a great way to get that someone interested in statistics and data analysis.
Here are some slightly more serious books that I have enjoyed:
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan by John Kruschke. The book that got me started with Bayesian data analysis, a pedagogical masterpiece that recently received a second edition.
Advanced R by Hadley Wickham. A great guide to serious R programming which is also freely available online (but is then slightly more difficult to gift-wrap…).
The Visual Display of Quantitative Information by Edward R. Tufte. This classic makes a great gift, not least because of its almost coffee table book like properties.
Making a slight digression from last month’s Probable Points and Credible Intervals, here is how to summarize a 2D posterior density using a highest density ellipse. This is a straightforward extension of the highest density interval to the situation where you have a two-dimensional posterior (say, represented as a two-column matrix of samples) and you want to visualize which region, containing a given proportion of the probability, has the most probable parameter combinations. So let’s first have a look at a fictional 2D posterior by using a simple scatter plot:
plot(samples)
Whoa… that’s some serious over-plotting and it’s hard to see what’s going on. Sure, the bulk of the posterior is somewhere in that black hole, but where exactly and how much of it?
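The samples matrix itself is never shown in the post; to make the snippets reproducible, here is an assumed stand-in with a similar correlated, elliptical shape (all parameters are made up):

```r
# Assumed stand-in for the post's fictional 2D posterior:
# 5000 correlated bivariate normal draws (parameters invented).
library(MASS)  # for mvrnorm
set.seed(42)
samples <- mvrnorm(n = 5000, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))
plot(samples)  # reproduces the over-plotted scatter described above
```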
A highest posterior density ellipse shows this by covering the area that contains the most probable parameter combinations while containing p% of the posterior probability. Just as finding the highest density interval corresponds to finding the shortest interval containing p% of the probability, finding the highest density ellipse corresponds to finding the smallest ellipse containing p% of the probability, a.k.a. the minimum volume ellipse. I have spent a lot of time trying to figure out how to compute minimum volume ellipses. Wasted time, it turns out, as it can easily be computed using packages that come with R; you just have to know what you are looking for. If you just want the code, skip over the next paragraph; if you want to know the tiny bit of detective work I had to do to figure this out, read on.
To find the points in samples that are included in a minimum volume ellipse covering, say, 75% of the samples you can use cov.mve(samples, quantile.used = nrow(samples) * 0.75) from the MASS package, where quantile.used specifies the number of points in samples that should be inside the ellipse. It uses an approximation algorithm described by Van Aelst and Rousseeuw (2009) that is not guaranteed to find the minimum volume ellipse but will often get pretty close. A complication is that cov.mve does not return the actual ellipse; it returns a robustly estimated covariance matrix, which is not really what we are after. It does, however, return an object that contains the indices of the points covered by the minimum volume ellipse: if fit is the object returned by cov.mve, then these points can be extracted like this: points_in_ellipse <- samples[fit$best, ]. To find the ellipse we then use ellipsoidhull from the cluster package on points_in_ellipse. It returns an object representing the minimum volume ellipse, and by using its predict method we get a two-column matrix with points that lie on the hull of the ellipse, which we can finally plot.
That wasn’t too easy to figure out, but it’s pretty easy to do. The code below plots a 75% minimum volume / highest density ellipse:
library(MASS)
library(cluster)
# Finding the 75% highest density / minimum volume ellipse
fit <- cov.mve(samples, quantile.used = nrow(samples) * 0.75)
points_in_ellipse <- samples[fit$best, ]
ellipse_boundary <- predict(ellipsoidhull(points_in_ellipse))
# Plotting it
plot(samples, col = rgb(0, 0, 0, alpha = 0.2))
lines(ellipse_boundary, col="lightgreen", lwd=3)
legend("topleft", "75%", col = "lightgreen", lty = 1, lwd = 3)
Looking at this new plot we see that, for the bulk of the probability mass, the parameters are correlated. This correlation was not really visible in the naive scatter plot. If you rerun this code many times you will notice that the ellipse changes position slightly each time. This is due to cov.mve using a non-exact algorithm. If you have a couple of seconds to spare you can make cov.mve more exact by setting the parameter nsamp to a large number, say nsamp = 10000.
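For example, reusing the 75% coverage from above (the samples here are an assumed stand-in, since the post’s matrix isn’t shown):

```r
library(MASS)  # cov.mve and mvrnorm

# Assumed stand-in data (invented for illustration).
set.seed(1)
samples <- mvrnorm(n = 500, mu = c(0, 0), Sigma = diag(2))

# nsamp = 10000 tries many more candidate subsets, making the
# approximated minimum volume ellipse more stable between runs.
fit <- cov.mve(samples, quantile.used = nrow(samples) * 0.75, nsamp = 10000)
```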
You are, of course, not limited to drawing just outlines; if you want to draw shaded ellipses you can use the polygon function. The code below draws three shaded highest density ellipses in random colors with coverages of 95%, 75% and 50%.
plot(samples, col = rgb(0, 0, 0, alpha = 0.2))
for(coverage in c(0.95, 0.75, 0.5)) {
fit <- cov.mve(samples, quantile.used = nrow(samples) * coverage)
ellipse_boundary <- predict(ellipsoidhull(samples[fit$best, ]))
polygon(ellipse_boundary, col = sample(colors(), 1), border = NA)
}
Looks like modern aRt to me!
The function below adds a highest density ellipse to an existing plot created using base graphics:
# Adds a highest density ellipse to an existing plot
# xy: A matrix or data frame with two columns.
# If you have two variables just cbind(x, y) them.
# coverage: The percentage of points the ellipse should cover
# border: The color of the border of the ellipse, NA = no border
# fill: The filling color of the ellipse, NA = no fill
# ... : Passed on to the polygon() function
add_hd_ellipse <- function(xy, coverage, border = "blue", fill = NA, ...) {
library(MASS)
library(cluster)
fit <- cov.mve(xy, quantile.used = round(nrow(xy) * coverage))
points_in_ellipse <- xy[fit$best, ]
ellipse_boundary <- predict(ellipsoidhull(points_in_ellipse))
polygon(ellipse_boundary, border=border, col = fill, ...)
}
So to replicate the above plot with the 75% highest density ellipse you could now write:
plot(samples)
add_hd_ellipse(samples, coverage = 0.75, border = "lightgreen", lwd=3)
Obviously, a highest density ellipse is only going to work well if the posterior is roughly elliptical. If this is not the case, an alternative is to use a 2D kernel density estimator on the samples and trace out the coverage boundaries. The function HPDregionplot in the emdbook package does exactly this:
library(emdbook)
plot(samples, col=rgb(0, 0, 0, alpha = 0.2))
HPDregionplot(samples, prob = c(0.95, 0.75, 0.5), col=c("salmon", "lightblue", "lightgreen"), lwd=3, add=TRUE)
legend("topleft", legend = c("95%", "75%", "50%"), col = c("salmon", "lightblue", "lightgreen"), lty=c(1,1,1), lwd=c(3,3,3))
You could also plot a 2D histogram of the samples, for example using the hexagon plot in ggplot2:
library(ggplot2)
qplot(samples[,1], samples[,2], geom = c("hex"))
However you would have to work a bit with the color scheme if you wanted the colors to correspond to a given coverage.
Finally, if you plot a 2D density it could also be useful to add marginal density plots, as is done in the default plot for the Bayesian First Aid alternative to the correlation test. Here with completely fictional data on the number of shotguns and the number of zombie attacks per state in the U.S.:
library(BayesianFirstAid)
fit <- bayes.cor.test(no_zombie_attacks, no_shotguns_per_1000_persons)
plot(fit)
Van Aelst, S. and Rousseeuw, P. (2009). Minimum volume ellipsoid. Wiley Interdisciplinary Reviews: Computational Statistics, 1: 71–82. doi: 10.1002/wics.19, link to the paper (unfortunately behind a paywall).
Why R? Because S!
R is the open source implementation (and a pun!) of S, a language for statistical computing that was developed at Bell Labs in the late 1970s. After that, the implementation of S underwent a number of major revisions documented in a series of seminal books, often just referred to by the color of their covers: the Brown Book, the Blue Book, the White Book and the Green Book. To satisfy my techno-historical lusts I recently acquired all these books and I thought I would share some tidbits from them, highlighting how S (and thus R) developed into what we today love and cherish. But first, here are the books in chronological order from left to right:
Most of these are out of print, but all can be bought second hand on, for example, Amazon (which is where I got them and where the links above lead).
by Richard A. Becker and John M. Chambers
This book from 1984 describes not the first version of S, but the second (S2), according to the versioning used here by Chambers. It describes a language that is very similar to modern R (but also very different). We recognize friends like c …

… and plot:

But note that plot was only for scatter plots and was not a generic function producing different types of plots as in modern R. This is because S didn’t yet have objects and classes. S had, however, state-of-the-art graphing capabilities from the start, implementing the plot types described in Graphical Methods for Data Analysis (1983) (also written by John M. Chambers and which I’ve written about here). For example, the very useful pairs function was already there:
While many things were similar to modern R, not everything was. For one thing, you could not define your own functions! Instead you would have to rely on macros:
Here ?T in the macro is another macro, producing a temporary variable name in order not to clash with any global variable name. Crazy!
We also find answers to why some of the peculiarities of modern R exist. Have you ever wondered why many function and parameter names in R are period.separated rather than underscore_separated? Well, because in S2 the underscore was an alias for <-!
On to a surprise finding… RStudio is doing great things, and for a while it has been possible to make slides using R Markdown in RStudio. Is this great? Sure! Is it new? Nope… :) Slide construction was already easy to do in S anno 1984 using the vu function. This function took a string written in a special markup language…
… and produced slides on the graphic device, such as this:
Unfortunately vu didn’t make it all the way to modern R.
I don’t want to brag, but I’m gonna do it anyway: I recently got my copy of the “Brown book” signed by John Chambers himself at the UseR 2014 conference! :D
S: An Interactive Environment for Data Analysis and Graphics on Amazon
by Richard A. Becker and John M. Chambers
This book is not part of the color book canon, but I’ll include it for completeness anyway. Published the year after the Brown book, it describes how to implement new functions in S. However, as S only had support for macros, these functions would have to be written in another language (say FORTRAN) and then connected to S using a special interface language:
While not relevant to modern R, this interface language is the “ancestor” of modern day interfaces such as Rcpp and Rcpp11.
Extending The S System on Amazon
by Richard A. Becker, John M. Chambers and Allan R. Wilks
This book introduces S version three (S3) which was a major revision of S2. While S2 was primarily programmed in FORTRAN, S3 was mainly done in C. The interface language was now gone and instead C functions could be directly invoked from S functions. But what’s more, users could now easily define functions themselves!
Functions were also first-class citizens and could be passed around, thus enabling the modern apply type functions:
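In modern R this is, of course, everyday practice; a trivial present-day example (not from the book):

```r
# A function passed as a value to an apply-type function,
# the pattern that S3 made possible.
sapply(list(a = 1:3, b = 4:6), mean)
# returns c(a = 2, b = 5)
```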
Computation on the language was also now possible, for example by using substitute. Some things were still different from modern-day R; take a look at the following statement:

Why lottery.number and lottery.payoff instead of lottery$number and lottery$payoff? Because data frames didn’t yet exist! (Though it would still have been possible to stick two vectors inside a list.)
The New S Language: A Programming Environment for Data Analysis and Graphics on Amazon
edited by John M. Chambers and Trevor J. Hastie
This book “completes” the specification of S3 with three biggies: (1) data frames, (2) formulas…
… and (3) object orientation:
While the earlier books are more focused on graphics and programming, this book is all about statistical models (the title of the book might be a hint). Here we get introduced to workhorses like glm, gam, nls, tree and, not to forget, lm:
There is, however, no mention of the classical *.test functions such as t.test, binom.test and cor.test (does anybody know when they appeared in S/R?). The focus is also more on prediction and estimation than on testing; for example, p-values were not reported as part of summary.lm (which they are in modern R):
Other things that are new are ?, which can now be used to look up help pages, and a new datatype called factor. And already from the start read.table converted all strings to factors by default. :) All in all, this book was interesting to read and is still, I believe, a very good introduction to the formula interface and the lm/glm/gam type functions.
Statistical Models in S on Amazon
by John M. Chambers
This book describes S version four and focuses almost exclusively on programming and not so much on stats and graphics. A big change from S3 was the introduction of a new, more formal, system for object oriented programming:
Other than that there weren’t any eye-catching differences from S version 3. One small thing to note is that = could now be used for assignment instead of <-, and it is actually used consistently throughout the book:
Programming with Data: A Guide to the S Language on Amazon
That was all I had. If you are further interested in the history of S and R I also recommend A brief history of S (Becker, 1994), Stages in the Evolution of S (Chambers, 2000) and R: Past and future history (Ihaka, 1998).
All images and quotes included in this review are copyrighted by their respective copyright holders; however, I believe that the inclusion of these quotes and images in this review constitutes fair use.