Everybody loves speed comparisons! Is R faster than Python? Is dplyr faster than data.table? Is Stan faster than JAGS? It has been said that speed comparisons are utterly meaningless, and in general I agree, especially when you are comparing apples and oranges, which is what I’m going to do here. I’m going to compare a couple of alternatives to lm() that can be used to run linear regressions in R, but that are more general than lm(). One reason for doing this was to see how much performance you’d lose if you used one of these tools to run a linear regression (even if you could have used lm()). But as speed comparisons are utterly meaningless, my main reason for blogging about this is just to highlight a couple of tools you can use when you’ve grown out of lm(). The speed comparison was just to lure you in. Let’s run!
Below are the seven different methods that I’m going to compare by using each method to run the same linear regression. If you are just interested in the speed comparisons, just scroll to the bottom of the post. And if you are actually interested in running standard linear regressions as fast as possible in R, then Dirk Eddelbuettel has a nice post that covers just that.
lm()
This is the baseline, the “default” method for running linear regressions in R. If we have a data.frame d with the following layout:
head(d)
## y x1 x2
## 1 -64.579 -1.8088 -1.9685
## 2 -19.907 -1.3988 -0.2482
## 3 -4.971 0.8366 -0.5930
## 4 19.425 1.3621 0.4180
## 5 -1.124 -0.7355 0.4770
## 6 -12.123 -0.9050 -0.1259
Then this would run a linear regression with y as the outcome variable and x1 and x2 as the predictors:
lm(y ~ 1 + x1 + x2, data=d)
##
## Call:
## lm(formula = y ~ 1 + x1 + x2, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
glm()
This is a generalization of lm() that allows you to assume a number of different distributions for the outcome variable, not just the normal distribution you are stuck with when using lm(). However, if you don’t specify any distribution, glm() will default to a normal distribution and will produce output identical to lm():
glm(y ~ 1 + x1 + x2, data=d)
##
## Call: glm(formula = y ~ 1 + x1 + x2, data = d)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
##
## Degrees of Freedom: 29 Total (i.e. Null); 27 Residual
## Null Deviance: 13200
## Residual Deviance: 241 AIC: 156
bayesglm()
Found in the arm package, this is a modification of glm() that allows you to assume custom prior distributions over the coefficients (instead of the implicit flat priors of glm()). This can be super useful, for example, when you have to deal with perfect separation in logistic regression or when you want to include prior information in the analysis. While there is bayes in the function name, note that bayesglm() does not give you the whole posterior distribution, only point estimates. This is how to run a linear regression with flat priors, which should give similar results to lm():
library(arm)
bayesglm(y ~ 1 + x1 + x2, data = d, prior.scale=Inf, prior.df=Inf)
##
## Call: bayesglm(formula = y ~ 1 + x1 + x2, data = d, prior.scale = Inf,
## prior.df = Inf)
##
## Coefficients:
## (Intercept) x1 x2
## -0.293 10.364 21.225
##
## Degrees of Freedom: 29 Total (i.e. Null); 30 Residual
## Null Deviance: 13200
## Residual Deviance: 241 AIC: 156
nls()
While lm() can only fit linear models, nls() can also be used to fit non-linear models by least squares. For example, you could fit a sine curve to a data set with the following call: nls(y ~ par1 + par2 * sin(par3 + par4 * x)). Notice that the syntax is a little different from lm(), as you have to write out both the variables and the parameters. Here is how to run the linear regression:
nls(y ~ intercept + x1 * beta1 + x2 * beta2, data = d)
## Nonlinear regression model
## model: y ~ intercept + x1 * beta1 + x2 * beta2
## data: d
## intercept beta1 beta2
## -0.293 10.364 21.225
## residual sum-of-squares: 241
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 3.05e-08
mle2()
In the bbmle package we find mle2(), a function for general maximum likelihood estimation. While mle2() can be used to maximize a handcrafted likelihood function, it also has a formula interface which is simple to use, yet powerful, and plays nicely with R’s built-in distributions. Here is how to roll a linear regression:
library(bbmle)
inits <- list(log_sigma = rnorm(1), intercept = rnorm(1),
              beta1 = rnorm(1), beta2 = rnorm(1))
mle2(y ~ dnorm(mean = intercept + x1 * beta1 + x2 * beta2, sd = exp(log_sigma)),
     start = inits, data = d)
##
## Call:
## mle2(minuslogl = y ~ dnorm(mean = intercept + x1 * beta1 + x2 *
## beta2, sd = exp(log_sigma)), start = inits, data = d)
##
## Coefficients:
## log_sigma intercept beta1 beta2
## 1.0414 -0.2928 10.3641 21.2248
##
## Log-likelihood: -73.81
Note that we need to explicitly initialize the parameters before the maximization and that we now also need a parameter for the standard deviation. For an even more versatile use of the formula interface for building statistical models, check out the very cool rethinking package by Richard McElreath.
optim()
Of course, if we want to be really versatile, we can craft our own log-likelihood function to be maximized using optim(), also part of base R. This gives us all the options, but there are also more things that can go wrong: we might make mistakes in the model specification, and if the search for the optimal parameters is not initialized well, the model might not converge at all! A linear regression log-likelihood could look like this:
log_like_fn <- function(par, d) {
  sigma <- exp(par[1])
  intercept <- par[2]
  beta1 <- par[3]
  beta2 <- par[4]
  mu <- intercept + d$x1 * beta1 + d$x2 * beta2
  sum(dnorm(d$y, mu, sigma, log = TRUE))
}
inits <- rnorm(4)
optim(par = inits, fn = log_like_fn, control = list(fnscale = -1), d = d)
## $par
## [1] 1.0399 -0.2964 10.3637 21.2139
##
## $value
## [1] -73.81
##
## $counts
## function gradient
## 431 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
As convergence returned 0, it hopefully worked fine (a 1 indicates non-convergence). The control = list(fnscale = -1) argument is just there to make optim() do maximum likelihood estimation rather than minimum likelihood estimation (which must surely be the worst estimation method ever).
optimizing()
Stan is a standalone program that plays well with R and that allows you to specify a model in Stan’s language, which compiles down to very efficient C++ code. Stan was originally built for doing Hamiltonian Monte Carlo, but it now also includes an optimizing() function that, like R’s optim(), allows you to do maximum likelihood estimation (or maximum a posteriori estimation, if you explicitly include priors in the model definition). Here we need to do a fair bit of work before we can fit a linear regression, but what we gain is extreme flexibility in extending this model, should we need to. We have come a long way from lm()…
library(rstan)
## Loading required package: inline
##
## Attaching package: 'inline'
##
## The following object is masked from 'package:Rcpp':
##
## registerPlugin
##
## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)
##
## Attaching package: 'rstan'
##
## The following object is masked from 'package:arm':
##
## traceplot
model_string <- "
data {
  int n;
  vector[n] y;
  vector[n] x1;
  vector[n] x2;
}
parameters {
  real intercept;
  real beta1;
  real beta2;
  real<lower=0> sigma;
}
model {
  vector[n] mu;
  mu <- intercept + x1 * beta1 + x2 * beta2;
  y ~ normal(mu, sigma);
}
"
data_list <- list(n = nrow(d), y = d$y, x1 = d$x1, x2 = d$x2)
model <- stan_model(model_code = model_string)
fit <- optimizing(model, data_list)
fit
## $par
## intercept beta1 beta2 sigma
## -0.2929 10.3642 21.2248 2.8331
##
## $value
## [1] -46.24
So, just for fun, here is the speed comparison, first for running a linear regression with 1000 data points and 5 predictors:
This should be taken with a huge heap of salt (which is not too good for your health!). While all these methods produce a result equivalent to a linear regression, they do it in different ways, and not necessarily in equally good ways; for example, my homemade optim() routine does not converge correctly when trying to fit a model with too many predictors. As I used the standard settings, there is surely a multitude of ways in which any of these methods could be made faster. Anyway, here is what happens if we vary the number of predictors and the number of data points:
To make these speed comparisons I used the microbenchmark package; the full script replicating the plots above can be found here. This speed comparison was made on my laptop running R version 3.1.2 on 32-bit Ubuntu 12.04, with an average amount of RAM and a processor that is starting to get a bit tired.
“Behind every great point estimate stands a minimized loss function.” – Me, just now
This is a continuation of Probable Points and Credible Intervals, a series of posts on Bayesian point and interval estimates. In Part 1 we looked at these estimates as graphical summaries, useful when it’s difficult to plot the whole posterior in a good way. Here I’ll instead look at points and intervals from a decision-theoretic perspective, in my opinion the conceptually cleanest way of characterizing what these constructs are.
If you don’t know that much about Bayesian decision theory, just chillax. When doing Bayesian data analysis you get it “pretty much for free,” as esteemed statistician Andrew Gelman puts it. He then adds that it’s “not quite right because it can take effort to define a reasonable utility function.” Well, perhaps not free, but it is still relatively straightforward! I will use a toy problem to illustrate how Bayesian decision theory can be used to produce point estimates and intervals. The problem is this: our favorite droid has gone missing and we desperately want to find him!
Robo went missing at 23:00 yesterday and hasn’t been seen since. We know he disappeared somewhere within a 120-mile-long strip of land, and we are going to mount a search operation. Our top scientists have been up all night analyzing the available data, and we just received the result: the probability of Robo being in different locations.
So, this is (in Bayesian lingo) a posterior distribution: the probability of different “states” after having analyzed the available data. Here the “state” is the location of Robo, and looking at the posterior above it seems like he could be in a lot of places. Most likely he is in the forest, somewhere between 75 and 120 miles from the reference point (arbitrarily set to the leftmost position on the map). He might also be hiding in the plains, either around the 15th or the 40th mile. It’s not that likely that he’s in the mountains, but we can’t dismiss it altogether.
- So, where should we start looking for Robo?
- Well, that depends…
- Depends on what?
- Your loss function.
A loss function is some method of calculating how bad a decision would be if the world is in a certain state. In our case the state is the location of Robo, the decision could be where to start looking for him, and badness could be the time it will take to find him (we want to find him fast!). If we knew the state of the world, we could find the best decision: the decision that minimizes the loss. Now, we don’t actually know that state, but if we have a Bayesian model that we believe does a good job, we can use the resulting posterior to represent our knowledge about that state. That is, we are going to plug a possible decision and a posterior distribution into our loss function, and the result will be a probability distribution over how large the loss might be. Doing this is really easy, especially if the posterior is represented as a sample of values (which is almost always the case when doing Bayesian data analysis anyway). Of course, we could skip a formal decision analysis and just look at the posterior and make a non-formalized decision. In many cases that might be the preferred course, but it’s not why we are here today.
So we call our science team up and ask them to send over the posterior represented as a large sample of positions; let’s call that list s. Here are the first 16 samples in s:
head(s, n = 16)
## [1] 15 101 41 89 14 41 83 112 33 33 94 104 88 82 77 18
As expected, these values are mostly clustered around the 15th, 40th and 90th mile. Say that our loss function is the distance from where we start searching to the location of Robo, and our decision is to start the search at the 60th mile. To get a probability distribution for the loss we simply apply the loss function to each sample in s. For the first sample the loss is abs(15 - 60) = 45, for the second it’s abs(101 - 60) = 41, and so on. Below is the resulting posterior loss, given the decision to start searching at the 60th mile:
The plot above also shows the expected loss (here the mean of the posterior distance from the 60th mile). This is a common measure of how good a decision is, and the final step in a decision analysis is to find the decision that minimizes the expected loss. And that’s it! As Gelman mentioned, the hard part is defining a reasonable loss function, but once you have done that, it’s straightforward to find the decision that minimizes the expected loss.
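The whole calculation takes just a couple of lines of R. Here it is, using a made-up posterior sample in place of the real s from the science team:

```r
# A made-up posterior sample standing in for the real s.
set.seed(42)
s <- c(rnorm(200, 15, 4), rnorm(200, 40, 5), rnorm(600, 90, 12))
s <- round(s[s >= 0 & s <= 120])

# The decision: start the search at the 60th mile.
x <- 60
# Applying the loss function to each sample gives the posterior loss...
posterior_loss <- abs(s - x)
# ...and its mean is the expected loss of this decision.
mean(posterior_loss)
```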
The rest of the post is dedicated to showing how one can define different loss functions for the “Where’s Robo?” scenario. I will start out with some simple loss functions that result in point estimates and end with some more complicated loss functions that result in interval estimates.
A Bayesian point estimate is the result of a decision analysis where you (or perhaps your computer) have found the best point/location/value given a posterior distribution and some loss function. Meanwhile, management has decided that the search party will be deployed by helicopter and that, once on the ground, it will split into two groups, one searching to the right and one searching to the left. Now we only need to decide where to start the search for Robo. Thus, we desperately need a Bayesian point estimate, and to get that we need a loss function!
When it comes to loss functions there are three usual suspects: squared loss, absolute loss and 0-1 loss (also known as L2, L1 and L0 loss). We’ve already seen absolute loss (L1): it’s when the loss is the distance between the decision and the state of the world, that is, the absolute value of the difference between the point x and the state s. In R code:
absolute_loss <- function(x, s) {
  mean(abs(x - s))
}
Due to taking the mean, this function will also return the expected absolute loss when s is a posterior sample. For our present purpose this is a pretty decent loss function. Assuming that the two search groups walk at constant speed, this function will minimize the expected time/cost it takes to find Robo.
Squared loss (L2) is another common loss function:
squared_loss <- function(x, s) {
  mean((x - s)^2)
}
Using this loss function would mean that we consider it four times as bad if it takes twice the time to find Robo (again assuming the two search groups walk at constant speed). So squared loss might not make that much sense for the present problem.
The last loss function is 0-1 loss (L0), which assigns zero loss to a correct decision and one loss to an incorrect decision. Given this loss function, the best decision is to choose the most probable state. This loss function makes sense if you are, say, defusing a bomb and need to choose between the green, blue and red wire (make the right decision = no loss of limbs, cut the wrong wire = boom!). When searching for Robo it doesn’t really make sense to say that starting the search one mile from Robo’s location is as bad as starting it on the moon. As Robo’s position is a real number, the posterior probability of him being at any specific position is practically zero. In this continuous case we can instead use the posterior probability density. If you have a sample from the posterior (s), then the density can be approximated using density(s). As the resulting density is given at discrete points, we have to use approx to interpolate the density at the decision x, and we have to negate the resulting density estimate to turn it into a loss. Here is the whole function:
zero_one_loss <- function(x, s) {
  # The value below is proportional to the 0-1 loss.
  - approx(density(s), xout = x)$y
}
So, where should we start the search operation according to these three loss functions? To figure this out we just need to determine what decision minimizes the expected loss. This can be done in more or less intelligent ways, but I went brute force and just tried all positions from the 0th to the 120th mile. Here are the resulting point estimates (with the loss functions below):
According to the absolute loss criterion (L1) we should start looking in the forest, according to the quadratic loss (L2) we should start in the mountains, and the 0-1 loss (L0) goes for the single most probable location at the 15th mile. The way I found these point estimates is very general: evaluate the loss function for all possible decisions (or a representative sample of them) and pick the decision with the smallest expected loss. However, for these specific loss functions there is a much easier way: the minimum expected absolute loss corresponds to the median of the posterior, the quadratic loss corresponds to the mean, and the 0-1 loss corresponds to the mode. That is, the same three point estimates we looked at in Part 1 of Probable Points and Credible Intervals! Why exactly this is the case is beautifully explained by John Myles White on his blog.
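The brute-force search itself is only a few lines. Here it is for the absolute loss, again with a made-up posterior sample in place of the real s, together with a check that it lands right next to the posterior median:

```r
absolute_loss <- function(x, s) {
  mean(abs(x - s))
}

# A made-up posterior sample standing in for the real s.
set.seed(42)
s <- c(rnorm(200, 15, 4), rnorm(200, 40, 5), rnorm(600, 90, 12))

# Try every candidate decision from mile 0 to mile 120...
candidates <- 0:120
expected_losses <- sapply(candidates, absolute_loss, s = s)
# ...and pick the one with the smallest expected loss.
best_point <- candidates[which.min(expected_losses)]
# It should land right next to the posterior median.
c(best_point = best_point, posterior_median = median(s))
```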
Note that the three loss functions above result in widely different decisions; it’s a big difference between landing the search team in the forest and in the windy mountains, and it’s a bit strange that the loss functions don’t consider aspects of the problem such as the terrain. Going forward I will explore a couple of loss functions more suited to the “Where’s Robo?” scenario. This is not because these loss functions are especially useful and widely applicable, but rather because I want to show how easy it is to define new loss functions when doing Bayesian decision analysis.
We got a call from management: they have decided that instead of sending a search team, we are going to do a satellite scan that is guaranteed to find any robot within a radius of 30 miles, and now they want to know where to target it. This calls for another loss function! As with 0-1 loss we want to minimize the expectation of not finding Robo, but now within a certain radius around the decision point x. In R:
limited_dist_loss <- function(x, s, max_dist) {
  mean(abs(x - s) > max_dist)
}
This code calculates the expectation of Robo being outside the max_dist radius around x, where max_dist should be set to 30 in our case. Using this loss function with our posterior s gives us the following graph:
So, we should center the scan on the 89th mile, which will scan the forest and part of the mountains, and which will result in a 1.0 - 0.4 = 0.6 probability of finding Robo.
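Finding the best scan center is just another brute-force pass over the expected loss (as before, a made-up posterior sample stands in for the real s):

```r
limited_dist_loss <- function(x, s, max_dist) {
  mean(abs(x - s) > max_dist)
}

# A made-up posterior sample standing in for the real s.
set.seed(42)
s <- c(rnorm(200, 15, 4), rnorm(200, 40, 5), rnorm(600, 90, 12))

# Expected loss of centering the 30-mile-radius scan at each mile...
candidates <- 0:120
expected_losses <- sapply(candidates, limited_dist_loss, s = s, max_dist = 30)
# ...the best center minimizes the probability of missing Robo.
best_center <- candidates[which.min(expected_losses)]
c(best_center = best_center, p_finding_robo = 1 - min(expected_losses))
```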
We got new info from management: the droid carries some space station plans critical to the Empire, something something. Anyway, we need to find Robo fast, within 24 hours! Again we are going to deploy a search team that will split into two groups, but this time we need to consider how long it will take the teams to find him. It takes different amounts of time to search different types of terrain: a mile of plains takes one hour, a mile of forest takes five hours and a mile of mountains takes ten. The list cover_time encodes the time it takes to search each mile; here we have the 48th to the 54th mile:
cover_time[48:54]
## plain plain plain mountain mountain mountain mountain
## 1 1 1 10 10 10 10
The following loss function calculates the expectation of not finding Robo within max_time hours by calculating the time to Robo’s location from the starting point x and then taking the expectation of this time being longer than max_time:
limited_time_loss <- function(x, s, max_time) {
  time_to_robo <- sapply(s, function(robo_pos) { sum(cover_time[x:robo_pos]) })
  mean(time_to_robo > max_time)
}
This is how the expected loss looks for different starting points with max_time set to 24:
Our best bet is to start at the 27th mile which means we will cover the whole plains area within 24 hours. If we instead had to find Robo within 72 hours it would be better to start at the 90th mile, as we now would have time to search the forest region:
A Bayesian interval estimate is the result of a decision analysis where you have found the best interval given a posterior distribution and some loss function. Deciding what interval or region to search through is perhaps a more natural decision when looking for Robo than deciding where to land a search team. In Part 1 we looked at some different types of intervals; one was the highest density interval (HDI), defined as the shortest interval that contains a given percentage of the posterior probability. The HDI can also be defined as the interval that minimizes an expected loss (the specific loss function is derived here, but it is a tiny bit complicated). Say that we want to find Robo with 90% probability while having to search through the smallest region. The best decision would then be the following 90% HDI:
A strange thing with this type of interval is that it limits the probability of finding Robo. Surely we would like to find Robo with the highest probability possible; what’s limiting us from finding Robo with 100% probability should be time, effort or cost.
Let’s define an interval that is limited by cost instead. Management has decided that the search operation may cost \$1000. One hour of searching costs \$100, and using our knowledge of how long it takes to search a mile of each type of terrain, we can calculate the corresponding search cost:
search_cost <- cover_time * 100
search_cost[48:54]
## plain plain plain mountain mountain mountain mountain
## 100 100 100 1000 1000 1000 1000
So searching through the 48th to the 54th mile would cost \$4300, a bit over budget. What we want to find is the interval with the highest expectation of finding Robo that costs no more than \$1000 to search through. The loss function is basically just a variation of limited_time_loss, but with two parameters: the lower and the upper endpoints of the interval. To find the best interval I, again, just try all combinations of upper and lower endpoints and pick out the interval with the lowest loss (highest expectation of finding Robo) that costs \$1000 or less to search through:
So, if we just have \$1000 to spend we should go for the easy option and just search through the high probability region on the plains. What if we had \$3000 to spend?
Then we should search through almost the whole plains region. What if we had \$20,000?
Then we should go for the forest (and still stay away from the mountains, they don’t make good fiscal sense).
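A sketch of that brute-force interval search, assuming a hypothetical terrain layout (plains up to mile 50, then mountains, then forest from mile 75) and a made-up posterior sample in place of the real s and search_cost:

```r
# A hypothetical terrain layout consistent with the story:
# plains to mile 50, mountains to mile 74, forest to mile 120.
terrain <- rep(c("plain", "mountain", "forest"), times = c(50, 24, 46))
cover_time <- c(plain = 1, forest = 5, mountain = 10)[terrain]
search_cost <- cover_time * 100

# A made-up posterior sample standing in for the real s.
set.seed(42)
s <- c(rnorm(200, 15, 4), rnorm(200, 40, 5), rnorm(600, 90, 12))
s <- pmin(pmax(round(s), 1), 120)

budget <- 1000
best <- c(lower = NA, upper = NA, p_find = -1)
# Try all intervals and keep the affordable one with the highest
# probability of containing Robo.
for(lower in 1:120) {
  for(upper in lower:120) {
    cost <- sum(search_cost[lower:upper])
    # The cost only grows as the interval widens, so we can stop early.
    if(cost > budget) break
    p_find <- mean(s >= lower & s <= upper)
    if(p_find > best["p_find"]) {
      best <- c(lower = lower, upper = upper, p_find = p_find)
    }
  }
}
best
```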
Now we call up management and tell them that “it’s all well and good that you want to spend \$1000 on finding Robo, but why \$1000 exactly? And wouldn’t you want to spend as little as possible?” They tell us that, yes, they would like to spend as little as possible, and the reason for the \$1000 figure is that that’s what Robo is worth. OK, so what we want to do is mount a search that maximizes the expected profit, considering that Robo is worth \$1000 and that it costs \$100 per search hour. This calls for a utility function (something you want to maximize), which is just the opposite of a loss function (something you want to minimize). While any utility function can easily be cast into a loss function, it’s sometimes more natural to think of maximizing utility (say, in finance) than minimizing losses. Loss functions and utility functions can both be called objective functions.
This is going to be the most complicated objective function in this post. The search is going to happen like this: we start the search at one location, search in one direction, and stop if (1) we find Robo or (2) we reach the location that marks the end of the search operation. The decision we have to make is where the search starts and where it terminates (in case we don’t find Robo). What we want to calculate is the expected profit given such a decision. The following function takes a start, an end and a robo_value, calculates the profit for each sample from Robo’s posterior position s and returns the expected profit:
expected_profit <- function(start, end, robo_value, s) {
  posterior_profit <- sapply(s, function(robo_pos) {
    if(robo_pos >= start & robo_pos <= end) {
      # That is, we find Robo and terminate the search at robo_pos.
      covered_ground <- start:robo_pos
      robo_value - sum(search_cost[covered_ground])
    } else {
      # That is, we won't find Robo and terminate the search at end instead.
      covered_ground <- start:end
      - sum(search_cost[covered_ground])
    }
  })
  mean(posterior_profit)
}
If we evaluate this utility function for (a representative sample of) all possible values of start and end, with robo_value set to 1000, the maximum expected profit decision is this interval:
Huh? That doesn’t look like an interval… But it is: you’re looking at an empty interval. The best decision is to not search for Robo at all (with an expected profit of \$0); any search operation would result in an expected negative profit (that is, a loss). Good to know! What if Robo were worth more money? Say, \$10,000?
Then we should search a small part of the plains, starting at the 12th mile, for an expected profit of \$835. What if Robo were worth \$20,000?
Then we should search most of the plains, still starting from the left, for an expected profit of \$3733. And what if Robo were really valuable?
Then we want to search the whole strip of land for an expected profit of \$16,590. Notice that we should always start at the left; this is because it is relatively cheap to search the plains, and our profit will be much higher if we find Robo before we spend too much on the search operation.
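To show how the maximum expected profit decision can be found, here is a sketch of the grid search, again assuming a hypothetical terrain layout and a made-up posterior sample in place of the real s and search_cost (and a coarse two-mile grid to keep it quick):

```r
# Hypothetical terrain layout: plains to mile 50, mountains to 74, forest to 120.
terrain <- rep(c("plain", "mountain", "forest"), times = c(50, 24, 46))
search_cost <- c(plain = 1, forest = 5, mountain = 10)[terrain] * 100

# A made-up posterior sample standing in for the real s.
set.seed(42)
s <- c(rnorm(60, 15, 4), rnorm(60, 40, 5), rnorm(180, 90, 12))
s <- pmin(pmax(round(s), 1), 120)

expected_profit <- function(start, end, robo_value, s) {
  posterior_profit <- sapply(s, function(robo_pos) {
    if(robo_pos >= start & robo_pos <= end) {
      robo_value - sum(search_cost[start:robo_pos])
    } else {
      - sum(search_cost[start:end])
    }
  })
  mean(posterior_profit)
}

# Evaluate all start/end pairs on a coarse grid; the do-nothing
# alternative has an expected profit of $0 as the baseline.
grid <- expand.grid(start = seq(1, 120, by = 2), end = seq(1, 120, by = 2))
grid <- grid[grid$start <= grid$end, ]
grid$profit <- mapply(expected_profit, grid$start, grid$end,
                      MoreArgs = list(robo_value = 10000, s = s))
best <- grid[which.max(grid$profit), ]
best
```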
So this was just a sample of possible loss/utility functions for a simple toy problem, some better and some worse. I stuck with point decisions and interval decisions, but there is no reason why you should be limited to single intervals. Perhaps you want to launch several search parties, or perhaps you want to update the decision as you get new information. Constructing reasonable loss functions for real-world problems can be very challenging, but the point is that Bayesian decision analysis still works in the same way as outlined in this post: (1) get a posterior, (2) define a loss function, (3) find the decision that minimizes the expected loss. My toy example featured only a one-dimensional posterior, but the procedure would be no different with a multi-dimensional posterior (except for the added difficulty of optimizing a high-dimensional loss function).
Update: João Pedro at Faculdade de Ciências da Universidade de Lisboa has reimplemented the Robo scenario, but in 2D! Check it out, it’s really nice!
Another point I want to get across is that loss/utility functions are models of what’s bad/good, and as such they can be made arbitrarily complex and are in some sense never “true,” in the same way as a statistical model of some process is never the “true” model. Or as Hennig and Kutlukaya (2007) put it:
“There is no objectively best loss function, because the loss function defines what ‘good’ means.”
By the way, did you find Robo?
Hennig, C., & Kutlukaya, M. (2007). Some thoughts about the design of loss functions. REVSTAT–Statistical Journal, 5(1), 19-39. pdf
White, J. M. (2013). Modes, Medians and Means: A Unifying Perspective. link
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. Springer. Amazon link
Chapter 9 in Gelman et al (2013). Bayesian data analysis. CRC press. Amazon link
Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[ adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
While not working exactly like Norvig’s version, it should result in similar spelling corrections:
correct("piese")
## [1] "piece"
correct("ov")
## [1] "of"
correct("cakke")
## [1] "cake"
So let’s deobfuscate the two-liner slightly (however, the code below might not make sense if you don’t read Norvig’s essay first):
# Read in big.txt, a 6.5 mb collection of different English texts.
raw_text <- paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of different type of words.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common type of words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))
correct <- function(word) {
# Calculate the edit distance between the word and all other words in sorted_words.
edit_dist <- adist(word, sorted_words)
# Calculate the minimum edit distance to find a word that exists in big.txt
# with a limit of two edits.
min_edit_dist <- min(edit_dist, 2)
# Generate a vector with all words with this minimum edit distance.
# Since sorted_words is ordered from most common to least common, the resulting
# vector will have the most common / probable match first.
proposals_by_prob <- sorted_words[edit_dist <= min_edit_dist]
# In case proposals_by_prob would be empty we append the word to be corrected...
proposals_by_prob <- c(proposals_by_prob, word)
# ... and return the first / most probable word in the vector.
proposals_by_prob[1]
}
Some thoughts:
- Most of the heavy lifting is done by the adist function. (A one-line spell checker in R is indeed possible using the aspell function :)
- The deeply nested function calls used to build the sorted_words vector would be a perfect target for some magrittr magic.
- Norvig’s Python version keeps track of word counts in the NWORDS variable in order to be able to extract the most probable matching word. This is not necessary in the R code: as we already have a sorted vector, we know that the first item will always be the most probable. Still, I believe the two approaches result in the same spelling corrections (but prove me wrong :).
- It could have been worse; in Java, just declaring the candidate map takes a line like HashMap<Integer, String> candidates = new HashMap<Integer, String>();.
.Christmas is soon upon us and here are some gift ideas for your statistically inclined friends (or perhaps for you to put on your own wish list). If you have other suggestions please leave a comment! :)
A recently released game where probability takes the main role is Pairs, an easy going press-your-luck game that can be played in 10 minutes. It uses a custom “triangular” deck of cards (1x1, 2x2, 3x3, …, 10x10) and is a lot of fun to play, highly recommended!
Another good gift would be a pound of assorted dice together with the seminal Dice Games Properly Explained by Reiner Knizia. While perhaps not a game, a cool gift for someone who already has a pound of dice would be a set of non-transitive Grime dice.
A search for statistics and mugs or statistics and t-shirts results in a lot of good gifts, for example this t-test mug:
You could also support your favorite MCMC software by buying a STAN themed mug from their shop or why not come up with a custom layout yourself? (I’ve used Vistaprint before and those mugs turned out decent and cheap.)
R, Python and Julia are great tools that are perhaps becoming a bit too mainstream for the self-conscious data science hipster. Why not then give the joy of some retro calculation? Slide rules are amazingly cool, and while I don’t know if new ones are still made, they can be had cheap on eBay. The same goes for vintage pocket calculators (make sure to get one where the digits are in bright green or red). There is also the 50s book with the self-describing title A Million Random Digits. (Don’t miss the hilarious reviews on Amazon!)
The XKCD web comic by Randall Munroe often touches upon statistical issues, and while his recent book What If?: Serious Scientific Answers to Absurd Hypothetical Questions is not statistical per se, it contains heaps of amusing back-of-the-envelope calculations. You can also get signed prints of some of the comics, for example #231, “Cat Proximity”:
I love comic books teaching statistics (which I’ve written about earlier) and my two favorites are The Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith and The Manga Guide to Statistics by Shin Takahashi. Both are great in their own ways and are enjoyable whether you are a statistics padawan or already a master of the dark arts.
NausicaaDistribution sells cool stuff such as a Standard Normal Distribution Plushie, an Evil Cauchy Distribution Plushie and a lot more distributions of different shapes and alignments.
You can also buy a My First Number Sets Wood Puzzle for the budding number theoretician or Famous Statistician Embroidered Coasters to save the sofa table from the eggnog.
Here are some good popular science books that deal with different aspects of statistics and that anybody can enjoy:
Dataclysm: Who We Are (When We Think No One’s Looking) by Christian Rudder. I have not actually read this (hint hint, siblings…) but it’s bound to be good, as it is written by the guy behind the OkCupid blog.
The Theory That Would Not Die by Sharon Bertsch McGrayne. The history of Bayes theorem and Bayesian statistics, contains almost no math but is fun and engaging anyway.
The Lady Tasting Tea by David Salsburg gives a more “classical” perspective on the history of statistics.
The Signal and the Noise by Nate Silver. If you know someone who, against all odds, hasn’t already read this book, then it is a great way to get that someone interested in statistics and data analysis.
Here are some slightly more serious books that I have enjoyed:
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan by John Kruschke. The book that got me started with Bayesian data analysis, a pedagogical masterpiece that recently received a second edition.
Advanced R by Hadley Wickham. A great guide to serious R programming which is also freely available online (but is then slightly more difficult to gift-wrap…).
The Visual Display of Quantitative Information by Edward R. Tufte. This classic makes a great gift, not least because of its almost coffee table book like properties.