Photo by Tarik Haiga on Unsplash

Day 67-81: Dragon taming, Rcpp edition

Okay it’s been just over two weeks since I last wrote one of these posts, but for once it’s not my fault - I was asked to fill in at short notice as a keynote speaker for the useR! conference in Brisbane (the real keynote speaker was unfortunately unable to make it) so I had to write slides for a talk for a lovely but still somewhat intimidating audience of data science and statistics folks 😨. Plus I still had to write my talk for the midyear ALG meeting. Plus I had to go to both conferences. Plus I had to take the kids to see a show about a book about a 104 Story Treehouse. I’m exhausted!

It’s nice to be back though. Both conferences I went to were wonderful and I learned a lot. In fact, after reflecting on both conferences and how lovely the people were,[1] I feel brave enough to face one of my deepest fears and learn how to write compiled C++ code for R using the Rcpp package.

But why though?

Given that I spent a good chunk of my useR! talk arguing that much of the reticence that social scientists, developmental psychologists and even cognitive scientists (who you might expect to be more excited) feel about learning R is inherently rational (and that it’s partly on us as R users to be welcoming and speak to new users’ concerns), I suppose it would be a good idea to explain where my own reluctance to learn Rcpp comes from, and why I’m trying to push through that and learn it anyway!

Fear

The fear is probably easy enough to explain. As an undergraduate student I was forced to learn ANSI C as part of the engineering degree that I never finished, and suffice it to say we were thrown in at the deep end without anyone explaining any of the core concepts properly. Not surprisingly, if you try to write C code without anyone really teaching you about memory management, pointers etc you get memory leaks and dereferenced pointers and other such awful things all over the place and then you get told off for being terrible at a thing that no-one ever taught you. Well, duh. Anyway… if you’re ever wondering “who hurt you?” about a recovering engineering student the answer is probably a dereferenced pointer.

Motivation

The motivation to face the fear is similarly straightforward: my R code runs too slowly for some of the problems I care about. It doesn’t come up that often, to be honest. Most problems I work on are small enough that it really doesn’t matter that my R code is slow. Other problems are standard enough that I can rely on other people’s compiled code to do all the work (e.g. linear models), or can be moved into a language (e.g. JAGS, Stan) that feels familiar to me as an R user so I find it less scary. But not everything falls into those categories.

One such issue came up just last week! I’m sometimes interested in how human reinforcement learning plays out for a variety of sequential decision making tasks, and it can often be valuable to know what the optimal decision making strategy looks like for a particular problem. For instance, in the “vanishing bandits” task I talked about at the ALG meeting (and which made a brief cameo in my useR! talk too), the Kalman filter model I used isn’t actually optimal for the task, though it does perform rather well. To work out the actual optimal behaviour I need to solve a partially observable Markov decision process (POMDP) learning problem. POMDPs are notoriously difficult beasts that require fancypants dynamic programming techniques even for the simple problems I’m looking at, and the inference is slow. In this situation, I start genuinely caring about how fast my code runs.

Courage

I think it’s also worth saying something about working up courage. Some of it has come from watching people at an earlier stage of their own process doing the same thing. My UNSW colleague Jenny Richmond has been kind of inspiring in this respect over the last few months, and so many of the other people I met at the R-Ladies Sydney launch were similarly magnificent in talking about their own learning history and process. The useR! conference was also amazing that way - I got to meet a lot of the people I’ve only ever seen through twitter, some of whom are so much more accomplished at R programming than I will ever be, and without fail every one was really nice and super encouraging.

So, I am encouraged. Courage has been provided to take on the dragon!

Opportunity

Finally, it’s not enough to have the courage to face your fears – you also need some weapons to fight them! One thing I was so pleased to see at useR! is that the R Consortium filmed the presentations and is in the process of making them all available on YouTube. Because I wasn’t able to attend the tutorials (I was at ALG for that!), I was especially happy to discover that they’ve been professionally recorded and made freely available. Best of all, one of those tutorials is by Dirk Eddelbuettel talking about Rcpp! Yay!

Dirk has very helpfully posted the slides for the tutorial on his homepage, and there is a wealth of resources on the Rcpp gallery site too. In the process of writing this I also found Rcpp for everyone by Masaki Tsuda, which is super useful.

Okay! I can do this!

The tutorial!

The first half of the tutorial goes for about 90 mins, and comes in three parts. The first half hour covers some of the history and context for Rcpp - there’s no coding in this part, but I found it helpful to have a bit more of a sense of how everything fits together! The last 20 mins or so cover some particulars about C++, which I found really useful to listen to, but for the most part are things I’ve decided not to worry about right now while getting started. For now, it’s really the 40 mins or so in the middle that I’m going to try out!

Evaluating an expression in C++

These days Macs don’t always ship with a C++ compiler, but since I’ve been playing around with Stan etc my machine is already set up with one. After installing and loading the Rcpp package, all I have to do is this:

library(Rcpp)
evalCpp("12 + 5")
## [1] 17

It takes a moment to compile, but then it produces the answer I’m looking for 🎉 - awesome!

As the tutorial points out, the string is being compiled as C++ code, and the compiled code is doing the work. To see what’s going on we can set verbose = TRUE and then repeat the exercise. To satisfy my own curiosity, I’ll ignore the tutorial instructions and do this using the exact same expression…

evalCpp("12 + 5", verbose = TRUE)
## 
## Generated code for function definition: 
## --------------------------------------------------------
## 
## #include <Rcpp.h>
## 
## using namespace Rcpp;
## 
## // [[Rcpp::export]]
## SEXP get_value(){ return wrap( 12 + 5 ) ; }
## 
## No rebuild required (use rebuild = TRUE to force a rebuild)
## [1] 17

This output originally didn’t make a lot of sense to me, but I’m getting more of a feel for it now, and in the process of writing up my notes here this all seems straightforward, so yay!

In any case, it’s worth noting that this is what Dirk is referring to in the tutorial when he mentions that Rcpp is smart enough to realise it doesn’t need to compile the function for "12 + 5" a second time - instead it just uses the version it had cached from the first time I called evalCpp above. I suppose I could use rebuild = TRUE to force it to recompile, but perhaps the wiser course of action is to go back to the tutorial and use a different expression…

evalCpp("12 + 7", verbose = TRUE)
## 
## Generated code for function definition: 
## --------------------------------------------------------
## 
## #include <Rcpp.h>
## 
## using namespace Rcpp;
## 
## // [[Rcpp::export]]
## SEXP get_value(){ return wrap( 12 + 7 ) ; }
## 
## Generated extern "C" functions 
## --------------------------------------------------------
## 
## 
## #include <Rcpp.h>
## // get_value
## SEXP get_value();
## RcppExport SEXP sourceCpp_3_get_value() {
## BEGIN_RCPP
##     Rcpp::RObject rcpp_result_gen;
##     Rcpp::RNGScope rcpp_rngScope_gen;
##     rcpp_result_gen = Rcpp::wrap(get_value());
##     return rcpp_result_gen;
## END_RCPP
## }
## 
## Generated R functions 
## -------------------------------------------------------
## 
## `.sourceCpp_3_DLLInfo` <- dyn.load('/private/var/folders/hj/s0m3nkpj6r97yy0my36rv8pc0000gn/T/RtmpzpyjbK/sourceCpp-x86_64-apple-darwin15.6.0-0.12.17/sourcecpp_3b7464e6abd8/sourceCpp_4.so')
## 
## get_value <- Rcpp:::sourceCppFunction(function() {}, FALSE, `.sourceCpp_3_DLLInfo`, 'sourceCpp_3_get_value')
## 
## rm(`.sourceCpp_3_DLLInfo`)
## 
## Building shared library
## --------------------------------------------------------
## 
## DIR: /private/var/folders/hj/s0m3nkpj6r97yy0my36rv8pc0000gn/T/RtmpzpyjbK/sourceCpp-x86_64-apple-darwin15.6.0-0.12.17/sourcecpp_3b7464e6abd8
## 
## /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB -o 'sourceCpp_4.so'  'file3b7473728a8e.cpp'
## [1] 19

Looks about right. It’s funny though. The first time I saw this wall of text output all my horrible nightmares from undergrad came flooding back to me, but having made it to the end of the tutorial it seems less scary to me looking at it a second time. In fact, about half of it actually makes sense to me, so that’s nice!

Writing a C++ function in R

Okay, the tutorial next goes on to introduce cppFunction, which allows you to define and compile a function in C++ that can be called from within R. It seems like magic to me, but it’s absurdly easy to do. The tutorial uses an example that doubles an integer, but my usual go-to example for defining a novel function is quadruple, so I’ll go with that…

cppFunction(code = "int quadruple(int x) {return x * 4;}")

This is a little more complex than the last example, but still pretty simple. The string is interpreted as C++ code, and cppFunction uses that code to define a function called quadruple that I can call from within R. I remember just enough from my old C classes that this makes sense to me. With syntax highlighting added, this is what the function looks like:

int quadruple(int x) {
  return x * 4;
}

The key things here:

  • C++ requires that we define what type of variable the function returns, so we need to specify int quadruple
  • Similarly, the type of every variable must be defined, so we also need int x
  • Then all the function does is return x * 4 (semicolons mandatory, I believe!)

In any case the call to cppFunction creates a function called quadruple as a side-effect (it doesn’t seem to be a returned value in the sense I was expecting?) and I can call it from R:

quadruple(4)
## [1] 16

This looks no different to a regular R function, but internally it is using .Call to pass values to the compiled C++ code. You can see this by looking at the source code for quadruple:

quadruple
## function (x) 
## .Call(<pointer: 0x108dc6070>, x)

It’s also worth noting that C++ doesn’t handle the distinction between integers and doubles the same way that R does. For instance, if I try to treat 4.3 as an integer in R it assumes I’m being foolish and coerces it to numeric:

4.3L * 4
## [1] 17.2

Using Rcpp that’s not what happens. A non-integer input to my quadruple function is truncated to an integer, so when I call quadruple with numeric input I get this:

quadruple(4.3) 
## [1] 16
quadruple(4.9)
## [1] 16
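Presumably, if I wanted the usual R behaviour, the fix is just to declare the argument and return type as double instead. Here’s a quick sketch of my own (quadruple_dbl is my name for it, not something from the tutorial):

cppFunction("double quadruple_dbl(double x) {return x * 4;}")
quadruple_dbl(4.3)
## [1] 17.2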

Of course, the truncation isn’t surprising - it’s just what happens when you don’t pay attention to the type of variable you’re working with. Anyway, back to the tutorial itself…

Functions to sum a vector in C++

Exercise 2 in the tutorial asks us to write a function to sum a vector. We are first given this code snippet to work with:

int vecsum(IntegerVector v) { 
  int s = 0;         // initialise the sum, s
  int n = v.size();  // v.size() is the length of the input v
  // my code here
}

As a side note, IntegerVector is a type provided by Rcpp, and there are some useful notes on it in the quick reference guide here. After one false start in which I forgot that I had to declare the type of my indexing variable i, I ended up with the right answer. The C++ code looks like this…

int vecsum(IntegerVector v) { 
  int s = 0;         // initialise the sum, s
  int n = v.size();  // v.size() is the length of the input v
  for(int i = 0; i < n; i++) {
    s = s + v[i];
  }
  return s;
}

Thankfully, I actually managed to remember that in C++ the first element of a vector is the 0th element (as it is in Python and Javascript) rather than the 1st element (as it is in R and Matlab), so I didn’t fall for that trap! In any case, the corresponding call to cppFunction in R looks like this…

cppFunction("int vecsum(IntegerVector v) {
  int s = 0;
  int n = v.size();
  for(int i = 0; i < n; i++) {
    s = s + v[i];
  }
  return s;
}")

So now I have a function vecsum that I can call…

vecsum(v = 1:10)
## [1] 55

Yay! This is neat.

Following along in the tutorial, it turns out that Rcpp includes a lot of additional functions (“sugar”) that reimplement a variety of R equivalents, so there’s a lazier solution that wouldn’t normally be possible in C++:

cppFunction("int vecsum_alt(IntegerVector v) { return sum(v); }")
vecsum_alt(v = 1:10)
## [1] 55

Awesome! Apparently there’s a bunch of convenience functions like this. I have to admit I really appreciate that, and it’s a huge selling point for me… one of my biggest fears about trying to write my own compiled code was the thought of having to (badly) reimplement basic R functionality and making mistakes, and I’m super glad I don’t have to do that. So yay!
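Just to convince myself, here’s one more little sugar example of my own - I’m assuming mean() is among the sugar functions, which seems to be the case from the quick reference (vecmean is my own made-up name):

cppFunction("double vecmean(NumericVector v) { return mean(v); }")
vecmean(v = c(1, 2, 3, 10))
## [1] 4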

Including longer Rcpp code chunks

The next step in the tutorial is the sourceCpp function, which allows you to include an entire script that contains C++ code. RStudio has nice support for this so I can edit my .cpp script and my .R script together and it’s all terribly smooth. Of course, I’m writing this post using blogdown, but no matter: R markdown supports Rcpp in a completely straightforward fashion. If I want my code chunk to be interpreted as C++ then I just define the chunk using {Rcpp} instead of {r} and knitr will know how to handle it (as far as I can tell the magic is invoked via knitr::knit_engines?). So it’s easy enough for me to follow along with the tutorial from within R markdown.
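For my own reference, an Rcpp chunk in R markdown looks something like this - a minimal made-up example, where the chunk header is the only thing that differs from a regular R chunk:

```{Rcpp}
#include <Rcpp.h>
// [[Rcpp::export]]
int add_two(int x) { return x + 2; }
```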

Right, so continuing along, my next job is to define a vectorised function, one that will take a numeric valued input and multiply each element by 4. The tutorial suggests that this is what I need to do…

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector quadruple_me(NumericVector x) {
  return x * 4;
}

In this code chunk:

  • #include <Rcpp.h> includes the Rcpp header file, without which it would be rather difficult to do anything 😀
  • The using namespace Rcpp; part acts much like library() in R, and means I don’t have to use :: every time I want to use an Rcpp function
  • The // [[Rcpp::export]] comment is important: it’s what tells Rcpp that it needs to export the quadruple_me function to R. Otherwise the code will compile but nothing will be accessible from R, which would of course defeat the entire purpose of using Rcpp

Okay… that seems to have worked? Let’s take a look:

quadruple_me(10)
## [1] 40
quadruple_me(x = c(2, 1.25, -4, 6.1))
## [1]   8.0   5.0 -16.0  24.4
quadruple_me
## function (x) 
## .Call(<pointer: 0x109052400>, x)
## <bytecode: 0x7f9395144728>

The speed up can be quite substantial

The tutorial points out that some things aren’t worth bothering to speed up. For instance, if you want to calculate 1/(1+x) there are some speed benefits, but it’s maybe a 5x speed gain, and not really worth it in most cases. Besides, as the R compiler gets better with every subsequent release, the gap between the two keeps shrinking over time anyway.
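For what it’s worth, I think the comparison would look roughly like this (inv_r and inv_cpp are names I made up, and I haven’t benchmarked them myself):

inv_r <- function(x) 1 / (1 + x)   # plain vectorised R

cppFunction("NumericVector inv_cpp(NumericVector x) {
  int n = x.size();
  NumericVector out(n);
  for(int i = 0; i < n; i++) {
    out[i] = 1.0 / (1.0 + x[i]);
  }
  return out;
}")

x <- runif(1e6)
all.equal(inv_r(x), inv_cpp(x))
## [1] TRUE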

On the other hand there are a number of situations where there is a huge advantage. Recursive function calls in particular are very expensive in R, and that’s the exact issue that I’m facing: computing optimal MDP solutions via value iteration requires recursive updating, and it’s very slow if you try to do it in R without relying on clever tricks.

The example from the tutorial uses the Fibonacci sequence. One way to compute the n-th Fibonacci number would be to define the following fib_r function in native R code:

fib_r <- function(n) { 
  if(n < 2) {
    return(n)
  }
  return(fib_r(n - 1) + fib_r(n - 2)) 
}

This is spectacularly inefficient R code of course, since we’re computing the same thing many times over. The point of the exercise isn’t to write good R code here, it’s to highlight how much overhead goes into evaluating recursive function calls in R. So for comparison, here’s the same code using Rcpp:

#include <Rcpp.h>
// [[Rcpp::export]]
int fib_rcpp(int n) {
  if(n < 2) {
    return(n);
  }
  return(fib_rcpp(n - 1) + fib_rcpp(n - 2));
}

Comparing the two shows a very substantial speed up to the compiled version:

system.time(fib_r(30))
##    user  system elapsed 
##   0.867   0.006   0.893 
system.time(fib_rcpp(30))
##    user  system elapsed 
##   0.005   0.000   0.005

So the speed up here is something like a factor of 175. (The tutorial uses the rbenchmark package to do much the same thing, which I should probably learn at some point!)
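For later reference, I believe the rbenchmark version is about this simple - untested on my part, so treat it as a sketch:

library(rbenchmark)
benchmark(fib_r(25), fib_rcpp(25), replications = 10)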

Other things?

One other thing I noted while watching the tutorial is that at a later stage it would definitely be worth organising any Rcpp code I write into a package. Evidently that’s useful for parallelisation etc. But I’ll defer that to later. I haven’t built a package since the days when package.skeleton was the usual way to do things, and I feel like I should learn more about that too. In fact, the first 10 mins of the second half covers how to get started with writing a package that uses Rcpp and just incidentally…

OMG package creation is sooooo much easier than I remember it being back when I had to create one with package.skeleton and deal with the horror of the .Rd format. I might actually be willing to try writing packages again!!!
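If I’ve followed the tutorial correctly, the starting point for an Rcpp-based package is more or less a one-liner - something like this, though I haven’t actually tried it yet and the package name is made up:

Rcpp.package.skeleton("myFirstRcppPackage")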

Anyway, speaking of the second half of the tutorial, it’s a little longer - the recording is about 2 hours - and it’s focused on more advanced content that I’m not quite ready for yet…

I did take a quick peek at it though, browsing through it to see what I should expect further down the road. A few notes to myself for later…

  • Calling R functions from inside Rcpp can be useful. It’s still slow, of course, since it’s R code, but that’s not such a bad thing if it’s a one-off thing to initialise something (about the 18 minute mark in the tutorial)
  • The “sugar” features in Rcpp are really nice. For instance, they allow vector subsetting so, e.g., x[x > 0] is valid when writing Rcpp code (there’s a little sketch of this after the list).
  • Matlab style matrix operations are available via RcppArmadillo (at about the 20 minute mark)
  • Access to Boost via BH (~25 min). Note that there’s an example of a try-catch construction at around the 27 minute mark, which I’m sure I’ll need later.
  • There’s a discussion of parallelisation (~50 min), with specific discussion of RcppParallel at 54 mins. This is definitely worth revisiting because it talks about traps that are easy to fall into.
  • Later on there are some things directly relevant to me: Kalman filters at 1 hour 20 mins, and machine learning with MLpack at 1hr 30 min
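Just to check I understood the subsetting point, here’s a tiny sketch of my own (keep_positives is my made-up name):

cppFunction("NumericVector keep_positives(NumericVector x) { return x[x > 0]; }")
keep_positives(c(-2, 3.5, 0, 7))
## [1] 3.5 7.0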

… but that’s all further down the track. It’s just nice to know that it’s there!

A test of my learning! First passage times for Bernoulli random walks

Okay, let’s see if I’ve learned anything useful. I’ll try to implement a random number generator for the simple first passage time distribution for a Bernoulli random walk. This distribution arises in psychology as a tool for modelling human response time data, though it is considerably simpler than more successful models such as the linear ballistic accumulator and diffusion model, both of which are already available in R through the rtdists package. But whatever. I just want something simple enough to check that I understood the tutorial. Here’s my Rcpp code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector random_walk_fpt(int n, double drift, int boundary) {
  
  // initialise RT vector and state variable for the walk
  IntegerVector response_time(n); 
  int state;
  
  // possible step directions for the walk & corresponding probabilities
  IntegerVector step_dir = IntegerVector::create(-1,1);
  NumericVector step_prob = NumericVector::create(drift, 1-drift);
  
  // random walk model
  for(int i=0; i < n; i++) {
    state = 0;
    while(abs(state) < boundary) {
      response_time[i] = response_time[i] + 1;
      state = state + sample(step_dir, 1, false, step_prob)[0];
    }
  }
  
  // return RT vector
  return(response_time);
}

This defines a random_walk_fpt function that I can call from R. So I’ll generate 10000 random response times using this function and draw a picture:

response_time <- random_walk_fpt(n = 10000, drift = .55, boundary = 40)
hist(response_time, breaks = 50, col = "#88398A", border = "white")

It works! 🎉

That was a lot less painful than I was expecting.


  1. It’s a rough experience as a psychologist jumping straight from pretending to be a neuroscientist at one conference to pretending to be a statistician at the next, but in both places everyone was really kind and encouraging.
