
Days 22-23: Meddling in the affairs of wizards

For the longest time I have been terrified of working with some of the low level features of R. Back when I started I don’t think there was an #rstats twitter community, I didn’t have anyone else around me using R, and I found the R mailing lists pretty intimidating because, well, a lot of the comments felt, um, slightly less than welcoming shall we say? So very often I would be digging around by myself in the documentation, and in doing so I would be referred to The White Book or The Green Book or some other tome of forgotten lore (here is a nice summary of the coloured R books). I tried to read them, honest I did. But I didn’t have the time, the training, or the people around me to make it feasible, so I have struggled to grasp what the hell quote, eval, substitute, deparse, etc. actually do. I’ve had to use them on occasion, but I have always had the strong feeling that I am meddling in the affairs of wizards when I do.


Happily, it turns out that either the world has changed or I have!

Among other things, the tidyverse has come along and made a lot of R programming easier to work with, there’s more of a community (or at least one that is accessible to me!), and best of all there is the Advanced R book by the always-incredible Hadley Wickham… it has a whole section on metaprogramming, and this entire post is me trying to summarise what I’ve learned from reading just the first part.

I don’t want to exaggerate here but, well, reading through it had kind of the same feeling as watching the opening credits to The Sopranos and hearing this music…

I’m not sure that’s healthy. Woke Up This Morning isn’t exactly the most upbeat song in the world - and I think the take-home message of the clip is that the character I immediately empathise with is probably the one that gets shot at the end - but it makes me feel cool and awesome and dangerous just listening to it anyway, and to be perfectly honest the risk of being shot for no apparent reason feels like a pretty apt metaphor for my fears about metaprogramming too. 😀

So I’m going to risk it and do this thing anyway!

Clever girl

One of the things that always puzzled me from the beginning was how this works:

my_data <- rnorm(1000) # some normally distributed observations
hist(my_data)          # a histogram

It’s so simple and clean that I almost always use some variant of it in my “getting started in R” classes, just to show people how easy it is to generate and plot simulated data. Yet, in all honesty, I am confused by this code…

There is some secret wizardry going on here.

I had always thought that R is a pass-by-value language. So when I pass my_data to the hist function, the “thing” that gets handed over to the function isn’t the label (i.e. “my_data”), it is the set of underlying values (i.e. “0.02743”, “1.5683” etc). This is a nice, neat, simple way of explaining how functions work. But as H. L. Mencken reminds us,

Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.

If this explanation were true, how the fuck does hist know how to title the bloody graph????

THIS HAS BEEN BOTHERING ME FOR YEARS. Somehow the function has the ability to access both the value passed to it via the arguments and the code that produced it. I’m still learning, but here is my understanding of what is happening.

When the hist function is called with my_data as the argument, the thing that is being passed here is not strictly speaking the actual variable. It is a “promise” object. R is in effect agreeing that it will evaluate the “expression” (i.e. my_data) that it has been passed but it is lazy - it doesn’t actually do anything with the expression until it absolutely has to. At the user level we never really notice this happening because almost every time we try to do something with a variable we are creating R expressions that the language actually does have to evaluate: I can’t possibly answer x + 1 without actually doing something with x so the promise is filled the moment I try to print out the answer or whatever. My understanding is that the main reason for this is efficiency - don’t waste memory or CPU time doing things that you don’t actually need.
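This mental model can actually be checked directly: if the function body never uses its argument, the promise is never forced, so even an expression that would throw an error goes completely unnoticed. A tiny sketch (never_looks is a made-up name, just for illustration):

```r
# the argument arrives as a promise; since the body never touches x,
# the expression is never evaluated
never_looks <- function(x) {
  "I never touched x"
}
never_looks(stop("this error never fires"))
## [1] "I never touched x"
```

No error is raised, because R never had any reason to evaluate the stop() call.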

However, another nice property (though an easily abused one) of all this is that – because the expression is the thing being passed to the hist function – it is possible to grab the expression without evaluating it. This is what the substitute function does. Normally, any time you interact with an expression in R it gets evaluated, but substitute returns the unevaluated expression itself. This function pairs very nicely with deparse, which takes an unevaluated expression and converts it to regular text. So I can write a function like this…

codeText <- function(expr) {
  paste(
    deparse(substitute(expr)), # the unevaluated call as a string
    "evaluates to", 
    deparse(expr)              # the evaluated call as a string
  )
}

So now if I define some variables x and y and ask codeText to tell me something about x^y I get this:

x <- 3
y <- 4
codeText(x^y)
## [1] "x^y evaluates to 81"

Oooh, nice!

I can use this for other mathematical expressions…

codeText(sin(x)+cos(y))
## [1] "sin(x) + cos(y) evaluates to -0.512523612803745"

I can use it for other R function calls…

codeText(rep.int(x,y))
## [1] "rep.int(x, y) evaluates to c(3, 3, 3, 3)"

Yesssssss. Excellent. So this, apparently, is the sneaky trick used by the plotting functions to grab the variable names and use them in the labels. Very clever.
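Out of curiosity, the trick can be sketched in a toy version. (labelled_hist is a made-up name, not what hist actually calls internally, but my understanding is that it’s the same deparse + substitute idea:)

```r
# a minimal sketch of how a plotting function can use the *name* of its
# argument as well as the values
labelled_hist <- function(x) {
  xname <- deparse(substitute(x))  # e.g. "my_data"
  hist(x, main = paste("Histogram of", xname), xlab = xname)
}

my_data <- rnorm(1000)
labelled_hist(my_data)  # title reads "Histogram of my_data"
```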

My self-image right now:


Language games

R provides you with ways of directly creating promise objects using the delayedAssign function. Normally when I assign a variable using <- or assign, the new variable takes on its value at that time. The delayedAssign function lets me mess with that. So normally I would do this

city <- "Adelaide"
home <- city
city <- "Sydney"
print(home)
city <- "Melbourne"
print(home)
## [1] "Adelaide"
## [1] "Adelaide"

The home variable has value Adelaide because that was the value of city at the time I assigned myself a home. That’s exactly how R normally does things and not surprisingly the assign function does the same:

city <- "Adelaide"
assign("home", city)
city <- "Sydney"
print(home)
city <- "Melbourne"
print(home)
## [1] "Adelaide"
## [1] "Adelaide"

In both cases the home variable is given its value the moment the <- operator or the assign function is used. The delayedAssign function, on the other hand, defers the assignment operation until the first time the home variable is actually used for something. Printing it out is sufficient to trigger the assignment, so we get this:

city <- "Adelaide"
delayedAssign("home", city)
city <- "Sydney"
print(home)
city <- "Melbourne"
print(home)
## [1] "Sydney"
## [1] "Sydney"

Notice that home remains Sydney for both calls to print. The delayedAssign function doesn’t turn home into a pass-by-reference style assignment, where changing the original object (city) would change the value of any subsequent assignments. We aren’t doing object oriented programming here I guess, we’re just delaying the pass-by-value action.

I guess I’m stuck as a Sydney R-Lady now. Oh dear. How ever shall I cope?
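One more wrinkle worth checking: the assignment happens at the first use, and only the first use. A quick sketch of the forcing timing, with the same variables as above:

```r
city <- "Adelaide"
delayedAssign("home", city)
print(home)  # the first use forces the promise while city is still Adelaide
city <- "Sydney"
print(home)  # the promise is already forced: later changes to city are ignored
## [1] "Adelaide"
## [1] "Adelaide"
```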

Controlling where an expression evaluates

Next up in my list of “things that Dani has been confused by for ages and Hadley Wickham explains in a way that makes it seem bloody obvious”… quote and eval. I’ll be honest. For the longest time I have been entirely at sea trying to understand what in the ever-living fuck the point of the quote function is. I mean, look at this:

quote(x)
## x

Uh huh. Great. This is me right now:

The thing I’d been missing is that the quote function (which is rather like the substitute function but somewhat simpler in a bunch of ways that Advanced R explains nicely) is really just a useful tool to prevent R from immediately evaluating an expression, and pairs very nicely with the eval function.
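To convince myself that quote is actually doing *something*, it helps to see that it returns a language object - a thing I can store in a variable, inspect, and evaluate later:

```r
e <- quote(x + 1)  # capture the expression without evaluating it
class(e)           # it's a "call" object, not a number
## [1] "call"
x <- 41
eval(e)            # evaluate the stored expression whenever I like
## [1] 42
```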

Okay so the eval function takes an expression and evaluates it:

x <- 1
eval(x+1)
## [1] 2

Uh huh. Again. Nice but this is still me right now:

However, the eval function also has an envir argument that lets you specify where the evaluation should take place. You can do it in the global workspace, or inside a list or data frame, or even a custom environment that you create. Okay so let’s do this. First I’ll create a data frame df that contains a city variable, and I’ll also have a variable called city in my workspace.

df <- data.frame(
  city = c("Adelaide","Sydney"),
  pop = c(1.2, 4.4)
)
city <- "Melbourne"

So next I’ll try to evaluate the expression city == "Melbourne" inside my data frame df:

eval(city == "Melbourne", envir = df)
## [1] TRUE

Okay WTF. That did not do what I wanted it to do. Why? Well, if I’d bothered to actually read the section on non-standard evaluation properly I would have realised the problem. The expression I’ve fed to eval is being evaluated too early. I need to quote it to protect it from being evaluated prematurely:

eval(quote(city == "Melbourne"), envir = df)
## [1] FALSE FALSE

Aha!

In effect quote and eval are opposites of each other, so evaluating a quoted expression behaves the same way as the original expression. By pairing the two and specifying the envir argument to eval, we ever so subtly break this symmetry: the expression city == "Melbourne" is evaluated in a different environment (or in this case, the data frame df) from the one in which the output value is returned (the global workspace). As it turns out I could have just used evalq as a shorthand, and sure, I could have done the immensely easier thing of typing df$city == "Melbourne" as my command, but I can see how the eval + quote trick allows a lot of flexibility to do clever things… though I can also see how this is super dangerous!
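For the record, the evalq shorthand really does behave the same way. Redefining df and city here so the snippet stands on its own:

```r
df <- data.frame(
  city = c("Adelaide", "Sydney"),
  pop = c(1.2, 4.4)
)
city <- "Melbourne"

# evalq quotes its expression automatically...
evalq(city == "Melbourne", envir = df)
## [1] FALSE FALSE

# ...so it's equivalent to the long-hand version
eval(quote(city == "Melbourne"), envir = df)
## [1] FALSE FALSE
```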

To be continued!

This is all really useful, and I feel like for once I have a better handle on what these weird functions are actually doing. Yay! There’s quite a bit more to non-standard evaluation that gets discussed in the Advanced R book, but that will do for now. Next up on my metaprogramming reading list: the section on expressions.

It’s all a bit scary, but as I was reminded by Grace Hopper, John Shedd, and others this morning when logging into slack…

A ship in harbor is safe, but that is not what ships are built for.

Danielle Navarro
Associate Professor of Cognitive Science
