On teaching tidyverse to novices

by Danielle Navarro, 20 Feb 2020

My first experience in teaching R to psychology students was in 2011. I used R as the programming language to support my introductory statistics class, and I stayed very close to base R. At the time, teaching R to psychology undergraduates was an almost heretical notion. My colleagues thought that programming would be too hard for my students, and that I’d end up with a student uprising on my hands. Thankfully that didn’t happen. The students found R a little harder to learn than SPSS, but not much harder: as long as I was willing to devote more of the class to teaching the basics of programming, they did fine, and they appreciated the utility of R more than they ever appreciated SPSS. In fact, as my lecture notes became more comprehensive – eventually becoming learningstatisticswithr.com – I found myself the accidental author of a rather successful base R book pitched at undergraduate psychology students.

I’ve been a little slow to adapt my teaching to tidyverse. I like tidyverse a lot, I use it all the time in my own work, but I’ve dithered about the approach I wanted to take. My first attempt to incorporate tidyverse into my teaching was conservative (see psyr.org): I taught base R programming first and tidyverse later. It worked reasonably well, but it was really striking to me that while the students liked learning base R (particularly when I realised I could use TurtleGraphics to make it fun), they loved it when I introduced tidyverse (mostly ggplot2 and dplyr). What they told me in feedback is that it was only once they started working with realistic data that they – as experimental scientists in training – felt like they were learning a “real” research skill.

On the one hand this made me happy because I’d taught them something useful. On the other hand, it was frustrating: they had to wait until Week 4 of my class to have that experience, and I’d wanted them to have that “omg yes this is cooooool!” moment earlier in the process. And so, following Mine Çetinkaya-Rundel’s advice, I’ve decided to reorganise my teaching completely in 2020 and begin with the tidyverse. If it goes well, this new approach is likely to form the core of my long-promised never-delivered “tidylsr book”.

Today was my first day teaching using the new approach, and I want to quickly take some notes that I can revisit.

Overall thought

It’s hardly an original observation, but students really like data visualisation. There is something viscerally satisfying about drawing a pretty graph, so there is a lot of value in starting the class by having them work on plotting real data. So my starting point for my class was to write these data visualisation slides:

Due to the slightly peculiar structure of my teaching allocation, I taught from these slides twice in one day. In both cases the level of enthusiasm in the classroom was much higher than it had been in the corresponding class the previous year. The reason is simple: by the end of a two-hour workshop they had gone from complete novices (zero knowledge of programming, not even the basics) to doing this:

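library(tidyverse)  # attaches ggplot2, which also supplies the mpg data
# scatterplot of highway mileage against engine size, plus a smoothed trend
picture <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = cyl)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))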

This is a big win. Unlike my previous attempts to teach R, this gives the students a real sense that they’re learning something useful. I’m hoping that will give me enough “buy in” from the students to sustain them later on when we get to the less fun parts.
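For anyone who wants to see how that plot is put together, here is a minimal sketch that builds it one layer at a time. The intermediate object names (canvas, with_points, with_trend) are my own illustration rather than anything from the slides, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)  # provides ggplot(), the geoms, and the mpg data set

# Step 1: an empty canvas tied to the data (no layers yet)
canvas <- ggplot(data = mpg)

# Step 2: add a layer of points, coloured by number of cylinders
with_points <- canvas +
  geom_point(mapping = aes(x = displ, y = hwy, colour = cyl))

# Step 3: add a smoothed trend line on top of the points
with_trend <- with_points +
  geom_smooth(mapping = aes(x = displ, y = hwy))

# Printing a plot object draws it; each step can be printed on its own
print(with_trend)
```

Keeping each step in its own object makes it easy to print the plot after every addition and watch it change, which is roughly the concrete-first route a live demo takes.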

Notes from Week 1

  • Using RStudio Cloud for teaching works really well and sidesteps all the painful installation troubleshooting issues. It also meant I could kick the can down the road regarding the file system and package management. Those should be “boring things for later”.

  • Both classes were 2-hour workshops in small groups (10-20 people), and the natural end point turned out to be the “exercise 3” slide. In both cases I spent the first 30 mins on the “welcome to the class” section and class discussion, and then 90 mins getting through that first half of data vis. So my guess (for now) is that my data visualisation slides actually make a 3-4 hour workshop in total?

  • Something that surprised me: the very first demo takes a long time because I’m stopping frequently to mention what keys I’m pressing, to foreshadow some basic concepts of computing, to highlight aspects of the RStudio IDE and so on. In fact, the natural end point to that demonstration is that they end up with a small script they’ve written themselves, complete with comments. So it’s quite long.

  • The foreshadowing tricks work. Almost too well, actually. That fourth exercise about “adding a regression line” naturally prompts them to ask for the answer, so I ended up building the multilayered plot with geom_point + geom_smooth before I’d formally introduced plot layers. It wasn’t what I’d intended, but in retrospect I think it works very well. It means that they’ve actually seen the layering mechanism in action before the theory is introduced: the ordering ends up being concrete-to-abstract rather than the other way around, and it seems to work.

  • Getting them to force an error (as I did in the first workshop but not the second) was a good move, I think: it gives me an insight into what kinds of things they’ll see, and allows me to talk about the “blah blah ERROR blah blah” fear problem with the class (see Jenny Bryan’s rstudio::conf 2020 talk). It also let me do the fun “class bingo” for the “object of type closure” error.

  • Deliberately “forgetting” to load tidyverse the first time was a good move: it let me talk a little bit about packages and foreshadow the content to come, without needing to draw too much attention to it.

  • Bringing up debugging tips organically during the DIY parts is helpful… the two most common errors I saw people encountering were typos and mismatched parentheses, but they showed up with a variety of error messages. Possibly worth highlighting this explicitly next time??

  • Because a lot of the content is actually “in” that first live coding exercise, the structure of the class ends up quite flexible. The “paired discussion/coding” approach meant that the two classes went in slightly different directions, but ultimately covered most of the same content.

  • My classes tend to be almost all women. In and of itself I don’t think that changes much, except that I think it’s helpful to deliberately use feminine-coded examples in order to balance the tediously masculine-coded defaults in a lot of programming contexts (e.g., I really don’t think starting with mpg is wise, but at least it lets me own my own total lack of knowledge about cars!).
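To make the error-related notes above a bit more concrete, here is a rough base R sketch that triggers three of the mistakes mentioned (a typo in a function name, a missing closing parenthesis, and the “closure” error) and captures the messages rather than halting. The helper error_message() and the misspelled gglpot() are made up for illustration, and the exact wording of the messages may differ between R versions and locales.

```r
# Hypothetical helper: run an expression and return its error message
# (or NA if it succeeds), so the script keeps going after each error.
error_message <- function(expr) {
  tryCatch(
    {
      expr  # the expression is only evaluated here, inside tryCatch()
      NA_character_
    },
    error = function(e) conditionMessage(e)
  )
}

# A typo in a function name. Forgetting library(tidyverse) before calling
# ggplot() produces the same kind of "could not find function" message.
typo_msg <- error_message(gglpot(data = mpg))

# Mismatched parentheses: the code cannot even be parsed, so parse() is
# used here to trigger the syntax error from a string.
paren_msg <- error_message(parse(text = "ggplot(data = mpg"))

# Treating a function as if it were a data object: the "closure" error.
closure_msg <- error_message(mean$x)

print(typo_msg)
print(paren_msg)
print(closure_msg)
```

Seeing the three messages side by side is one possible way to show a class that quite different-looking errors often come from the same small set of slips.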

Week 2