
Day 55-62: R: The Boring Bits

It’s been a little longer than usual between posts. I’ve been extremely busy with work and, if you’ll excuse me taking a moment to celebrate, I’m really pleased with how it’s been going lately: I’ve wrapped up a few projects and submitted four preprints to PsyArXiv in the last month - here, here, here and here - so yay me! 🎉 It’s always nice to take a moment to celebrate when things actually seem to be going okay.

Plus, I’ve been devoting a fair bit of time to helping organise the R-Ladies Sydney launch event. It was such a great event (largely due to all the great work that Jen, Steph & Lisa put into organising it), the talks were fantastic, and I had a wonderful time!

The “random walk on CRAN” project, however, has been on hold for a bit - and in truth today’s post is a bit of a cop out because there’s no package here at all and barely anything resembling code. Instead, it’s some initial thoughts about how to revisit some of my teaching material.

Background

Now that the R-Ladies Sydney launch is done - and with it the very cool RCurious workshop that Charles Gray ran at UNSW - we’ve started thinking about other possible events. I think we’re planning to get our act together soon to make proper announcements about what events are likely to happen, but we’re not quite there yet.

Anyway… something that came up during the planning process was a suggestion the other organisers made that we look at some of the courses on Datacamp. Like pretty much every other cool thing about R, Datacamp didn’t exist when I started learning the language, so I’ve just started working my way through a few of their courses… and got sucked in enough that I decided to pony up for a subscription. In a future post I might talk about my experiences playing with these courses. Or alternatively, I’ve wanted to play around with the swirl package for a while.

For today though, I have a different goal…

For a little while now I’ve been toying with the idea of writing some introductory material of my own (or, more accurately, revisiting some of the introductory material from Learning Statistics with R and the related R for Psychological Science in a smaller workshop form that makes better contact with tidyverse). I haven’t gotten very far in doing that. We chatted a bit about maybe one day having something like this as part of the R-Ladies Sydney group, but it’s definitely not there yet so we’ll probably start out by thinking about how to work with existing resources :-)

But still, it’s nice to think about what we might do if we have more time. Plus, I’m teaching another intro to R class next semester, and while I’m almost certainly going to copy what I did last semester due to a lack of time, it’s nice to start dreaming about what I might do when I have the time to plan further ahead.

I’ve seen a few different approaches that other people have taken and am still trying to work out what I want to do. So for the purpose of today’s post, I thought I’d write out a first-pass skeleton of what the structure might look like. At the moment it’s really just a list of: topics that need to be covered; advanced concepts that need to be foreshadowed (“mentions”) because I’ve seen a lot of beginners run into them in the wild and it’s nice if they don’t get tripped up by them; and a few – not nearly enough – spots where hands-on exercises can be added.

R: The Boring Bits

The sales pitch…

Imagine a world in which there is R, but there is (mostly) no tidyverse and no RStudio. There are no fancy tools, just the primitive bits of R. How would you construct a data frame? What “things” is a data frame made from? How do we manipulate it? What is “$” anyway? What is a list? Why won’t R enthusiasts STFU about pipes? What’s the relationship between a data frame and a list? How do we write functions? How do we write simple programs that can work with our data? Why do people on twitter sometimes say “readr::read_csv” when they probably just mean “read_csv”? The goal in this workshop is to answer these kinds of questions.

In truth, there’s not a lot of fun stuff in this! There’s no data analysis & no pretty graphics. We won’t even cover R Markdown. It’s entirely focused on low-level nuts and bolts things, and the main point in doing this is to set the stage, so that people who are currently beginners will be able to follow the later workshops on [insert list of cool stuff]

We’ll cover the core data types, fundamentals of the language, and some basics of coding practice. In the process we’ll try to mention some standard traps that people fall into and explain the weirdness of common error messages (famously: object of type ‘closure’ is not subsettable. wtf???). Basically, you’ll be bored to tears but it’s the stuff that’s super handy later. Strictly speaking, the only assumption is that you have a working copy of R & RStudio, but it would be useful if you’ve had a bit of experience typing commands at the console.

(1) Getting started

Data types

  • (work at the console first)
  • Introduce numeric data (and arithmetic operators +, * etc)
  • Introduce character data (some operators work for this data e.g. <)
  • Introduce logical data (discuss logical operations ==, |, &, etc)
  • (mention) Integer data exists… because they will encounter 42L etc
  • (mention) The rather unhelpful soft binding of T <- TRUE
  • (mention) The use of class()
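
A rough sketch of what the console session for this part might look like. The variable names (x_num, x_chr and so on) are made up for illustration, and it only uses base R:

```r
# Numeric data and arithmetic operators
x_num <- 3 * 2 + 1
class(x_num)                     # "numeric"

# Character data: comparison operators such as < compare in sort order
x_chr <- "apple"
chr_less <- x_chr < "banana"     # TRUE

# Logical data and logical operators
x_lgl <- (2 == 2) & (1 > 3 | TRUE)   # TRUE

# Integer data: the L suffix asks for an integer rather than a double
x_int <- 42L
class(x_int)                     # "integer"
class(42)                        # "numeric"

# T is just a variable that starts out bound to TRUE, so it can be overwritten
T <- 0
t_is_true <- isTRUE(T)           # FALSE, because T now means 0
rm(T)                            # delete our copy; the original T is visible again
```

The last three lines are the reason for recommending TRUE and FALSE written out in full: those are reserved words and can’t be reassigned.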

Variables

  • Variables as labels for data
  • Creating variables with <-
  • (mention) =, ->, assign() are also possible
  • (describe) R behaves as if it’s pass by value: after x <- 3; y <- x; x <- 5, y returns 3, not 5
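
Something like the following could illustrate the variables material. It’s only a sketch, and the names w, x, y and z are arbitrary:

```r
# Four ways to bind a value to a name
x <- 3            # the usual assignment arrow
y = 4             # = also works for assignment at the top level
5 -> z            # the arrow can point the other way
assign("w", 6)    # assignment where the name is given as a string

# Copying a value: y gets the value of x, not a link to x
x <- 3
y <- x
x <- 5
y                 # still 3
```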

Packages

  • What is a package?
  • What does it mean to “install” one (with install.packages)
  • What does it mean to “load” one (with library)

Scripts

  • (a lie) Scripts are text files with commands executed from top to bottom
  • Tiny differences in default behaviour when you “source” a script vs type at console
  • (from now on we’ll work from scripts)

(2) Data structures

Vectors

  • Creating a vector with c()
  • Subsetting vector elements by position
  • Subsetting vector elements by name
  • Subsetting vector elements logically
  • Operations with vectors: adding, subtracting, etc
  • Vectors have length but no “shape” (dimension)
  • Some shortcuts: :, rep, seq
  • (mention) Negative indices tell R to drop values
  • (mention) The recycling rule: strangely R allows (1:4) + (1:3)
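
As a sketch, the vector topics above might be demonstrated along these lines. The scores vector and its names are invented for the example:

```r
# Create a named vector with c()
scores <- c(alice = 10, bob = 20, carol = 30, dan = 40)

scores[2]              # by position: bob's score
scores["carol"]        # by name
scores[scores > 15]    # logically: bob, carol and dan
scores[-1]             # a negative index drops alice

length(scores)         # 4
dim(scores)            # NULL: a vector has length but no shape

# Shortcuts for building vectors
1:4                    # 1 2 3 4
rep(1:2, times = 2)    # 1 2 1 2
seq(0, 1, by = 0.5)    # 0.0 0.5 1.0

# Recycling: the shorter vector is reused from the start.
# R warns here because 4 is not a multiple of 3; the warning is
# suppressed only to keep the example quiet.
recycled <- suppressWarnings((1:4) + (1:3))
recycled               # 2 4 6 5
```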

Matrices

  • Binding vectors into a matrix using rbind and cbind
  • Discuss names with rownames() and colnames()
  • Matrices have rectangular shape (i.e., have dim as well as length)
  • Indexing with [,]
  • (exercise) Use this to manually construct a cross-tabulation with names
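
One possible shape for the cross-tabulation exercise, as a sketch. The counts and the labels (cats, dogs, indoor, outdoor) are entirely made up:

```r
# Hypothetical counts, invented for illustration
cats <- c(12, 5)
dogs <- c(7, 9)

# Bind the vectors together as rows; the row names come from the variable names
tab <- rbind(cats, dogs)
colnames(tab) <- c("indoor", "outdoor")
tab

dim(tab)                  # 2 2: two rows, two columns
length(tab)               # 4: still four values underneath

tab["dogs", "outdoor"]    # a single cell: 9
tab[1, ]                  # the whole first row
tab[, "indoor"]           # the whole first column
```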

Lists

  • How to create a list using list()
  • “Hadley Wickham’s fine dining experience”: Understanding the difference between [ ] and [[ ]]
  • Indexing lists with $ operator
  • (mention) Under the hood, many things in R are secretly lists or objects that have “list-like” behaviour (e.g., an R command is treated as a “language” object that behaves a lot like a list if you know the tricks); emphasise that’s an advanced topic that they don’t have to care about
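
A minimal sketch of the [ ] versus [[ ]] distinction, using a made-up dinner list. The last few lines hint at the “language object” point and are strictly optional:

```r
dinner <- list(starter = "soup", mains = c("pasta", "curry"), guests = 4)

dinner["mains"]      # single brackets: a smaller list that contains the mains
dinner[["mains"]]    # double brackets: the character vector itself
dinner$mains         # $ is shorthand for [[ ]] with a name

class(dinner["mains"])      # "list"
class(dinner[["mains"]])    # "character"

# An unevaluated command can be indexed rather like a list
cmd <- quote(mean(x))
cmd[[1]]             # the function name, mean
cmd[[2]]             # the argument, x
```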

Data frames

  • Data frames are lists that have rectangular shape!
  • You can index them like a matrix [,] or like a list [ ], [[ ]], $
  • (mention) Tidyverse concept: tibbles are just data frames that avoid weird special cases
  • (exercise) Manually construct a small data frame for a simple data set
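
For the data frame exercise, something like this could work. The pets data set is invented for illustration, and it sticks to base R rather than tibbles:

```r
pets <- data.frame(
  name    = c("Rex", "Tom", "Bo"),
  species = c("dog", "cat", "dog"),
  age     = c(3, 7, 1),
  stringsAsFactors = FALSE   # spelled out for R versions before 4.0
)

is.list(pets)       # TRUE: a data frame is a list of columns
dim(pets)           # 3 3: but it also has a rectangular shape
length(pets)        # 3: the number of columns, as for a list

pets[2, "age"]      # matrix-style indexing: 7
pets[["age"]]       # list-style: the numeric vector
pets$age            # same thing
pets["age"]         # single brackets: a one-column data frame
```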

Other comments

  • Missingness (NA), undefinedness (NaN) and non-existence (NULL)
  • (mention) Factors are useful for data analysis but we’re skipping them for now
  • (mention) Formulas exist. Expect weirdness when you see ~ but don’t sweat the details now.
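
The three kinds of “nothing” are easier to tell apart with a few lines at the console. This is only a sketch, with an arbitrary vector x:

```r
# NA: a value exists but we don't know it
x <- c(1, NA, 3)
is.na(x)                  # FALSE TRUE FALSE
mean(x)                   # NA: the answer depends on the missing value
mean(x, na.rm = TRUE)     # 2: drop the missing value first

# NaN: the result of an undefined calculation
0 / 0                     # NaN
is.nan(0 / 0)             # TRUE

# NULL: nothing there at all
length(NA)                # 1: a missing value still takes up a slot
length(NULL)              # 0
c(1, NULL, 3)             # 1 3: NULL simply vanishes
```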

(3) Programming concepts

Flow control

  • Revisit the lie about scripts
  • Conditional branching with if and else
  • Constructing loops with for and while
  • (comment) Vectorised code is totally a thing, but a topic for the future
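
A small sketch of branching and looping. The rule inside the loop (add even numbers, subtract one for odd numbers) is arbitrary and only there to give if and else something to do:

```r
# A for loop with a conditional branch inside it
total <- 0
for (i in 1:5) {
  if (i %% 2 == 0) {
    total <- total + i    # even numbers get added
  } else {
    total <- total - 1    # odd numbers cost one
  }
}
total                     # 3

# A while loop keeps going until its condition is FALSE
n <- 1
while (n < 100) {
  n <- n * 2
}
n                         # 128

# The vectorised alternative to a loop, for later: no loop needed
sum((1:5)^2)              # 55
```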

Functions

  • Why do we bother writing functions? Modularity = mental health
  • How to write a function
  • How to set default arguments
  • Understanding how the function has its own workspace (a.k.a. “environment”)
  • Functions as objects
  • (mention) The idea of the scope chain: which variables are visible to who???
  • (mention) The dots argument ...
  • (mention) You can’t subset a function, which is an object of type “closure”, so mean[1] gives the classic error
  • (mention) Some functions are fancy, and do surprising things because object oriented programming (S3, S4, etc…) is a thing; ignore it for now
  • (breather) Don’t panic, most of these “mentions” aren’t important for beginners, they’re only discussed so that you don’t get weirded out when you see help documentation using ... or referring to “S3 methods” or some such.
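
Here is one way the function topics might be sketched. The functions greet, scale_it and count_args are hypothetical examples, not anything from an existing package:

```r
# A function with a default argument
greet <- function(name, greeting = "Hello") {
  greeting_text <- paste(greeting, name)   # lives only inside the function
  greeting_text
}
greet("Sam")                  # "Hello Sam"
greet("Sam", "Hi")            # "Hi Sam"

# Scope: a function can see variables defined outside it
scale_factor <- 10
scale_it <- function(x) x * scale_factor
scale_it(2)                   # 20

# Functions are objects, so they can be copied like any other value
shout <- greet
class(shout)                  # "function"

# The dots argument collects however many arguments are supplied
count_args <- function(...) length(list(...))
count_args(1, "a", TRUE)      # 3

# The classic error: mean is a closure, and closures can't be subset
typeof(mean)                  # "closure"
err <- tryCatch(mean[1], error = function(e) conditionMessage(e))
err                           # the "not subsettable" message
```

Wrapping mean[1] in tryCatch() is only there so that the script keeps running; at the console you would just type mean[1] and read the error.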

More core concepts

  • Using :: to call a function from a package you haven’t loaded with library() (because they see it on twitter)
  • Introducing pipes: %>% (because they are so terribly handy)
  • Operators are functions: 2+3 is the same as `+`(2,3)
  • Writing your own (useless) pipe %-->% operator, so that %>% doesn’t seem so magical
  • (mention) The existence of language objects and expressions. How does plot() know the names of your variables? How come you only need quotes some of the time… library(tidyverse) and library("tidyverse") both work, but install.packages(tidyverse) fails while install.packages("tidyverse") works? Answer: someone is trying to be clever, but it’s confusing if you don’t know the trick
  • (breather) Again, don’t panic. This is included only because smart people who don’t have programming experience often notice this inconsistency and (incorrectly) assume they’re the one making a mistake :-)
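
A sketch of how these ideas might look in code. The %-->% operator is the deliberately useless home-made pipe from the list above, show_name is a hypothetical helper, and nothing here needs the tidyverse to be installed:

```r
# :: calls a function from a package without library(); tools ships with R
tools::file_ext("data.csv")          # "csv"

# Operators are functions
2 + 3                                # 5
`+`(2, 3)                            # 5

# A home-made pipe: pass the left-hand side to the function on the right
`%-->%` <- function(lhs, rhs) rhs(lhs)
c(1, 4, 9) %-->% sqrt %-->% sum      # sqrt first, then sum: 6

# Language objects: a function can look at the code it was given,
# which is how some functions get away without quotes
show_name <- function(x) deparse(substitute(x))
show_name(my_variable)               # "my_variable", though no such object exists
```

The real %>% does a good deal more than this, but the core trick is the same: an operator is just a function with two arguments.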

Interacting with the file system

  • Using here package to define a project root (point out that R projects are fancier)
  • Loading & saving workspace files
  • Loading & saving csv files (though they probably already know that)

Good? Bad? Ugly?

This list was basically my first thought about how I want to approach it, and it’s pretty traditional in most ways. From a teaching point of view I feel like there’s a tension here that needs to be managed… keeping the “core” content separate from the advanced “mention” material. My experience has been that people get worried (during the teaching session) if they feel like they actually need to know the advanced stuff then and there, which argues for its removal, but then I’ve also seen a lot of cases where they take the intro material and then the moment they try something novel they run straight into one of those “mention” cases; if they haven’t been prepped to expect a bit of that, some people just assume that they didn’t understand the content (which they did, the content was just too limited) and then drop out completely. So I feel like when putting something like this together, I’m going to want extremely clear visual separation (different colours?) between the two kinds of content.

Eh, something to think about later! 😀

Danielle Navarro
Associate Professor of Cognitive Science
