A metamorphosis, part one
by Danielle Navarro, 27 Dec 2018
As Danielle’s Project awoke one morning from uneasy dreams she found herself transformed in her sleep into a git repository. She was lying in a soft RStudio project and when she lifted her head a little she could see her awkward history divided into strange blobs in the bottom of a .git folder from which she felt she could be checked out completely. Her numerous dependencies, which were pitifully fragile compared to the rest of her bulk, were swept wholesale into her body.
What has happened to me? she thought.
One of my goals for 2018 was to invest more time in learning new skills, particularly in relation to the tidyverse ecosystem in R, literate programming (rmarkdown, blogdown, xaringan, etc), and proper project management through git and GitHub. My reasons were partly selfish - these are things I like doing anyway - but also somewhat strategic. For the longest time I’ve felt unhappy with how difficult it’s been to properly document my computational modelling work, and I have the suspicion that it’s partly tied to these skills.
To me this really matters. At some point in the near future, when I’m building a new computational model, I want to be able to keep track of all the little decisions that I make during the process, so that later I (or someone else) can retrace my steps along the garden of forking paths. By tracking each little choice, I’m hoping I’ll be able to work out what the set of “plausible, counterfactual models I didn’t try out” might look like. My current workflow doesn’t support this well because when I’m building a model I do it quickly, in real time, and ideas about possible models come to me faster than I can document them. That’s a problem because when I do come up with a model that I like, I don’t have a very good record of the path that took me to it. Yes, I can then create a preregistration for the predictions that this model makes, which I can then apply to a new experiment, but even if I do this it’s a spectacularly weak test of the model, because the subsequent model testing won’t be against the “nearby counterfactual” models, it will be against weaker competitors that I constructed after the fact. To the extent that a “severe” test of a model requires you to pit your model against the best alternatives you can think of rather than the worst, it’s a huge pain that I don’t have this audit trail.
My hope, in the medium term, is to get into the habit of doing many quick commits during the model construction process (ideally never leaving RStudio), so that I can leave behind a proper trail of auditable breadcrumbs in the commit history without breaking my train of thought.
Obviously, for this to work, step one in this process has to be to start organising all my projects into git repositories and RStudio projects. As much as possible, I don’t want to have my code sprawling across multiple poorly documented folders, and my current process (keep everything in Dropbox) doesn’t do much to impose structure on me. Hence the metamorphosis… from a Dropbox-first mentality to a GitHub-first one. So let’s get this party started!
Wait what am I doing?
At the moment, my Dropbox-centric structure has top level folders for
other. It’s a bit pathological though because a lot of my teaching materials are on the web and a lot of my research is spread between the Dropbox research folder and my GitHub repositories that aren’t under Dropbox at all.
Hm… okay, looking at this, I’m not really sure that all of this really belongs in git repositories. Almost everything in the other folder is administrative work, notes to self, cheatsheets, budgets etc. Really, that should all stay in Dropbox. At a later date I should organise all that stuff properly because it’s a huge mess, but that’s a different problem.
Next question, what am I trying to achieve with the other content? It seems to me that I started this with a kind of naive “git = good, dropbox = bad” idea, which is really not the right attitude to take. A better perspective is that “git is for project management, dropbox is for everything else”. So the question now becomes, which parts of my website should count as a “project”, and how should they be carved into projects?
Right. For the moment, let’s ignore the idea of “top level” categories entirely (i.e., don’t bother trying to classify a project as “teaching” or “web” or “research”), and just pull out things that are meaningful units. My general rule:
Each project should be maximally self-contained, and my default should be to ensure that projects map one-to-one with git repositories.
Hm… looking at my current structure, there are a lot of things that break that rule. For instance, none of my teaching material fits that rule. My “cognitive science” class, for instance, groups the administrative information like class lists, grades etc (not suitable for public archiving) together with the source materials that I distribute (which should be part of a project), but then separates it from the actual webpage that does this distribution. Worse yet, the distribution of all of my class materials gets lumped together in a single webpage (compcogscisydney.org) that also archives my research materials etc etc. This won’t do at all.
One class, one repo, one static site
So what are the features of this section of my work?
- First, it’s very clear to me that the administrative content is not actually part of the project. It changes rapidly year to year, and (hypothetically, because I love UNSW) if I were to move to a different university the administrative content would not port over to my new institution, but everything else would remain relevant.
- Second, while it does make sense to have a teaching page on my lab webpage, the lab site shouldn’t actually be hosting all those classes. Each class is its own thing, so the teaching page on my lab site should link out to the classes rather than double as the home for those classes.
- Third, the backend to my lab website is massive overkill for this situation. I built it as a full-fledged web application using Google App Engine, but 99% of what I need it for is to be a static site, so all that back end is serving no purpose except in a few very very specific cases. I don’t want to build GAE sites for every bloody class I teach.
- Fourth, the static webpage for a class belongs naturally with the git repository that I use to develop the teaching materials… which seems like an obvious use for GitHub pages.
- Everything administrative stays in Dropbox, inside a
- The course content for each class is stored in a git repository, and served via GitHub.
So that gives me the following repositories…
and the corresponding pages…
Ah, but what about the “big” teaching resources, the Learning Statistics with R book and the R for Psychological Science classes? Again, both of these are spread over multiple locations, and they’re both being awkwardly hosted through the lab website. Those won’t work the same way will they?
Yes, yes they will. This is actually a problem we already solved for the Complex Human Data Summer School, and the solution is exactly the same as for the teaching pages. The only difference is that each of these is a big enough project that I feel weird having it hang off my GitHub pages site, but it’s pretty easy to buy a domain for each of them and then map the corresponding site to the domain. So again we have these repos:
and they map to
What about the blogs?
I’m really liking this approach. I have quite a few other blogs (and assorted sites) that I’ve been hosting as App Engine sites that are way too overpowered (and hence finicky) for this purpose. I started using App Engine for this because it was the only thing I knew - I need all that functionality when writing web-based experiments, but I never really learned anything else, so when I started making static sites I just copied what I did in the past. That was fine up to a point, but it’s just clunky. So let’s do the same thing for the blogdown sites:
Each of these is now a single repository and I’m now in the process of moving them all from Google App Engine to GitHub Pages.
In doing so, I incidentally discovered why sometimes my blogdown sites lose their styling when shown in the Viewer pane in RStudio, but only on my Windows machine and never on my Mac. For many themes, the paths to the CSS files are specified relative to the root directory using a path like /css/style.css. This can fail on Windows machines (because of the backslash issue, I think?), and it also fails if you’re doing what I do and pushing the blogdown site to a subdirectory of your GitHub Pages site. For instance, if you’re hosting the blog at username.github.io/reponame, the path to the CSS file points to username.github.io/css/style.css rather than where it needs to be, which would be username.github.io/reponame/css/style.css. Fortunately for me, when I map the site to its own domain (reponame.com) this problem vanishes, so my blogdown sites all work again!
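The subdirectory half of that problem is just standard URL resolution, and it can be reproduced outside of blogdown. Here is a minimal sketch in Python (not R, and not anything blogdown itself runs), using the placeholder names from the paragraph above. The https:// scheme and the trailing slashes are my assumptions for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URLs, reusing the placeholder names from the text.
project_page = "https://username.github.io/reponame/"
custom_domain_page = "https://reponame.com/"

# A root-relative path (leading slash) is resolved against the host,
# so the "reponame" subdirectory is dropped.
broken = urljoin(project_page, "/css/style.css")

# A path without the leading slash is resolved against the page's directory.
working = urljoin(project_page, "css/style.css")

# On a custom domain the site sits at the root, so the root-relative path works.
mapped = urljoin(custom_domain_page, "/css/style.css")

print(broken)
print(working)
print(mapped)
```

The first URL is the one that loses the styling. The second is where the file actually lives on a project page, and the third shows why mapping to a domain makes the leading slash harmless.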
Learn to love the (right kind of) Sprawl
One lesson I’m taking from this “one project, one repo, (maybe) one site” ethos is that my projects are sprawling across many different parts of the web, rather than many different parts of my folder structure. This kind of sprawl is good, I think, as long as I make sure that there is at least one site that collects all the links together in a somewhat coherent fashion… which is actually the intended function of compcogscisydney.org, so yay!
This is kind of nice - my lab website is returning (bit by bit) to its core function, and my projects are starting to look coherent and sensible on my machine.
What am I giving up?
There are no free lunches in life. It’s worth taking a moment to think about what I’m losing by doing this. There are three things that worry me:
- Structure: Dropbox lets me preserve my file hierarchy, but GitHub is flat. This wouldn’t be a problem, because I can structure folders locally, except that I move between laptops quite frequently, so I need the structure to be “on GitHub” somehow.
- Large files: Every time I change a large binary file, git stores a copy in the repository. That’s going to be a problem with some pdfs and powerpoint slides, especially given that I have to deal with the “shitty Australian internet” problem. Large repos are bad.
- Mass synchronising: Dropbox automatically syncs local copies, but GitHub doesn’t. That’s generally good, but occasionally I have to move to a new machine that I haven’t used in months and it would be wonderful to have a one-line “mass git pull” that would update all my repos on that machine.
I considered several possible solutions to this:
- Maybe I should use git submodules to preserve structure? NOOOOOOOOOOOOOO. Twitter was emphatic on that point. As Jenny Bryan put it “run, do not walk, away from the submodule idea”.
- Maybe I could use git large file storage to cope with the binary files issue? Twitter was kind of mixed on this one, but the dealbreakers for me were (a) discovering that GitHub places limits on how much you can use LFS, and (b) realising that it’s a bloody nightmare to remove LFS from a repo once you’ve started. You basically have to use git filter-branch and NOPE NOPE. The few repos I’d tried it on I had to burn to the ground and recreate from scratch just to make sure I’d nuked LFS from orbit.
Eventually I’ve settled on this. For the “large file” issue:
- In most cases, just live with the large file issue. Most PDF/PPTX files in my repositories don’t change very often, and they’re not that big. So don’t sweat it.
- In many other cases, the PDF files are output from something else, and the only reason they’re changing rapidly is that I’m changing the source code. In those cases there’s really no need to keep copies of the files at all, so .gitignore is your friend.
- In a few cases, there are massive files. For those, I’m planning to keep them in Dropbox, make them public through Dropbox, and then include an R script that links to them from the git repo. (thanks to Jacqueline Nolis for a closely related suggestion)
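For the “massive files” case, the helper script can be tiny. What follows is a rough sketch in Python rather than the R script mentioned above, and it is not code from any of my repos. The function names are made up for illustration, and it assumes that a public Dropbox link serves the raw file when its dl query parameter is set to 1.

```python
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlretrieve


def direct_download_url(shared_url: str) -> str:
    """Return the shared link with its `dl` query parameter set to 1.

    Assumption: dl=1 makes Dropbox serve the file itself, not a preview page.
    """
    parts = urlsplit(shared_url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_if_missing(shared_url: str, dest: Path) -> Path:
    """Download the file to `dest` unless a local copy already exists."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(direct_download_url(shared_url), str(dest))
    return dest


# Hypothetical usage (placeholder link, left commented out so nothing is fetched):
# fetch_if_missing("https://www.dropbox.com/s/abc123/slides.pdf?dl=0",
#                  Path("big-files/slides.pdf"))
```

The download folder would then go into .gitignore, so the repo only ever carries the link and never the file.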
To preserve my directory structure in a way that I’m happy with, I’ve done a hacky thing. I’ve got a little git repo (https://github.com/djnavarro/thegitplace) that helps with this.
It has a CSV file that keeps track of my directory structure and the GitHub paths they map to, and a few helper functions that will allow mass clone or pull. The idea here is that if I have to move to a new machine, I can just clone thegitplace and then use it to clone everything else into the appropriate directory (specified relative to the location of the thegitplace repo) on the new machine. It’s not perfect, and I have a lot of worries that I’m missing something important here, but at least I’m now satisfied that the “structure” problem and the “mass sync” problem are both solvable, so I’ll revisit the details later!
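To make the idea concrete, here is a minimal sketch of the “CSV plus helper” pattern in Python. It is not the code in thegitplace. The column names (local_dir, github_url), the example rows and the plan_sync function are all hypothetical, and the sketch only builds the git commands rather than running them.

```python
import csv
import io
from pathlib import Path


def plan_sync(csv_text: str, root: Path) -> list[list[str]]:
    """Return one git command per CSV row: pull if already cloned, else clone."""
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = root / row["local_dir"]
        if (target / ".git").is_dir():
            # The repo already exists on this machine, so update it in place.
            commands.append(["git", "-C", str(target), "pull"])
        else:
            # The repo is missing, so clone it into its place in the hierarchy.
            commands.append(["git", "clone", row["github_url"], str(target)])
    return commands


# Hypothetical CSV contents; the real file's columns and rows may differ.
EXAMPLE_CSV = """local_dir,github_url
teaching/some-class,https://github.com/username/some-class
fun/side-project,https://github.com/username/side-project
"""

# Paths are resolved relative to the parent of the current directory,
# mimicking "relative to the location of the helper repo".
for command in plan_sync(EXAMPLE_CSV, Path("..")):
    print(" ".join(command))
```

Each command could then be handed to subprocess.run(command, check=True). Keeping the planning step separate from the running step makes it easy to eyeball what a mass sync would do on a machine I haven’t touched in months.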
Tying up loose ends?
This structure captures most of my teaching/web content, but there are always strays.
My lab website is still being used to host working demos of all my experiments as well as copies of all my preprints. For now that’s fine, and in the case of my publications it’s important to have the files in that folder to help Google Scholar index them properly, but I worry about the experiments. I really feel like I need to start using Code Ocean or learning Docker. But frankly that’s a lot of work and I’m not ready for that yet so I’m going to kick that can down the road for a little longer.
I have a lot of cute side projects that don’t really belong anywhere, so I’ve bundled those into their own repositories…
… and those all live in a fun folder on my local machine.
Finally, there are a few loose ends, little scripts that I’ve been trying to find a home for, but never quite got around to. For those, prompted by this exchange with Alex Hayes & Lisa DeBruine, my solution is going to be to just dump them in a GitHub gist (example) that I can search up later. I’ve done that whenever I’ve wanted to share code with others, but “future me” is a perfectly sensible person to share with so I’m going to go with that.
So far, so good. But all this is really the prelude to the main piece. I’ve worked out a structure that I think will function pretty cleanly for everything that isn’t my research, so the next thing to work on is how to get my research into this framework!