Maps maps maps

by Danielle Navarro, 10 Oct 2019



A little while back, Hadley Wickham made the very foolish decision to let me help out with editing a third edition of his fabulous ggplot2 book. Personally I wouldn’t entrust me with anything so important, but seeing as he’s given me the vote of confidence I’m trying not to make too much of a mess. I’ve been enjoying it a lot as a writing exercise. Normally when I write articles, blog posts or even the occasional book, I’m the person (or at least one of the people) who has to make hard decisions, control the direction of the project, think about the audience etc. This time around, I’m not in that position and I have the luxury of passing off decisions I don’t want to make to the boss!

It’s awesome.

My first task in for this project was to make some extensions to the annotations chapter, which was cool because I discovered functionality in ggplot2 that I had no idea even existed, and also got to play around with a whole lot of other packages that I’ve never really looked at before. Good times!

My current job, however, is rewriting the maps chapter… also super fun, but I’m very much not a geospatial person myself so I’m having to learn the whole thing on the fly, and it’s times like this that I’m especially glad to be working with people who really do know what they’re doing! It’s a work in progress: the way we’ve been doing it is I make some edits in my forked copy and Hadley reviews them. This is a workflow I really like, and leaves me free to play around with whatever I like in my forked version.

That being said, I’m kind of inspired by Thomas Lin Pederson (again!) who is in the process of contributing some more technical chapters to the book. I really like his approach of soliciting comments from other people on twitter, and thought I might do the same. However, like I said, because this is a work in progress the “Danielle discovers that maps are hard!” section is lying around in a pull request at the moment, and indeed it’s not at all obvious to me how much (if any) of it will end up in the book. But I’ve had a lot of fun writing, so I figured I’d adapt parts of it to work as a blog post. Plus, since it’s a blog post I can be a bit more informal than I would be in a book. So here goes!

library(ggplot2)
library(dplyr)
library(ozmaps)
library(sf)
library(slumdown)
library(conflicted)
conflict_prefer("filter", "dplyr")
slum_plot_background("sky")

Prelude: Maps are hard

As a child I loved maps. I collected maps that came with my parents’ National Geographic subscription and plastered them all over my bedroom. Maps are awesome. As an adult, however, I have come to fear maps. Not because maps are bad, but because they are hard and I am not good at making them. I’m slowly learning, but oh my god I have made some absurd mistakes. A little while ago I had to create a world map visualising the nationalities of students at UNSW, as part of our Harmony Day events this year.

It didn’t seem like a hard job initially but I quickly ended up mired in all sorts of scary decisions: one data source represented Hong Kong and Taiwan as distinct nations from China, and another source collapsed them.1 I certainly do have some political opinions on this topic but geopolitics is so far above my experties and my pay grade that I was deeply uncomfortable about the whole thing. Never has a dplyr::case_when() call felt so distressing to me.

Also, I was so bad at the drawing of the maps that in order to place Australia in a more central location I had to delete Greenland entirely. I’m sorry Greenland! I love you, I do, it’s solely my incompetetence at work…

Plus there’s this particular monstrosity that arose when I was playing around with map projections, and the less said about that the better…

Since then, I’ve been trying to learn from my mistakes. For the moment I’ve been working on understanding vector maps, which are typically encoded using the “simple features” standard produced by the Open Geospatial Consortium.




Wait, they don’t love you like I love you
Wait, they don’t love you like I love you
Maps
Wait, they don’t love you like I love you
    - Maps, by Yeah Yeah Yeahs

Part I: Mapping with simple features data

I admit to having a love-hate relationship with simple features. The sf package developed by Edzer Pebesma https://github.com/r-spatial/sf provides an excellent toolset for working with such data, and the geom_sf() and coord_sf() functions in ggplot2 are designed to work together with the sf package. It’s a truly wonderful package, and it’s nicely documented too… it’s just that everything about geospatial data turns out to be more complicated than it looks so even though sf is really, really good, it’s still a huge pain to work with.

To introduce the idea of mapping with sf and ggplot2, I’ll use the ozmaps package by Michael Sumner https://github.com/mdsumner/ozmaps/ which provides maps for Australian state boundaries, local government areas, electoral boundaries, and so on. To illustrate what an sf data set looks like, I’ll import a data set depicting the borders of Australian states and territories:

oz_states <- ozmaps::ozmap_states
oz_states
## Simple feature collection with 11 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 112.9194 ymin: -54.75042 xmax: 159.1065 ymax: -9.240167
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## # A tibble: 11 x 4
##    name          type    adm1_code                                 geometry
##    <chr>         <chr>   <chr>                           <MULTIPOLYGON [°]>
##  1 Macquarie Is… <NA>    AUS+00?   (((158.8657 -54.74993, 158.8382 -54.750…
##  2 Jervis Bay T… Territ… AUS-1932  (((150.6131 -35.18727, 150.6364 -35.144…
##  3 Northern Ter… Territ… AUS-2650  (((136.6955 -15.73992, 136.6636 -15.778…
##  4 Western Aust… State   AUS-2651  (((122.2469 -34.16245, 122.2379 -34.163…
##  5 Australian C… Territ… AUS-2653  (((149.3818 -35.34875, 149.367 -35.3573…
##  6 New South Wa… State   AUS-2654  (((150.7038 -35.12044, 150.6735 -35.124…
##  7 South Austra… State   AUS-2655  (((137.6229 -35.58213, 137.6343 -35.590…
##  8 Victoria      State   AUS-2656  (((146.4898 -38.7457, 146.5347 -38.7559…
##  9 Queensland    State   AUS-2657  (((153.4873 -27.41522, 153.5011 -27.417…
## 10 Norfolk Isla… Territ… AUS-2659  (((159.0689 -31.52093, 159.0833 -31.534…
## 11 Tasmania      State   AUS-2660  (((147.364 -43.37933, 147.3655 -43.3878…

The output prints out some of the metadata associated with the data set (discussed momentarily), and shows that the data set is essentially a tibble with 11 rows and 4 columns.

One advantage to sf data is immediately apparent: Australia is comprised of six states, four territories and Macquarie Island, which is politically part of Tasmania. There are 11 distinct geographical units, and accordingly there are 11 rows to this data set. That seems more intuitive to me than a “longer” data set that contains one row per polygon vertex. This data isn’t supposed to be a representation of a set of points it’s a representation of a set of Australian states. To my mind, a key principle of data analysis is to avoid “unpacking” the data into more primitive units unless you have a need to! The sf package makes this possible.

So let’s dive in a little more. The most important column in this data set is the geometry variable, which specifies a spatial “geometry” for each of the Australian states and territories. Each element in the geometry column is a “multipolygon” object and — as the name suggests — contains data specifying the vertices of one or more polygons that demark the border of a state or territory. Given data in this format, we can use geom_sf() and coord_sf() to draw a serviceable map without specifying any parameters or even explicitly declaring any aesthetics:2

ggplot(oz_states) + 
  geom_sf() + 
  coord_sf() + 
  theme_slum("sky")

This is a little unusual for ggplot2. Normally you wouldn’t be able to plot anything at all without using aes() to specify how the data variables map onto visual aesthetics but, as always… maps are different, and weird.

To understand why this works, note that geom_sf() relies on a geometry aesthetic that is not used elsewhere in ggplot2. This aesthetic can be specified in one of three ways. In the simplest case (illustrated above) when the user does nothing, geom_sf() will attempt to map it to a column named geometry. However, if the data argument is an sf object then geom_sf() is able to automatically detect a column that specifies a simple features geometry (discussed below), and constructs the mapping using that column regardless of what it is named. The third possibility is that the user can specify the mapping manually in the usual way with aes(geometry = my_column). The coord_sf() function governs the map projection, discussed later.

Layered maps

In some instances you may want to overlay one map on top of another. The ggplot2 package supports this by allowing you to add multiple geom_sf() layers to a plot. The code below draws a plot with two map layers: the first uses the oz_states data to fill the states in different colours, and the second uses the oz_votes data to draw the electoral boundaries:

oz_votes <- ozmaps::abs_ced

ggplot() + 
  geom_sf(
    data = oz_states, 
    mapping = aes(fill = name), 
    show.legend = FALSE
  ) +
  geom_sf(data = oz_votes, fill = NA) + 
  coord_sf() + 
  theme_slum("sky")

It is worth noting that the first layer to this plot maps the fill aesthetic in onto a variable in the data. In this instance the name variable is a categorical variable and does not convey any additional information, but the same approach can be used to visualise other kinds of area metadata. For example, if the oz_states data had an additional column specifying the unemployment level in each state, we could map the fill aesthetic to that variable.

Labelled maps

Adding labels to maps is an example of annotating plots and is supported by the geom_sf_label() and geom_sf_text() functions. For example, while an Australian audience might be reasonably expected to know the names of the Australian states (and are left unlabelled in the plot above) few Australians would know the names of different electorates in the Sydney metropolitan region. With this in mind, the plot below zooms in on the Sydney region by specifying xlim and ylim in coord_sf() and then uses geom_sf_label() to overlay each electorate with a label:

ggplot(oz_votes) + 
  geom_sf() + 
  coord_sf(xlim = c(150.97, 151.3), ylim = c(-33.98, -33.79)) + 
  geom_sf_label(aes(label = NAME), label.padding = unit(1, "mm")) + 
  theme_slum("sky")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
## not give correct results for longitude/latitude data

Adding other geoms

Though geom_sf() is special in some ways, it nevertheless behaves in much the same fashion as any other geom, allowing additional data to be plotted on a map using standard geoms. For example, we may wish to plot the locations of the Australian capital cities on the map using geom_point(). The code below illustrates how this is done:

oz_capitals <- tibble::tribble( 
  ~city,           ~lat,     ~lon,
  "Sydney",    -33.8688, 151.2093,  
  "Melbourne", -37.8136, 144.9631, 
  "Brisbane",  -27.4698, 153.0251, 
  "Adelaide",  -34.9285, 138.6007, 
  "Perth",     -31.9505, 115.8605, 
  "Hobart",    -34.9285, 138.6007, 
  "Canberra",  -35.2809, 149.1300, 
  "Darwin",    -12.4634, 130.8456, 
)

ggplot() + 
  geom_sf(data = oz_votes) + 
  geom_sf(data = oz_states, colour = "black", fill = NA) + 
  geom_point(
    data = oz_capitals, 
    mapping = aes(x = lon, y = lat), 
    colour = "red",
    size = 4
  ) + 
  coord_sf() + 
  theme_slum("sky")

In this example geom_point is used only to specify the locations of the capital cities, but the basic idea can be extended to handle point metadata more generally. For example if the oz_capitals data were to include an additional variable specifying the number of electorates within each metropolitan area, we could encode that data using the size aesthetic.

Map projections

In the book proper (at least as the draft currently stands), the chapter on mapping opens with a simple example – taken from the 2nd edition of the book – in which Hadley drew a couple of simple maps by plotting longitude and latitude on a Cartesian plane, as if geospatial data were no different to other kinds of data one might want to plot. This was how I was originally taught to draw maps, and to a first approximation this is an okay thing to do, but when one becomes genuinely concerned about accuracy in a way that a real geospatial person might, it no longer suffices.

There are two fundamental problems with the approach.

The first issue is the shape of the planet. The Earth is neither a flat plane, nor indeed is it a perfect sphere. As a consequence, to map a co-ordinate value (longitude and latitude) to a location we need to make assumptions about all kinds of things. How ellipsoidal is the Earth? Where is the centre of the planet? Where is the origin point for longitude and latitude? Where is the sea level? How do the tectonic plates move? All these things are relevant, and depending on what assumptions one makes the same co-ordinate can be mapped to locations that are tens of meters apart. The set of assumptions about the shape of the Earth is referred to as the geodetic datum and while it might not matter for some data visualisations, for others it is critical. There are several different choices one might consider: if your focus is North America the “North American Datum” (NAD83) is a good choice, whereas if your perspective is global the “World Geodetic System” (WGS84) is probably better.

The second issue is the shape of your map. The Earth is approximately ellipsoidal, but in most instances your spatial data need to be drawn on a two dimensional plane. It is not possible to map the surface ellipsoid to a plane without some distortion or cutting, and you will have to make choices about what distortions you are prepared to accept when drawing a map. This is the job of the map projection.

Map projections are often classified in terms of the geometric properties that they preserve: area-preserving projections ensure that regions of equal area on the globe are drawn with equal area on the map, shape-preserving (or conformal) projections ensure that the local shape of regions is preserved, and so on. It is not possible for any projection to be shape-preserving and area-preserving. It is a little beyond the scope of this post (or indeed the ggplot2 book, I suspect) to discuss map projections in detail. As a more detailed source, I’m kind of in love with Geocomputation with R https://geocompr.robinlovelace.net/ by Robin Lovelace, Jakub Nowosad and Jannes Muenchow. However, I will note one very important property, which is that the simple features specification allows you to indicate which map projection you want to use.

Taken together, the geodetic datum (e.g, WGS84), the type of map projection (e.g., Mercator) and the parameters of the projection (e.g., location of the origin) specify a coordinate reference system, or CRS, a complete set of assumptions used to translate the latitude and longitude information into a two dimensional map. An sf object often includes a default CRS, as illustrated below:

st_crs(oz_votes)
## Coordinate Reference System:
##   EPSG: 4283 
##   proj4string: "+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"

As this output illutrates, the CRS can be described in two different ways. You can either use a numeric “EPSG code” (see http://www.epsg-registry.org/) or you can use the “proj4string”, a more verbose and (slightly) more human-readable format. In ggplot2, the CRS is controlled by coord_sf() to ensure that every layer in the plot uses the same projection. By default, coord_sf() uses the CRS associated with the geometry column of the data (or, if there are multiple data sets with a different associated CRS, the first layer in the plot to do so). Because sf data typically supply a sensible choice of CRS, this process usually unfolds invisibly, requiring no intervention from the user. However, should you need to set the CRS yourself, you can specify the crs parameter. Some careful thought is required to do so, however, as the plot below illustrates:

ggplot(oz_votes) +
  geom_sf() + 
  coord_sf(crs = st_crs("+proj=lcc +datum=WGS84")) + 
  theme_slum("sky")

In this map I have used the WGS84 datum (quite reasonably) and Lambert conformal conic (LCC) projection, which is often used in aeronautical applications because straight lines on the map are approximate “great circles” on the globe, and it is generally considered a good projection for regional maps in the middle latitudes. However, the map looks terrible because I haven’t set the parameters very wisely. To fix this, I centre the map at longitude 140 and latitude -25, and fix the two standard parallels (lines at which there are no distortions) at latitudes -18 and -36.

crs <- "+proj=lcc +datum=WGS84 +lat_0=-25 +lon_0=140 +lat_1=-18 +lat_2=-36"
ggplot(oz_votes) +
  geom_sf() + 
  coord_sf(crs = st_crs(crs)) + 
  theme_slum("sky")

To an Australian reader this particular projection might look very familiar, as it is widely used in various government applications (I think!)

Working with sf data

One reason to prefer simple features over other representations of spatial data is that geographical units can have complicated structure. A good example of this in the Australian maps data is the electoral disctrict of Eden-Monaro. To see why, the code below plots the electoral map of Australia, zooming in on Eden-Monaro:

p <- ggplot(oz_votes) +   
  geom_sf(aes(fill = NAME == "Eden-Monaro"), show.legend = FALSE) + 
  theme_slum("sky")

p + coord_sf(xlim = c(147.75, 150.25), ylim = c(-37.5, -34.5)) 

p + coord_sf(xlim = c(150, 150.25), ylim = c(-36, -36.3)) 

As this illustrates, Eden-Monaro is defined in terms of two distinct polygons, a large one on the Australian mainland and a small island. However, the large region has a hole in the middle (the hole exists because the Australian Capital Territory is a distinct political unit that is wholly contained within Eden-Monaro, and as illustrated above, electoral boundaries in Australia do not cross state lines). In sf terminology this is an example of a MULTIPOLYGON geometry. In this section I’ll talk about the structure of these objects and how to work with them

First, let’s use dplyr to grab only the geometry object:

edenmonaro <- oz_votes %>% 
  filter(NAME == "Eden-Monaro") %>% 
  pull(geometry)

The metadata for the edenmonaro object can accessed using helper functions. For example, st_geometry_type() extracts the geometry type (e.g., MULTIPOLYGON), st_dimension() extracts the number of dimensions (2 for XY data, 3 for XYZ), st_bbox() extracts the bounding box as a numeric vector, and st_crs() extracts the CRS as a list with two components, one for the EPSG code and the other for the proj4string. For example:

st_bbox(edenmonaro)
##      xmin      ymin      xmax      ymax 
## 147.68741 -37.50503 150.23068 -34.53558

Normally when we print the edenmonaro object the output would display all the additional information (dimension, bounding box, geodetic datum etc) but for the remainder of this section I will show only the relevant lines of the output. In this case edenmonaro is defined by a MULTIPOLYGON geometry containing one feature:

edenmonaro 
## Geometry set for 1 feature 
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 147.6874 ymin: -37.50503 xmax: 150.2307 ymax: -34.53558
## epsg (SRID):    4283
## proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## MULTIPOLYGON (((150.2307 -36.24468, 150.2287 -3...

However, we can “cast” the MULTIPOLYGON into the two distinct POLYGON geometries from which it is constructed using st_cast():

st_cast(edenmonaro, "POLYGON")
## Geometry set for 2 features 
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 147.6874 ymin: -37.50503 xmax: 150.2307 ymax: -34.53558
## epsg (SRID):    4283
## proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## POLYGON ((150.2307 -36.24468, 150.2287 -36.2490...
## POLYGON ((148.1345 -36.74374, 148.1366 -36.7393...

To illustrate when this might be useful, consider the Dawson electorate, which consists of 69 islands in addition to a coastal region on the Australian mainland.

dawson <- oz_votes %>% 
  filter(NAME == "Dawson") %>% 
  pull(geometry)
dawson
## Geometry set for 1 feature 
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 146.761 ymin: -21.21307 xmax: 149.9114 ymax: -19.18582
## epsg (SRID):    4283
## proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## MULTIPOLYGON (((147.8981 -19.85285, 147.899 -19...
ggplot(dawson) + 
  geom_sf() +
  coord_sf() + 
  theme_slum("sky")

Suppose, however, our interest is only in mapping the islands. If so, we can first use the st_cast() function to break the Dawson electorate into the constituent polygons, and then use st_area() to calculate the area of each polygon:

dawson <- st_cast(dawson, "POLYGON")
st_area(dawson)
## Units: [m^2]
##  [1] 1.909892e+05 5.882283e+05 2.635586e+07 1.758654e+05 1.129020e+06
##  [6] 3.003666e+05 2.092691e+05 4.577543e+05 1.466749e+05 9.125599e+05
## [11] 1.173029e+06 1.212640e+06 2.503836e+05 2.810581e+05 7.973221e+05
## [16] 1.749108e+05 3.229309e+05 2.677653e+06 3.489708e+06 2.344368e+05
## [21] 1.894026e+05 4.841221e+05 2.024420e+06 4.763952e+05 1.895080e+05
## [26] 6.616333e+06 5.354126e+07 4.550664e+06 1.541931e+07 6.376682e+06
## [31] 3.970536e+06 8.190446e+05 6.879306e+05 1.360343e+05 2.928053e+06
## [36] 6.173295e+05 3.807525e+05 4.198479e+05 2.696862e+06 2.387335e+06
## [41] 7.664962e+05 1.057469e+08 1.030827e+07 7.595186e+06 4.434614e+06
## [46] 4.437562e+05 1.964999e+05 2.147435e+06 6.837429e+05 1.535143e+06
## [51] 8.109296e+05 2.113259e+06 1.899466e+05 1.409044e+05 1.092670e+07
## [56] 1.466357e+06 1.691391e+06 5.323031e+06 4.408680e+06 1.028809e+07
## [61] 6.486412e+06 3.921074e+05 1.088608e+06 3.998037e+05 3.434547e+05
## [66] 4.227004e+05 4.173320e+06 1.084524e+05 1.106814e+10 3.416470e+07

The large mainland region corresponds to the 69th polygon within Dawson. Armed with this knowledge, we can draw a map showing only the islands:

ggplot(dawson[-69]) + 
  geom_sf() + 
  coord_sf() + 
  theme_slum("sky")

I find this kind of neat in and of itself, but in a real life mapping application you might want go further and fill each island differently as a way of conveying information differently, providing annotations etc. This is a lot easier to do in the unpacked format, because each island is now its own object in the data.

Intermission

At this point I’ve reached the end of where I’m up to in the ggplot2 pull request. There’s a lot of different directions the book might go from there, and the direction the book goes could depend on all sorts of things. Among other things, I’m not sure that the ggplot2 book is necessarily the best place to launch into a detailed description of simple features and the sf package – those topics aren’t really about ggplot2 or its associated ecosystem, so I haven’t added anything like that to the pull request I sent to Hadley.

That being said… I came to this chapter as a total novice, with only a bare minimum knowledge about mapping, and I can’t help but notice that I really needed a basic tutorial on these topics to be able to get a “feel” for what I was doing. Maybe other people don’t find this sort of thing necessary, but I did. So, for those of you who think like me, I’ll take a deeper dive into what “simple features” are, and how they are supported by the sf package…




Made off
Don’t stray
My kind’s your kind
I’ll stay the same
Pack up
Don’t stray
    - Maps, by Yeah Yeah Yeahs

Part 2: What is a “simple features” object anyway?

Spatial geometries and the “well known” text

To get a good feel for what “simple features” objects actually are, let’s have a go at building these objects manually. The key idea in simple features is that every object is represented by a “geometry”, and there are a number of different types of geometries. The simplest kind of geometry is a POINT, which is defined by co-ordinate values for a single dot. This can be defined in 2D, 3D or 4D but I’ll keep it simple and use two dimensions! To define a point formally, we can use the st_point() function from the sf package:

point <- c(0, 0)
st_point(x = point)
## POINT (0 0)

The format of the text here isn’t arbitrary. It adheres to the “well known text” (WKT) specification that is typically used to describe simple features objects. The name of the geometry POINT appears first, followed by the data, which in this case are just two numbers in parentheses (0 0)

We can build other kinds of geometries by passing a matrix as the data source. For instance, if we want our geometry to be a set of unconnected points, we could use the MULTIPOINT geometry…

points <- cbind(
  x = c(0,  0, 10), 
  y = c(0, 10, 10)
)
st_multipoint(points)
## MULTIPOINT (0 0, 0 10, 10 10)

Alternatively, if we wanted those points to be connected into a line we would use the LINESTRING geometry:

st_linestring(points)
## LINESTRING (0 0, 0 10, 10 10)

Again notice the “well known text” format used to print the output. In both cases the data part of the string takes the form of a comma separated list of coordinates enclosed in parentheses, (0 0, 0 10, 10 10). Before the data string, however, is the text LINESTRING or MULTIPOINT which tells you what kind of geometry is involved. To illustrate – because honestly I need lots of pictures to stay sane when doing anything with sf – compare the following output:

points %>% st_multipoint() %>% plot(col = "white", lwd = 2)

points %>% st_linestring() %>% plot(col = "white", lwd = 2)

So far, so good. Now suppose we want to connect the points into a POLYGON, so that we end up drawing a triangle. The st_polygon() function will do that for us, but we need to do a little more work: polygons are deceptively complicated entities. For starters, the simple features approach is picky and expects you to close the polygon yourself, so we actually need four points to define our triangle, where the final point is identical to the first one:

points <- cbind(x = c(0, 0, 10, 0), y = c(0, 10, 10, 0))
st_polygon(list(points))
## POLYGON ((0 0, 0 10, 10 10, 0 0))

That makes some sense, but there are two rather counterintuitive things about this code and output. First, notice that I wrapped the points matrix inside a list. On first inspection that seems weird and unnecessary. Second, notice that while the geometry string is POLYGON as you might expect, the data string contains two parentheses ((0 0, 0 10, 10 10, 0 0)). It turns out that these two things are necessary and they are related to one another, which I’ll explain in a moment.

But first, I feel the urge to draw a picture again, so let’s check that this strange creation does in fact correspond to the triangle we’re expecting:

list(points) %>% st_polygon() %>% plot(col = "white", lwd = 2)

That all looks good, so let’s go back a step. The reason why I had to enclose the points matrix in a list, and why the resulting WKT string used two parentheses is that polygons can have holes, and those holes themselves have to be defined by a distinct set of points. So let’s do that quickly:

hole <- cbind(
  x = c(3, 3, 6, 3), 
  y = c(4, 7, 7, 4)
)

poly <- list(points, hole)
st_polygon(poly)
## POLYGON ((0 0, 0 10, 10 10, 0 0), (3 4, 3 7, 6 7, 3 4))

Now it makes a little more sense! The general form of a polygon object in the WKT specification is POLYGON ((outer), (hole1), (hole2), etc) where outer, hole1 etc are comma separated lists of points. In R, the corresponding representation treats outer, hole1 etc as matrices, and the polygon as a whole is a list of these matrices.

poly %>% st_polygon() %>%  plot(col = "white", lwd = 2)

Yay! We’re making progress!

At each step in this process, we’re making our geometries a little more complex. A POINT geometry is a single dot, and a MULTIPOINT geometry is a collection of these dots. A LINESTRING geometry adds connectivity between the points, and a POLYGON adds closure, fills, and holes. The simple features standard lets us extend this idea further. In the same way that a POINT is extended by MULTIPOINT, the LINESTRING and POLYGON geometries are extended by the MULTILINESTRING and MULTIPOLYGON geometries respectively.

To start with, let’s see what a MULTILINESTRING looks like. A single line is defined by a matrix, and as a consequence collection of lines is represented by a list of matrices:

line1 <- cbind(
  x = c(4, 5, 6),
  y = c(2, 1, 2)
)
line2 <- line1 + 2

multiline <- st_multilinestring(
  list(line1, line2)
)
multiline
## MULTILINESTRING ((4 2, 5 1, 6 2), (6 4, 7 3, 8 4))

The WKT string displayed here encloses each line within parentheses, and the collection of lines is also enclosed within parentheses. So, just ,like we saw with POLYGON objects, a MULTILINESTRING geometry is specified with two “layers” of parentheses. To see what we have created:

plot(multiline, col = "white", lwd = 2)

Okay, that looks about right, so let’s keep moving…

Let’s say we have these three matrices in R, and our goal is to use outer1 and hole1 to specify a POLYGON that contains a single hole, but to have outer2 correspond to a different POLYGON, and then bind these together into a single MULTIPOLYGON object:

outer1
##       x  y
## [1,]  0  0
## [2,]  0 10
## [3,] 10 10
## [4,]  0  0
hole1
##      x y
## [1,] 3 4
## [2,] 3 7
## [3,] 6 7
## [4,] 3 4
outer2
##       x  y
## [1,]  5 11
## [2,]  5 13
## [3,] 10 13
## [4,]  5 11

Here’s how we do that:

poly1 <- list(outer1, hole1)
poly2 <- list(outer2)

multipolygon <- st_multipolygon(
  list(poly1, poly2)
)

multipolygon
## MULTIPOLYGON (((0 0, 0 10, 10 10, 0 0), (3 4, 3 7, 6 7, 3 4)), ((5 11, 5 13, 10 13, 5 11)))

The output is starting to look a little less pleasant, but when we look closely it does make some sense! In our previous example, each POLYGON object required two layers of parentheses in order to represent holes unambiguously, so as a consequence the MULTIPOLYGON geometry requires a third layer of nesting, in order to capture multiple polygons within the same geometry. As a consequence our WKT string now begins with MULTIPOLYGON (((, which intimidated the hell out of me the first time I encountered it but feels quite natural to me now that I understand why I’m seeing so many parentheses.

Anyway, let’s take a look at the object we just created:

multipolygon %>%
  st_multipolygon() %>%
  plot(col = "white", lwd = 2)

Yep, looks like a pair of polygons to me!

At this point we have this as our set of geometries. First there are our three “basic” types:

  • POINT
  • LINESTRING
  • POLYGON

Then there are three geometries that combine geometries of the same type:

  • MULTIPOINT
  • MULTILINESTRING
  • MULTIPOLYGON

This isn’t an exhaustive listing by any means – because of course it’s not, nothing about geospatial is ever simple – and it turns out that there are a total of 17 different geometries that comprise the simple features standard. In practice however, the vast majority of applications only rely on seven of the 17 geometries. Of those seven, I’ve already described six. The last one that you’d be likely to encounter a lot is the GEOMETRYCOLLECTION, which allows you to combine geometries of different types.

st_geometrycollection(
  list(multipolygon, multiline)
)
## GEOMETRYCOLLECTION (MULTIPOLYGON (((0 0, 0 10, 10 10, 0 0), (3 4, 3 7, 6 7, 3 4)), ((5 11, 5 13, 10 13, 5 11))), MULTILINESTRING ((4 2, 5 1, 6 2), (6 4, 7 3, 8 4)))

This looks slightly different to the other WKT strings we’ve seen previously. It has to, because of the very fact that a geometry can contain other geometries of different kinds. So now the WKT string begins with GEOMETRY( but then each of the separate geometries is the complete WKT string for that constituent part (e.g., MULTIPOLYGON (((, etc.). In total there are now four layers of nesting! Let’s take a look at the chimera we have created:

list(multipolygon, multiline) %>%
  st_geometrycollection() %>%
  plot(col = "white", lwd = 2)

Aw, she’s cute. I’ll call her “flappy”

What madness have we wrought?

Up to this point we’ve been using functions from the sf package to convert base R classes (vectors, matrices and lists) into different kinds of simple features geometries. We’ve been printing these objects out (generating the WKT string) and plotting these objects (generating the pretty pictures) so it’s probably useful to know what these objects are. So, flappy, what are you?

flappy <- list(multipolygon, multiline) %>%
  st_geometrycollection()

class(flappy)
## [1] "XY"                 "GEOMETRYCOLLECTION" "sfg"

Remember how I said that everything about geospatial is hard? Well this is no exception. Our flappy variable has three S3 classes: it is an XY object because the co-ordinates are defined in two dimensional space, but it is also a GEOMETRYCOLLECTION object indicating the kind of simple features geometry that flappy is made from. Finally, it is an object of class sfg, which – shockingly – indicates that flappy is in fact a simple features geometry. Okay, yeah, that does make some sense.

From geometries to sf objects

We’re getting much closer to the kind of data that you might encounter in real life. For instance, in the ozdata examples above, an Australian state is represented as a single MULTIPOLYGON object, and the nation as a whole is a column of such objects. To do that with flappy, we use the st_sfc() function:

geometry <- st_sfc(flappy - 20, flappy + 10, flappy + 40)
geometry
## Geometry set for 3 features 
## geometry type:  GEOMETRYCOLLECTION
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 50 ymax: 53
## epsg (SRID):    NA
## proj4string:    NA
## GEOMETRYCOLLECTION (MULTIPOLYGON (((-20 -20, -2...
## GEOMETRYCOLLECTION (MULTIPOLYGON (((10 10, 10 2...
## GEOMETRYCOLLECTION (MULTIPOLYGON (((40 40, 40 5...

Now we have an object that, when printed, looks rather similar to some of the output we saw earlier. Our geometry variable is an sfc object, to which we could attach a coordinate reference system if we wanted:

class(geometry)
## [1] "sfc_GEOMETRYCOLLECTION" "sfc"

In fact, it’s probably a good idea to do exactly this. As it stands, the geometry object has no units attached to it. How far is 20? Where is it? At the moment we don’t know. For simplicity, let’s assume the arbitrary numbers I’ve been typing in for flappy are in fact longitude and latitude values. To express this idea, we set the cordinate reference system using st_crs() and by doing so we attach flappy to the surface of the Earth…

st_crs(geometry) <- st_crs("+proj=longlat +datum=WGS84") 
geometry
## Geometry set for 3 features 
## geometry type:  GEOMETRYCOLLECTION
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 50 ymax: 53
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
## GEOMETRYCOLLECTION (MULTIPOLYGON (((-20 -20, -2...
## GEOMETRYCOLLECTION (MULTIPOLYGON (((10 10, 10 2...
## GEOMETRYCOLLECTION (MULTIPOLYGON (((40 40, 40 5...

The final step in creating an sf object is to turn it into something like a data frame, in which the geometry variable we’ve just built is one of the columns and other attributes are represented by different columns:

flappy_sf <- st_sf(
  species = c("flappy", "flappy", "flappy"),
  name = c("Alice", "Betty", "Carla"),
  geometry = geometry
)

flappy_sf
## Simple feature collection with 3 features and 2 fields
## geometry type:  GEOMETRYCOLLECTION
## dimension:      XY
## bbox:           xmin: -20 ymin: -20 xmax: 50 ymax: 53
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   species  name                       geometry
## 1  flappy Alice GEOMETRYCOLLECTION (MULTIPO...
## 2  flappy Betty GEOMETRYCOLLECTION (MULTIPO...
## 3  flappy Carla GEOMETRYCOLLECTION (MULTIPO...

At this point, we can create a “real” map from flappy…

ggplot(flappy_sf, aes(fill = name)) + 
  geom_sf() + 
  coord_sf() + 
  theme_slum("sky")

That’s a lot of work to create one very silly map, but it is a real map, and we can work with it using the exact same tool set that I discussed at the start of the blog post…

Epilogue: Some maps are ridiculous

… I wouldn’t recommend it though

ggplot(flappy_sf, aes(fill = name)) + 
  geom_sf() + 
  geom_sf(
    data = oz_states, 
    show.legend = FALSE, 
    fill = "white"
  ) +
  coord_sf(crs = st_crs(crs)) + 
  theme_slum("sky")


  1. Tibet never appeared as a distinct unit in any of the sources, which is sad in itself but I’ll admit to a certain shameful relief that I didn’t have to make a decision.

  2. The use of theme_slum(), of course, has nothing to do with the map itself – I just wanted the output to match the colour scheme of this blog post!