Image credit: Alexandra Badie, photo by Fabiana Badie, via Mermaid Society

Day 63-66: Learning to skim

The pace of these posts is definitely slowing! I had this one half-written two days ago, but then life got in the way. Partly prompted by external events, I took a day out of my schedule to organise some of my notes on transgender topics into a separate blog. I kind of resent having to do it, but there are so few transgender folks with tenured quantitative social science appointments that if we don’t make some of our notes on the literature public, unqualified fools are allowed to write absurd articles about us in the popular press. Sigh.

Then there was this whole thing where I had to make a tiny reorganisation to my school holiday plans, with the usual cascade of everything-depends-on-everything-else that makes a working parent’s life a scheduling nightmare:

But everything is “under control” now, at least for a very expansive definition of “control”. The kids are keeping themselves occupied, I’ve responded to some overdue emails, and I’m making inroads into the Saturday morning laundry mountain.

So… about that long-overdue R post…

Free association: Skim, board, gender

I seem to be the last person to the party on the skimr package. For folks like myself who have been living under a rock, it’s a tidyverse style package developed by some awesome rOpenSci people, intended to provide tools for summarising (or “skimming”) data sets in a quick and simple way. A lot of thought has gone into it - not only is it easy to use, it has very sensible default behaviour and is flexible and not difficulty to modify. It is fun!

In order to write a blog post, I wanted to find a data set to skim and some pretty pictures that I could include. Thankfully, it not take a lot of google skill to find something worthy. In the first instance, I was completely amazed to discover that skimming, or skimboarding is totally a thing in the surfing world and has been since the 1920s. The photo in the header is of Alexandra Badie and taken by Fabiana Badie, whose instagram feed is magnificent:

(She also sell prints at lagunasocal.com and given that I’m shamelessly linking to her gorgeous intagram feed here to make my boring stats post look prettier, I feel morally obligated to link to her shop too! 💸)

Next, I figured that since I’m already feeling grumpy about the low quality of public discussion of transgender topics, and I’ve segued from the skimr package to skimboarding, I thought I’d look at gender from someone else’s perspective and look at some data on the current numbers of cisgender women on corporate boards in Australia. I’ll confess this doesn’t come out of nowhere either. I follow the women in machine learning and data science twitter account (website), and as a result saw this retweeted onto my feed:

That’s a bit rubbish. Well, more than a bit actually 😡. So I became curious about the situation in Australia, and came across this report by the Australian Institute of Company Directors that reports on the number of women sitting on boards for all ASX200 companies. As usual, of course, the report only considers cisgender people and does not mention the existence of transgender and gender non-conforming people 😞. On the other hand, I think we can safely assume that the number of transgender people on the board is 0 for all 200 companies. With that minor annoyance dealt with, let’s take a look at the data.

Why is there no CSV file?

The report is a pdf document that contains all the raw data as a table that runs from pages 6 to 10; there does not seem to be any other publicly accessible version of the data. Even more annoyingly, there’s other text on some of those pages that doesn’t form part of the table. Sigh. Okay, pdf scraping time.

I’ll happily admit that I don’t really know all that much about how to work with pdf files, but thankfully the internet again came to my rescue: I found this medium post by Charles Bordet describing the use of pdftools to solve this problem. Thankfully, pulling all the text from the document is a one line command:

report_text <- pdftools::pdf_text("/path/to/location/only/I/have")

The pdf_text function returns a list with one element per page. Here’s the text from page 6:

report_text[[6]]
## [1] "6 GENDER DIVERSITY QUARTERLY REPORT – VOLUME 12\b\n  The full list of ASX 200 companies with the number of women on their boards is listed below. Individual chairs with an\n  asterisk next to their name are members of the 30% Club and have committed to achieving at least 30 per cent females on\n  their boards by 2018 or as soon as they can.\n                                                                                         No. of Female       % of Female\n                       ASX 200 Company                               Chair\n                                                                                           Directors          Directors\n   Fortescue Metals Group Ltd                       Andrew Forrest*                            5               55.6%\n   Medibank Private Limited                         Elizabeth Alexander                        5               55.6%\n   Altium Limited                                   Samuel Weiss*                              3               50.0%\n   Bapcor Limited                                   Andrew Harrison                            2               50.0%\n   Boral Limited                                    James Clark                                4               50.0%\n   MetCash Limited                                  Robert Murray                              4               50.0%\n   Mirvac Limited                                   John Mulcahy*                              4               50.0%\n   NIB Holdings Ltd                                 Steven Crane*                              4               50.0%\n   Nine Entertainment Co. Holdings Limited          Peter Costello                             3               50.0%\n   SEEK Limited                                     Neil Chatfield*                            3               50.0%\n   Spark New Zealand Limited                        Justine Smyth                              4               50.0%\n   Woolworths Group Ltd                             Gordon Cairns*                             4               50.0%\n   Unibail-Rodamco-Westfield                        Colin Dyer                                 5               45.5%\n   Coca-Cola Amatil Limited                         Ilana Atlas*                               4               44.4%\n   G8 Education Limited                             Mark Johnson*                              3               42.9%\n   GPT Group                                        Vickki McFadden                            3               42.9%\n   Incitec Pivot Limited                            Paul Brasher*                              3               42.9%\n   Navitas Limited                                  Tracey Horton*                             3               42.9%\n   Super Retail Group Limited                       Sally Pitkin*                              3               42.9%\n   Asaleo Care Limited                              Harry Boon*                                2               40.0%\n   Commonwealth Bank of Australia                   Catherine Livingstone*                     4               40.0%\n   Computershare Limited                            Simon Jones*                               4               40.0%\n   Insurance Australia Group Limited                Elizabeth Bryan                            4               40.0%\n   OZ Minerals Limited                              Rebecca McGrath*                           2               40.0%\n   Pendal Group Limited                             James Evans                                2               40.0%\n   Retail Food Group Limited                        Colin Archer                               2               40.0%\n   Trade Me Group Ltd                               David Kirk*                                2               40.0%\n   AGL Energy Limited                               Graeme Hunt*                               3                37.5%\n   APA Group                                        Michael Fraser                             3                37.5%\n   Aristocrat Leisure Limited                       Ian Blackburne                             3                37.5%\n   Caltex Australia Limited                         Steven Gregg*                              3                37.5%\n   Dexus Property Group                             Wallace Sheppard*                          3                37.5%\n   DuluxGroup Limited                               Peter Kirby                                3                37.5%\n   IRESS Limited                                    Anthony D'Aloisio                          3                37.5%\n   Link Administration Holdings Pty Limited         Michael Carapiet                           3                37.5%\n   Orica Limited                                    Malcolm Broomhead                          3                37.5%\n   Platinum Asset Management Ltd                    Michael Cole                               3                37.5%\n   Scentre Group Limited                            Brian Schwartz*                            3                37.5%\n   Stockland Corporation Ltd                        Thomas Pockett*                            3                37.5%\n   Suncorp Group Limited                            Zygmunt Switkowski*                        3                37.5%\n   WorleyParsons Limited                            John Grill                                 3                37.5%\n   Xero Limited Npv                                 Graham Smith                               3                37.5%\n"

Some extremely ugly code to strip out the irrelevant text and construct a list with one element per company containing a character vector with the data for that company…

boards_raw <- report_text[6:10] %>%
  unlist %>%
  gsub("^.* Directors\\n","", .) %>%
  paste(collapse="\n") %>%
  gsub("\\* Members.*$","", .) %>%
  str_split("\n") %>%
  unlist %>%
  str_trim %>%
  (function(x){x[nchar(x)>0]}) %>%
  str_replace_all(" 0.0%", " 0       0.0%") %>%
  str_replace_all("   *"," ####") %>%
  str_split("####") %>%
  map(.f = str_trim)

Some hacks to fix the three cases that fell through the gaps…

boards_raw[[61]] %<>% str_split("Limited") %>% unlist
boards_raw[[111]] %<>% str_split("Limited") %>% unlist
boards_raw[[162]] %<>% str_split("Ltd") %>% unlist

Tidy it up a bit and compute the relevant variables…

boards  <- boards_raw %>% as.data.frame %>% as.matrix %>% t %>% as.tibble
colnames(boards) <- c("company","chair","n_women","p_women")

boards$n_women %<>% as.numeric()
boards$p_women %<>% str_remove("%") %>% as.numeric() %>% (function(x){x/100})
boards$n <- round(boards$n_women / boards$p_women)
boards$n_men <- boards$n - boards$n_women

Okay, a data set! 🎉

To be honest I’m pretty unhappy with how ugly this code is, and somehow I’ve ended up losing one of the companies, but I don’t care. This is a blog post and I refuse to take the time to do it properly when there’s laundry to be done. 😀

Skimming the boards data

What do we have?

boards
## # A tibble: 199 x 6
##    company                      chair          n_women p_women     n n_men
##    <chr>                        <chr>            <dbl>   <dbl> <dbl> <dbl>
##  1 Fortescue Metals Group Ltd   Andrew Forres…       5   0.556     9     4
##  2 Medibank Private Limited     Elizabeth Ale…       5   0.556     9     4
##  3 Altium Limited               Samuel Weiss*        3   0.5       6     3
##  4 Bapcor Limited               Andrew Harris…       2   0.5       4     2
##  5 Boral Limited                James Clark          4   0.5       8     4
##  6 MetCash Limited              Robert Murray        4   0.5       8     4
##  7 Mirvac Limited               John Mulcahy*        4   0.5       8     4
##  8 NIB Holdings Ltd             Steven Crane*        4   0.5       8     4
##  9 Nine Entertainment Co. Hold… Peter Costello       3   0.5       6     3
## 10 SEEK Limited                 Neil Chatfiel…       3   0.5       6     3
## # ... with 189 more rows

The skim function provides a really nice summary. It separates variables in the tibble by type, and provides a sensible set of descriptive statistics for each. The thing I really like is that in addition to including means, standard deviations and quantiles for numeric variables, it also plots a miniature histogram.

skim(boards)
## Skim summary statistics
##  n obs: 199 
##  n variables: 6 
## 
## ── Variable type:character ──────────────────────────────────────────────────────────────────
##  variable missing complete   n min max empty n_unique
##     chair       0      199 199   4  26     0      181
##   company       0      199 199   8  48     0      199
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd p0  p25  p50  p75  p100     hist
##         n       5      194 199 7.52 1.7   4 6    7    9    12    ▅▇▇▇▆▃▁▁
##     n_men       5      194 199 5.4  1.38  2 4    5    6     9    ▁▂▆▇▇▃▂▁
##   n_women       0      199 199 2.07 1.02  0 1    2    3     5    ▁▇▁▇▆▁▂▁
##   p_women       0      199 199 0.27 0.11  0 0.17 0.29 0.33  0.56 ▁▂▇▅▇▃▁▂

I really like this. The summaries are succinct and easy to follow, so I can very quickly pull out a few things:

  • On average, an ASX200 board is about 27% women
  • The average number of women on a board is 2.1
  • The average number of men on a board is 5.4
  • The minimum number of men on a board is 2 (max 9)
  • The maximum number of women on a board is 5 (not surprisingly, min 0)
  • The size of a board ranges from 4 to 12.

The histograms are really nice, but definitely need to be put in context. They’re scaled for each variable separately so the histograms for n_women and n_men aren’t on the same scale (the function isn’t that smart). Similarly, the binning function has treated these as numeric variables, though that’s entirely my fault for not coding them as integer variables. If I do that I get this:

boards$n %<>% as.integer
boards$n_women %<>% as.integer
boards$n_men %<>% as.integer
skim(boards)
## Skim summary statistics
##  n obs: 199 
##  n variables: 6 
## 
## ── Variable type:character ──────────────────────────────────────────────────────────────────
##  variable missing complete   n min max empty n_unique
##     chair       0      199 199   4  26     0      181
##   company       0      199 199   8  48     0      199
## 
## ── Variable type:integer ────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd p0 p25 p50 p75 p100     hist
##         n       5      194 199 7.52 1.7   4   6   7   9   12 ▅▇▇▇▆▃▁▁
##     n_men       5      194 199 5.4  1.38  2   4   5   6    9 ▁▂▆▇▇▃▂▁
##   n_women       0      199 199 2.07 1.02  0   1   2   3    5 ▁▇▁▇▆▁▂▁
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd p0  p25  p50  p75 p100     hist
##   p_women       0      199 199 0.27 0.11  0 0.17 0.29 0.33 0.56 ▁▂▇▅▇▃▁▂

That separates the discrete valued variables from the continuous ones nicely. There is still some weirdness in the hisograms for the integer cases, though. If I tabulate the counts variables I get this:

table(boards$n_women)
## 
##  0  1  2  3  4  5 
##  5 60 69 50 12  3
table(boards$n_men)
## 
##  2  3  4  5  6  7  8  9 
##  1 10 43 55 46 24 11  4

What you can see (besides sexism) is the fact that the histogram-drawing code in the skim output hasn’t necessarily chosen the best bins for integer-valued data. That’s not a criticism of the skim function of course - it’s really quite remarkable to me that it does as well as it does - just a caution that the skim output isn’t a substitute for doing a more detailed exploration of the data!

In case you’re interested, here’s a simple jittered scatterplot to give a sense of how extreme the discrepancy is:

scatter <- boards %>%
  ggplot(aes(x = n_men, y = n_women)) +
  geom_jitter(width = .3, height = .3) +
  xlab("Number of Men") +
  ylab("Number of Women") +
  xlim(0,10) + ylim(0,10) +
  geom_abline(slope = 1, intercept = 0) +
  coord_equal()
plot(scatter)
## Warning: Removed 5 rows containing missing values (geom_point).

In a not-rubbish timeline, those dots should be scattered around the diagonal line. Suffice it to say, they are not. From eyeballing the figure, the most common board configuration appears to be 5 men and 1 woman. It’s easy enough to confirm. Here are the 11 most common gender compositions among corporate boards in ASX200 companies. I chose 11 because it’s only by the time you look at the 11th entry in the table you find a board composition with at least as many women as men 🙄

map2_chr(
  .x = boards$n_men, 
  .y = boards$n_women, 
  .f = function(x,y){paste0(x,"M+",y,"F")}
  ) %>% 
  table %>% 
  sort(decreasing = TRUE) %>%
  head(11)
## .
## 5M+1F 5M+2F 4M+2F 6M+2F 5M+3F 6M+1F 4M+1F 6M+3F 7M+3F 7M+2F 4M+4F 
##    21    18    17    16    15    14    13    12    10     9     6

Blegh. Here is an awesome skimboarding picture to cleanse the palate:

Understanding the package

The functionality of the skim function is richer than it looks. You can play around with the functions involved (“skimmers”) and modify it to your tastes. For instance, if I want to see what skimmers are used to describe numeric variables, I can use the show_skimmers function:

show_skimmers("numeric")
## $numeric
##  [1] "missing"  "complete" "n"        "mean"     "sd"       "p0"      
##  [7] "p25"      "p50"      "p75"      "p100"     "hist"

For a more detailed view, I can use get_skimmers. Instead of returning the names of the skimmer functions, it returns a list of functions.

I can use the skim_with function to alter the set of skimmers used. Setting a particular function to NULL removes it. So if I decided I don’t want to see the quantiles, I could do this:

skim_with(
  numeric = list(p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL),
  integer = list(p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL)  
)
skim(boards)
## Skim summary statistics
##  n obs: 199 
##  n variables: 6 
## 
## ── Variable type:character ──────────────────────────────────────────────────────────────────
##  variable missing complete   n min max empty n_unique
##     chair       0      199 199   4  26     0      181
##   company       0      199 199   8  48     0      199
## 
## ── Variable type:integer ────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd     hist
##         n       5      194 199 7.52 1.7  ▅▇▇▇▆▃▁▁
##     n_men       5      194 199 5.4  1.38 ▁▂▆▇▇▃▂▁
##   n_women       0      199 199 2.07 1.02 ▁▇▁▇▆▁▂▁
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd     hist
##   p_women       0      199 199 0.27 0.11 ▁▂▇▅▇▃▁▂

If I want to add a skimmer for skewness, I could use the skew function in the psych package:

skim_with(
  numeric = list(skew = psych::skew),
  integer = list(skew = psych::skew)
)
skim(boards)
## Skim summary statistics
##  n obs: 199 
##  n variables: 6 
## 
## ── Variable type:character ──────────────────────────────────────────────────────────────────
##  variable missing complete   n min max empty n_unique
##     chair       0      199 199   4  26     0      181
##   company       0      199 199   8  48     0      199
## 
## ── Variable type:integer ────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd     hist skew
##         n       5      194 199 7.52 1.7  ▅▇▇▇▆▃▁▁ 0.32
##     n_men       5      194 199 5.4  1.38 ▁▂▆▇▇▃▂▁ 0.38
##   n_women       0      199 199 2.07 1.02 ▁▇▁▇▆▁▂▁ 0.42
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────
##  variable missing complete   n mean   sd     hist skew
##   p_women       0      199 199 0.27 0.11 ▁▂▇▅▇▃▁▂ 0.12

How the histograms are made

One curiosity I had was how the histograms are being drawn. The skimmer function in specified in the package is inline_hist:

skimr::inline_hist
## function (x) 
## {
##     if (any(is.infinite(x))) {
##         x[is.infinite(x)] <- NA
##         warning("Variable contains Inf or -Inf value(s) that were converted to NA.")
##     }
##     if (length(x) < 1 || all(is.na(x))) {
##         return(structure(" ", class = c("spark", "character")))
##     }
##     if (all(x == 0, na.rm = TRUE)) 
##         x <- x + 1
##     hist_dt <- table(cut(x, options$formats$character$width))
##     hist_dt <- hist_dt/max(hist_dt)
##     structure(spark_bar(hist_dt), class = c("spark", "character"))
## }
## <bytecode: 0x7fbd79f6d748>
## <environment: namespace:skimr>

Okay, so the work is being done by spark_bar, which is not an exported function. However, that’s not going to stop me from taking a peek:

skimr:::spark_bar
## function (x, safe = TRUE) 
## {
##     stopifnot(is.numeric(x))
##     bars <- vapply(9601:9608, intToUtf8, character(1))
##     if (safe) {
##         bars <- bars[-c(4, 8)]
##     }
##     factor <- cut(x, breaks = seq(0, 1, length.out = length(bars) + 
##         1), labels = bars, include.lowest = TRUE)
##     chars <- as.character(factor)
##     chars[is.na(chars)] <- bars[length(bars)]
##     structure(paste0(chars, collapse = ""), class = "spark")
## }
## <bytecode: 0x7fbd793cd900>
## <environment: namespace:skimr>

Ah, clever. It’s using these Unicode characters:

vapply(9601:9608, intToUtf8, character(1))
## [1] "▁" "▂" "▃" "▄" "▅" "▆" "▇" "█"

Now I really want to write a skimmer that summarises data using emoji.

What does skim return?

Okay, I’m running out of time before the kids trip to the library, so I’d better wrap this up. The skim function returns an object of class skim_df that is essentially a tibble:

boards_skim <- skim(boards)
class(boards_skim)
## [1] "skim_df"    "tbl_df"     "tbl"        "data.frame"

Okay, so what does the underlying tibble look like? Let’s bypass the skim_df print method and print the object as a tibble:

boards_skim %>% (tibble:::print.tbl)
## # A tibble: 42 x 6
##    variable type      stat     level value formatted
##  * <chr>    <chr>     <chr>    <chr> <dbl> <chr>    
##  1 company  character missing  .all      0 0        
##  2 company  character complete .all    199 199      
##  3 company  character n        .all    199 199      
##  4 company  character min      .all      8 8        
##  5 company  character max      .all     48 48       
##  6 company  character empty    .all      0 0        
##  7 company  character n_unique .all    199 199      
##  8 chair    character missing  .all      0 0        
##  9 chair    character complete .all    199 199      
## 10 chair    character n        .all    199 199      
## # ... with 32 more rows

This has a nice property. It allows you to include skim as part of a pipe:

boards %>% skim %>% filter(stat=="hist")
## # A tibble: 4 x 6
##   variable type    stat  level value formatted
##   <chr>    <chr>   <chr> <chr> <dbl> <chr>    
## 1 n_women  integer hist  .all     NA ▁▇▁▇▆▁▂▁ 
## 2 p_women  numeric hist  .all     NA ▁▂▇▅▇▃▁▂ 
## 3 n        integer hist  .all     NA ▅▇▇▇▆▃▁▁ 
## 4 n_men    integer hist  .all     NA ▁▂▆▇▇▃▂▁

So that’s pretty handy too! Woohoo! I really like this package

Related