Days 16-17: Purrr-fection in film: The Princess Bride

Since the invention of the film, there have been five films that were rated the most passionate, the most pure. The Princess Bride left them all behind.

I have been hesitant to tackle the purrr package in these posts. Functional programming is a thing I admire and cherish, but it’s also an art form in which I am - at best - an apprentice. Yes, it’s true that from my early days as a Matlab programmer I understood the value of vectorising one’s code, and this carried over into R. I’ve been using lapply, sapply and apply from the beginning, but these aren’t the easiest tools to work with, and arguably they don’t really encourage you to think cleanly about the mapping function that you’re defining whenever you “vectorise” a loop. Making the jump to thinking about my code in truly functional terms is a little scary, but I will admit that the purrr package has really helped.

So much so that I started using it without even intending to.

Honestly, all I wanted to do this weekend was find a way to parse the script to the greatest film in history – which is of course The Princess Bride and I will fight to the death over this – because if R is going to have the wonderful janeaustenr package it would be an offence against human dignity if princessbridr didn’t eventually end up on CRAN.

While I absolutely concede that I’m not worthy of the task, I thought I’d make a start. A few seconds of googling uncovered this page, which shows the 1987 screenplay in traditional screenplay formatting.

My initial goals: parse this page, extract a data frame that captures the structure of the script, and allow export of the data into a variety of formats, including the original screenplay structure. Okay, as you wish…


Putting all distractions aside, this seems simple enough. I take a quick look at the source code to the HTML page, and it turns out that the script itself is wrapped inside a pre tag, so my first job is to strip out everything that isn’t related to the script:

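library(magrittr)  # %>% and %<>%
library(stringr)   # str_detect(), str_replace(), str_pad(), ...
library(purrr)     # map(), map2(), imap(), reduce()
library(glue)      # glue(), glue_collapse()

script <- readLines("", warn = FALSE)

# line numbers of the opening and closing pre tags
start <- script %>% str_detect(pattern = fixed("<pre>")) %>% which
stop <- script %>% str_detect(pattern = fixed("</pre>")) %>% which

# keep only the lines strictly between the two tags
script <- script[(start+1):(stop-1)]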
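
To see what that extraction is doing, here is a toy version of the same idea. The page vector is made up for illustration (it is not the real HTML), and the sketch assumes each tag sits on its own line and appears exactly once:

```r
library(stringr)

# Made-up stand-in for the downloaded page, one element per line
page <- c("<html>", "<pre>", "  FADE IN:", "A video game.", "</pre>", "</html>")

# which() turns the logical vector from str_detect() into line positions
first <- which(str_detect(page, fixed("<pre>")))
last  <- which(str_detect(page, fixed("</pre>")))

# keep only the lines strictly between the two tags
body <- page[(first + 1):(last - 1)]
print(body)
```

If a tag appeared more than once, first or last would have more than one element and the subsetting would need more care.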

Yep! Easy enough. Next up, working out how to think about the script itself!

Screenplays have structure that is expressed through the formatting. I want to retain the structure but separate the visual styling from the representation. The styling will be added later either through a print method or through CSS after exporting the script to HTML. Some references I relied on to understand the structure of a screen play - here and here.

My slightly revised goal is to write a screenplay_parse function that reads this input and converts it into a more structured CSV file. For added points, I’m going to try to use tools from the purrr package to do this without ever writing a loop.

common sense: We’ll never survive!

the alcohol: Nonsense! You’re only saying that because no-one ever has.

The resulting function is in the princess-bride-parse.R file, and looks like this:

screenplay_parse <- function(script) {
  # start by converting to a list of pages. the pages are 
  # marked in the text with "-------- etc" so we'll use that.
  # first, find the breaks
  page_sep <- "--------------------"
  page_breaks <- script %>% str_detect(fixed(page_sep)) %>% which
  # the easiest way to do this is loop through page breaks, adding to
  # a list inside the loop. ideally though we'd vectorise this. Normally
  # I'd use lapply, sapply, etc, but I should probably start learning
  # the tidyverse versions in purrr
  script <- map2(
    .x = c(1, page_breaks),
    .y = c(page_breaks, length(script)),
    .f = function(x,y) {script[(x+1):(y-1)]} # omit page break lines
  )
  # within a script there are conceptually different parts that are 
  # formatted differently. there are "scene headers" & "action" on
  # the left, "character" "dialog", "parenthetical" and "extension"
  # in the middle, and "transition" and "shot" on the right. this
  # division isn't arbitrary: left is "place", center is "person"
  # and right is "technical", i guess. but let's not be presumptuous
  # i'll call them "left", "middle" and "right"; i'll use that
  # to label "blocks" of conceptually related content
  script %<>%
    map(function(page_str) {
      # how many lines in this page?
      n <- length(page_str)
      # how long is each line
      linelength <- page_str %>% str_length
      # count the left-side white space
      whitespace <- page_str %>% 
        str_extract("^ *") %>%  # regex extracting whitespace
        str_length              # ... and counting its characters
      # create semantically meaningful names
      type <- character(n)
      type[linelength == 0] <- "sep"
      type[whitespace == 44 & page_str %>% str_detect(fixed(":"))] <- "transition"
      type[whitespace == 44 & page_str %>% str_detect(fixed("--"))] <- "shot"
      type[whitespace == 25] <- "character" 
      type[whitespace == 25 & page_str %>% str_detect(regex("\\(.*\\)"))] <- "character+ext"
      type[whitespace == 18] <- "parenthetical"
      type[whitespace == 13] <- "dialog"
      # scene headers and action are semantically distinct, but have identical
      # white space information (i.e., 0). we can distinguish them by
      # observing that the header must appear before any non-separator in the 
      # page. however, for now I'm just going to label it all "action" because
      # TPB doesn't seem to consistently use page headers??
      type[whitespace == 0 & linelength != 0] <- "action"
      # ^^ side note... amusingly, I'd originally thought that the page breaks 
      # were scene breaks, and the ability to capture the semantics of scene
      # breaks was the main reason I wanted to process the script as a list of 
      # pages (i.e., scenes) before collapsing to a data frame, and it turns out 
      # I don't really need that. <sigh>
      # record what "type" of block each content line refers to
      blocktype <- character(n)
      blocktype[type == "sep"] <- NA
      blocktype[type %in% c("action","header")] <- "left"
      blocktype[type %in% c("character","character+ext","parenthetical","dialog")] <- "middle"
      blocktype[type %in% c("transition","shot")] <- "right"
      # now strip out the whitespace
      content <- page_str %>% str_trim
      # return a data frame containing content & position
      return(data.frame(
        content = content,
        page = NA, # to be filled
        type = type,
        whitespace = whitespace,
        blocktype = blocktype,
        blocknum = NA, # to be filled
        stringsAsFactors = FALSE # <sigh>
      ))
    })
  # each line should be labelled with the page number so we'll need
  # to insert this into the frame. I could use map2 for this, but
  # imap is a convenient shorthand:
  script %<>%
    imap(
      function(df,index) {
        df$page <- index-1
        return(df)
      }
    )
  # now collapse to data frame
  script %<>% reduce(rbind)
  # the title isn't a scene, so it has its own "type"
  script$type[ script$page == 0 & script$type != "sep"] <- "titleinfo" 
  script$blocktype[ script$page == 0 & script$type != "sep"] <- "title" 
  # now label the block numbers
  ind <- script$type != "sep"
  nlines <- sum(ind)
  script$blocknum <- as.numeric(script$blocknum)
  script$blocknum[ind][-1] <-                                         # new block if...
    ((script$type[ind][-1] %in% c("character","character+ext")) |        # ... a character type appears OR
    (script$blocktype[ind][-1] != script$blocktype[ind][-nlines])) %>%   # ... there is change in blocktype
    cumsum                                                               # running count of block starts
  script$blocknum[ind][1] <- 0 
  # define it as S3 class "screenplay"
  class(script) <- c("screenplay","data.frame")
  return(script)
}

As ugly as that all looks, I’m really proud of myself for getting here. The lines calling map, map2, imap and reduce functions are all operations that I probably would have struggled to vectorise using the base R functions, even though they totally can be. It’s just that purrr made it clear enough what I was supposed to be doing that I actually managed to get it done without saying “screw it” and writing a for loop.
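
For anyone who, like me, finds these functions easier to grasp on a tiny input, here is a stripped-down sketch of the same split / label / collapse pipeline. The lines vector and the "-----" separator are invented for the example; only map2, imap and reduce are taken from the parser:

```r
library(purrr)

# Made-up script: "-----" stands in for the page separator
lines <- c("", "TITLE", "-----", "page one, line a", "page one, line b",
           "-----", "page two", "")

breaks <- which(lines == "-----")

# map2(): walk over (start, end) pairs in parallel, one pair per page
pages <- map2(
  c(1, breaks),
  c(breaks, length(lines)),
  function(x, y) lines[(x + 1):(y - 1)]
)

# imap(): like map(), but the function also receives the element's position
frames <- imap(pages, function(p, i) {
  data.frame(content = p, page = i - 1, stringsAsFactors = FALSE)
})

# reduce(): fold the list into one data frame by repeated rbind()
toy <- reduce(frames, rbind)
print(toy)
```

Note that the first and last elements of lines are dropped, because every slice starts one past its left boundary and stops one short of its right boundary. The parser above treats its boundaries the same way.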

Anyway, I can now pass my script through the parser:

script %<>% screenplay_parse

The output is a data frame, which – after exporting to a CSV file – looks like this. So…

Yeah, more celebrations!

It has a content column that contains the relevant line from the script, a whitespace column indicating how many spaces need to be added to that line in the screen play, a page column indicating which page the line appeared on in the script, a type column indicating what kind of content is described on that line (e.g. dialog, parenthetical, transition, etc), a blocktype column that (broadly) describes where on the page the content is meant to go (left, middle, right), and a blocknum variable that groups together conceptually related content (e.g., the character is grouped with the relevant dialog). This data structure isn’t perfect (e.g., in many cases the transition direction is linked to an associated action in ways that the data frame does not reflect, and as it stands the data frame doesn’t properly capture the scene header/slugline in a way that is dissociable from the action), but I think it’s a decent first pass!
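
The blocknum column is the least obvious of these, so here is a small base R sketch of how a running block counter falls out of cumsum. The type and blocktype vectors are invented for the example; the rule (a new block starts at a character line or at a change of blocktype) is the one used in the parser:

```r
# Made-up labels for seven consecutive non-separator lines
type      <- c("action", "action", "character", "dialog",
               "character", "dialog", "transition")
blocktype <- c("left", "left", "middle", "middle",
               "middle", "middle", "right")
n <- length(type)

# compare each line with the one before it; the first line has no predecessor
new_block <- type[-1] %in% c("character", "character+ext") |
  blocktype[-1] != blocktype[-n]

# cumsum() over the TRUE/FALSE flags gives a running block counter,
# with the first line pinned to block 0
blocknum <- c(0, cumsum(new_block))
print(blocknum)
```

Each character line opens a new block even when the blocktype stays "middle", which is what keeps two consecutive speeches apart.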

Using this structure, the princess-bride-parse.R file defines a print method for screenplay objects that takes a screenplay data frame and uses it to produce something that approximates the screenplay formatting:

print.screenplay <- function(x, style="data", maxlines=50) {

  if(style == "data") {
  } else if (style == "screenplay") {
    # insert the whitespace
    str <- x$content %>% 
      str_pad(width = x$whitespace + str_length(x$content)) 
    # insert the page breaks
    nlines <- length(str)
    ind <- c(x$page[-1] != x$page[-nlines], FALSE) # pad to nlines so the index isn't recycled
    str[ind] %<>% str_replace(
      pattern = "$", 
      replacement = glue("\n\n\n ",str_dup("-",25)," page ", str_dup("-",25))
    )
    # cat the result to screen
    cat(str[1:min(maxlines,nlines)], sep = "\n")
    if(maxlines < nlines) {
      cat("\n\n[First ", maxlines, " of ", nlines, 
          " lines shown. Change 'maxlines' to show more]\n", 
          sep = "")
    }
  } else if (style == "brief") {
  } else {
  }
}

Call the function…

print(script, style = "screenplay")
##                     "The Princess Bride"
##                              by
##                        William Goldman
##                                          1987-Shooting Draft
## ------------------------- page -------------------------
## The game is in progress. As a sick coughing sound is heard.
##                                             CUT TO:
## lying in bed, coughing. Pale, one sick cookie. Maybe he's
## seven or eight or nine. He holds a remote in one hand,
## presses it, and the video game moves a little bit. Then he's
## hit by another spasm of coughing, puts the remote down.
## His room is monochromatic, greys and blues, mildly high-tech.
## We're in the present day and this is a middle class house,
## somewhere in the suburbs.
##                                             CUT TO:
## The Kid's MOTHER as she enters, goes to him, fluffs his
## pillows, kisses him, and briefly feels his forehead. She's
## worried, it doesn't show. During this
##                          MOTHER
##              You feeling any better?
##                          THE KID
##              A little bit.
##                          MOTHER
##              Guess what.
##                          THE KID
##              What?
##                          MOTHER
##              Your grandfather's here.
## [First 50 of 6731 lines shown. Change 'maxlines' to show more]

… and it uses the script data frame to print the script in a format that looks very similar to the original. The complete text file is here. Admittedly, that’s not a great achievement: I’ve taken the nicely structured CSV file and used it to recreate the original script that I built it from. That’s fine as a sanity check, but it would be nicer to illustrate how you can wrap any styling you like around it once you have the core structure!
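
The indentation trick in the print method is worth isolating. Below is a minimal sketch using three made-up rows, with content and whitespace values of the kind the parser records; it relies on str_pad padding on the left by default and being vectorised over width:

```r
library(stringr)

# Made-up rows: trimmed content plus the number of leading spaces to restore
content    <- c("MOTHER", "You feeling any better?", "CUT TO:")
whitespace <- c(25, 13, 44)

# target width = leading spaces + length of the text itself
padded <- str_pad(content, width = whitespace + str_length(content))

cat(padded, sep = "\n")
```

Because the width is computed per line, each row gets exactly its own indentation back, regardless of how long the text is.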

I’m a little tipsy at this point in the evening, but let’s have a go at this. Can we make a simple web page that displays the script in a more “human friendly” fashion that emphasises the dialog without omitting the other information? Sure!

First, we have to write the CSV file to HTML, wrapping everything in spans and divs that we can style appropriately:

screenplay_tohtml <- function(script) {
  # remove separator lines
  script <- script[script$type != "sep",]
  nlines <- dim(script)[1]
  # <div>
  # open/close block variables
  script$open_block <- c(TRUE, script$blocknum[-1] != script$blocknum[-nlines])
  script$close_block <- c(script$blocknum[-nlines] != script$blocknum[-1],TRUE)
  # <span>
  # open/close dialog
  script$open_dialog <- c(FALSE, script$type[-1] == "dialog" & script$type[-nlines] != "dialog")
  script$close_dialog <- c(script$type[-nlines] == "dialog" & script$type[-1] != "dialog",FALSE)

  # open/close parenthetical
  script$open_parenthetical <- c(FALSE, script$type[-1] == "parenthetical" & script$type[-nlines] != "parenthetical")
  script$close_parenthetical <- c(script$type[-nlines] == "parenthetical" & script$type[-1] != "parenthetical",FALSE)
  # open/close character
  script$open_character <- c(FALSE, script$type[-1] %in% c("character","character+ext") & !(script$type[-nlines] %in% c("character","character+ext")))
  script$close_character <-c(script$type[-nlines] %in% c("character","character+ext") & !(script$type[-1] %in% c("character","character+ext")),FALSE)
  # open/close transition
  script$open_transition <- c(FALSE, script$type[-1] == "transition" & script$type[-nlines] != "transition")
  script$close_transition <- c(script$type[-nlines] == "transition" & script$type[-1] != "transition",FALSE)
  # open/close action
  script$open_action <- c(FALSE, script$type[-1] == "action" & script$type[-nlines] != "action")
  script$close_action <- c(script$type[-nlines] == "action" & script$type[-1] != "action",FALSE)
  # create html
  html <- script$content
  # add action span tags
  html[script$open_action] %<>% str_replace("^","<span class='action'>")
  html[script$close_action] %<>% str_replace("$","</span>")
  # add transition span tags
  html[script$open_transition] %<>% str_replace("^","<span class='transition'>")
  html[script$close_transition] %<>% str_replace("$","</span>")
  # add character span tags
  html[script$open_character] %<>% str_replace("^","<span class='character'>")
  html[script$close_character] %<>% str_replace("$","</span>")

  # add parenthetical span tags
  html[script$open_parenthetical] %<>% str_replace("^","<span class='parenthetical'>")
  html[script$close_parenthetical] %<>% str_replace("$","</span>")
  # add dialog span tags
  html[script$open_dialog] %<>% str_replace("^","<span class='dialog'>")
  html[script$close_dialog] %<>% str_replace("$","</span>")
  # add block divs
  html[script$open_block] %<>% str_replace("^","\n<div class='block'>\n")
  html[script$close_block] %<>% str_replace("$","\n</div>")
  html <- glue_collapse(html, sep="\n")
  return(html)
}

add_wrapper <- function(html) {
  glue(
    "<link rel='stylesheet' type='text/css' href='playstyle.css'>",
    html, .sep = "\n"
  )
}
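
All of the open/close columns in screenplay_tohtml rely on one trick: compare each line's type with its neighbour's to find where a run starts and ends, then prepend or append a tag at those positions. Here is a toy sketch for the dialog case only, with made-up content:

```r
library(stringr)

# Made-up lines: a two-line speech sandwiched between two character cues
content <- c("MOTHER", "line one of a speech", "line two of a speech", "THE KID")
type    <- c("character", "dialog", "dialog", "character")
n <- length(type)

# a run opens where this line is dialog and the previous one is not
open_dialog  <- c(FALSE, type[-1] == "dialog" & type[-n] != "dialog")
# a run closes where this line is dialog and the next one is not
close_dialog <- c(type[-n] == "dialog" & type[-1] != "dialog", FALSE)

html <- content
html[open_dialog]  <- str_replace(html[open_dialog], "^", "<span class='dialog'>")
html[close_dialog] <- str_replace(html[close_dialog], "$", "</span>")

cat(html, sep = "\n")
```

One consequence of padding with FALSE at the ends is that a run starting on the very first line, or ending on the very last one, never gets its tag. That may not matter for this script, but it is something to keep in mind when reusing the pattern.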

That’s extremely boring, but we can now write a small CSS file, playstyle.css that tells the browser how to style the output! So now when I write the HTML to a file…

cat(  # assuming cat() here; any writer that takes a file argument would do
  script %>% screenplay_tohtml %>% add_wrapper,
  file = "./princess-bride-formatted.html"
)

… I get a version of the script that appears in nicely styled HTML, shown here



Danielle Navarro
Associate Professor of Cognitive Science