Day 10: Exploring #FolkloreThursday using rtweet

I have several colleagues and friends who use social media as a data source, and I’ve always wondered how they get the data and if I could do the same using R. The Twitter API, for instance, allows you some (limited) access to tweets, and in the past I have played around with the twitteR package to set up an R-based twitter client. The rtweet package seems to be a slightly more recent version of the same thing? So let’s try it out!

First off, I created a twitter app. There are good instructions on how to do this in the package vignette, or alternatively there’s this post that provides a quick setup guide. It did require me to enter my mobile phone number into twitter, and I found that I had to disable callback locking (in the check boxes) in order to get an access token. But after I’d done that, this command worked perfectly:

library(rtweet)  # provides create_token() and search_tweets()

token <- create_token(
  app = "my-app-name",          # not my real app name!
  consumer_key = "xxxxxxx",     # or my real consumer key
  consumer_secret = "xxxxxxx",  # and obviously, not my consumer secret!!!
  set_renv = TRUE
)

Once you have the token you can save it locally so that you don’t have to keep looking up your consumer key and secret in every session. I didn’t bother because I’m just playing around with this and not likely to use it again in the near future.
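For anyone who does want to keep the token around, a minimal sketch might look like the following. The helper name `save_token` is made up for illustration, and the idea that rtweet finds the saved token through an environment variable called `TWITTER_PAT` is an assumption based on how versions of the package from around this time behaved, so check the vignette for your version. A stand-in list is used in place of a real token so that the code runs without credentials.

```r
# Hypothetical helper (not part of rtweet): save a token to disk and record
# its location in an .Renviron-style file.
save_token <- function(token, dir = tempdir()) {
  token_path <- file.path(dir, "twitter_token.rds")
  saveRDS(token, file = token_path)
  # Assumption: rtweet looks up the token path in the TWITTER_PAT variable
  cat(
    paste0("TWITTER_PAT=", token_path, "\n"),
    file = file.path(dir, ".Renviron"),
    append = TRUE
  )
  invisible(token_path)
}

# A stand-in list is used here so the sketch runs without real credentials
fake_token <- list(app = "my-app-name")
token_path <- save_token(fake_token)
restored <- readRDS(token_path)
```

In real use, `dir` would presumably be your home directory rather than a temporary one, and `token` would be the object returned by `create_token()`.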

Searching twitter for folklore

Besides #Rstats, my favourite hashtag on twitter is #FolkloreThursday. So I used the search_tweets function to find 1000 tweets that reference the hashtag.

tw <- search_tweets(
  q = "#FolkloreThursday",
  n = 1000
)

The result is a tibble with 1000 rows and 42 columns summarising the tweets:

tw
## # A tibble: 1,000 x 42
##    status_id  created_at user_id screen_name text  source reply_to_status…
##    <chr>      <chr>      <chr>   <chr>       <chr> <chr>  <chr>           
##  1 992697057… 2018-05-0… 284982… Alpha_Anne… RT @… Twitt… <NA>            
##  2 992693571… 2018-05-0… 421788… maicching8… RT @… Twitt… <NA>            
##  3 992692691… 2018-05-0… 117650… randomwalk  Illu… Twitt… <NA>            
##  4 992688768… 2018-05-0… 299963… DreadfulRed "RT … Twitt… <NA>            
##  5 992687644… 2018-05-0… 284155… Kolekcjone… RT @… Twitt… <NA>            
##  6 992685834… 2018-05-0… 235173… 1Atsuhimer… "RT … Twitt… <NA>            
##  7 992685178… 2018-05-0… 110359… hellohisto… "RT … Twitt… <NA>            
##  8 992685006… 2018-05-0… 189835… AndreaCCon… RT @… Twitt… <NA>            
##  9 992684008… 2018-05-0… 922038… AlaynasMot… "RT … Twitt… <NA>            
## 10 992681680… 2018-05-0… 749348… ATHE1STP0W… RT @… Twitt… <NA>            
## # ... with 990 more rows, and 35 more variables: reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, hashtags <chr>,
## #   symbols <lgl>, urls_url <chr>, urls_t.co <chr>,
## #   urls_expanded_url <chr>, media_url <chr>, media_t.co <chr>,
## #   media_expanded_url <chr>, media_type <chr>, ext_media_url <chr>,
## #   ext_media_t.co <chr>, ext_media_expanded_url <chr>,
## #   ext_media_type <lgl>, mentions_user_id <chr>,
## #   mentions_screen_name <chr>, lang <chr>, quoted_status_id <chr>,
## #   quoted_text <chr>, retweet_status_id <chr>, retweet_text <chr>,
## #   place_url <lgl>, place_name <lgl>, place_full_name <lgl>,
## #   place_type <lgl>, country <lgl>, country_code <lgl>, geo_coords <lgl>,
## #   coords_coords <lgl>, bbox_coords <lgl>

For instance, one row in the table corresponds to a twitter user retweeting this Folklore Thursday tweet from 2017:

The media_url variable in the tw tibble contains links to any images included in the tweets. So if I wanted to pull all of the images from the tweets, I could do something like this. First find all of the image URLs:

urls <- unlist(tw$media_url)
urls <- urls[!is.na(urls)]

Next, I created a vector of locations to save the images inside a convenient “tweets” folder:

fnames <- gsub(".*/","tweets/",urls)
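To see what that substitution does, here is a small sketch using two made-up image URLs (they are placeholders, not URLs from the data). The pattern `.*/` is greedy, so it swallows everything up to the last slash and leaves only the file name. The `basename()` version is an alternative I would expect to give the same result for simple URLs like these.

```r
# Hypothetical image URLs, made up for illustration
urls <- c(
  "http://pbs.twimg.com/media/abc123.jpg",
  "http://pbs.twimg.com/media/xyz789.jpg"
)

# ".*/" is greedy, so it matches everything up to the last slash
fnames_gsub <- gsub(".*/", "tweets/", urls)

# Alternative using base R path helpers
fnames_base <- file.path("tweets", basename(urls))
```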

Finally, I downloaded the images!

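dir.create("tweets", showWarnings = FALSE)  # make sure the target folder exists
for(i in seq_along(urls)) {                 # seq_along() is safe when urls is empty
  download.file(url = urls[i], destfile = fnames[i], mode = "wb")  # binary mode for images
}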

This produces a bunch of images, some prosaic, some wonderful, but – as is always the case with Folklore Thursday – all kind of interesting. To give the original tweeters credit for their posts, instead of showing the automatically downloaded images, here are two lovely tweets by Alexandra Epps that I discovered by browsing the images…

To explore the hashtags, I used the wordcloud package (which was also new to me so yay! 🥂) to draw a pretty wordcloud:

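library(magrittr)   # provides the %>% pipe
library(wordcloud)  # provides wordcloud()

tags <-
  tw$hashtags %>% 
  strsplit(split = " ") %>%
  unlist %>%
  tolower %>%
  table %>% 
  sort(decreasing = TRUE)

# drop the first (most frequent) entry, presumably #folklorethursday itself,
# which would otherwise dominate the cloud
wordcloud(
  words = names(tags[-1]), 
  freq = tags[-1]
)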
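The counting step is easier to follow on a toy example. This sketch uses a made-up `hashtags` vector in the same shape as the `tw$hashtags` column (space-separated tags, `NA` when none were recorded) and spells out each stage in base R without the pipe.

```r
# Toy stand-in for tw$hashtags: space-separated hashtags, NA when none recorded
hashtags <- c("FolkloreThursday folklore", "folklorethursday", NA,
              "FolkloreThursday Myth")

# One element per hashtag; the NA entry stays NA
pieces <- unlist(strsplit(hashtags, split = " "))

# Lower-case so that capitalisation variants are pooled; table() drops NA
tags <- sort(table(tolower(pieces)), decreasing = TRUE)

# The most frequent tag comes first
top_tag <- names(tags)[1]
```

Because the search hashtag tends to end up at the front of the sorted table, dropping the first element before plotting keeps it from swamping everything else.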

The tw tibble also contains information about what Twitter client was used to post the tweet, in the source column. Here’s what we get:

tab <- table(tw$source)
tab <- c(tab[tab>25], "Other"= sum(tab[tab<=25]))  # <= so sources with exactly 25 tweets are not dropped
pie(x=tab)
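The lumping step can also be wrapped in a small helper. This is only a sketch: `lump_small` is a made-up name, the counts are invented, and the threshold of 25 simply mirrors the value used here. It pairs `>` with `<=` so that a category sitting exactly on the threshold lands in one of the two groups rather than vanishing.

```r
# Hypothetical helper: keep categories above the threshold, pool the rest
lump_small <- function(counts, threshold = 25) {
  big <- counts[counts > threshold]
  small <- counts[counts <= threshold]
  c(big, Other = sum(small))
}

# Made-up counts, including one category exactly on the threshold
counts <- c(web = 30, iphone = 25, bot = 3)
lumped <- lump_small(counts)
```

A quick sanity check is that the lumped counts still add up to the original total.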

Here’s the distribution of times at which the tweets were posted:

library(lubridate)  # provides ymd_hms() for parsing the timestamp strings

tw$created_at %>%
  ymd_hms %>%
  hist(breaks = 10, main = "", col = "#88398A", border = "white", xlab = "Time")

A puzzle

I feel like I’m still missing something important about how to pull information from the tweets though. One of the more endearing tweets that I found in my data set was this one:

There are obviously some images here – I mean, FART-RUNES!! – but here’s the relevant row from the tibble:

print(tw[4,],width = Inf)
## # A tibble: 1 x 42
##   status_id          created_at          user_id    screen_name
##   <chr>              <chr>               <chr>      <chr>      
## 1 992688768862818305 2018-05-05 08:53:40 2999633231 DreadfulRed
##   text                                                                    
##   <chr>                                                                   
## 1 "RT @GanferHaarFinn: Fretrúnir - FART-RUNES\nOne of the strangest (&amp…
##   source             reply_to_status_id reply_to_user_id
##   <chr>              <chr>              <chr>           
## 1 Twitter Web Client <NA>               <NA>            
##   reply_to_screen_name is_quote is_retweet favorite_count retweet_count
##   <chr>                <lgl>    <lgl>               <int>         <int>
## 1 <NA>                 F        T                       0            91
##   hashtags symbols urls_url urls_t.co urls_expanded_url media_url
##   <chr>    <lgl>   <chr>    <chr>     <chr>             <chr>    
## 1 <NA>     NA      <NA>     <NA>      <NA>              <NA>     
##   media_t.co media_expanded_url media_type ext_media_url ext_media_t.co
##   <chr>      <chr>              <chr>      <chr>         <chr>         
## 1 <NA>       <NA>               <NA>       <NA>          <NA>          
##   ext_media_expanded_url ext_media_type mentions_user_id
##   <chr>                  <lgl>          <chr>           
## 1 <NA>                   NA             2151784741      
##   mentions_screen_name lang  quoted_status_id quoted_text
##   <chr>                <chr> <chr>            <chr>      
## 1 GanferHaarFinn       en    <NA>             <NA>       
##   retweet_status_id 
##   <chr>             
## 1 992032329416798209
##   retweet_text                                                            
##   <chr>                                                                   
## 1 "Fretrúnir - FART-RUNES\nOne of the strangest (&amp; most horrible) cur…
##   place_url place_name place_full_name place_type country country_code
##   <lgl>     <lgl>      <lgl>           <lgl>      <lgl>   <lgl>       
## 1 NA        NA         NA              NA         NA      NA          
##   geo_coords coords_coords bbox_coords
##   <lgl>      <lgl>         <lgl>      
## 1 NA         NA            NA

None of the URL fields refer to the image at all. My first hypothesis was that the retweet (which is what my client actually found) doesn’t record the URL of the image? But that doesn’t make a lot of sense - the tweets for which the tw tibble did record the image URL were retweets too. My next thought was to scan the text of the tweets and compare that to whether there was a URL listed anywhere:

library(stringr)  # provides str_detect() and fixed()

linkInText <- str_detect(tw$text, fixed("http"))
hasMediaURL <- !is.na(tw$media_url)
table(hasMediaURL, linkInText)
##            linkInText
## hasMediaURL FALSE TRUE
##       FALSE   896   55
##       TRUE      0   49

That seems informative. Okay, a closer look. Here’s the text of the tweet that I downloaded:

tw$text[4]
## [1] "RT @GanferHaarFinn: Fretrúnir - FART-RUNES\nOne of the strangest (&amp; most horrible) curses recorded in Icelandic sorcery involves casting a #…"

But there’s also a retweet_text field:

tw$retweet_text[4]
## [1] "Fretrúnir - FART-RUNES\nOne of the strangest (&amp; most horrible) curses recorded in Icelandic sorcery involves casting a #FartRunes spell.\nIn 1654 one man was burnt at the stake after admitting casting Fretrúnir on a local girl. (Lots more info in comments thread)\n#FolkloreThursday https://t.co/jjb6ngo0MQ"

The URL at the end takes you to the original tweet (i.e., the one I embedded above). Presumably if I then used the rtweet client to download the original tweet, it would have the links to the images?
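Pulling that trailing link out of the text is straightforward with a regular expression. The sketch below uses only the tail end of the retweet_text shown above, and `extract_links` is a made-up helper name. The pattern assumes that t.co links consist of letters and digits after the slash, which holds for the examples in this post but is not something I have checked against Twitter’s documentation.

```r
# Tail end of tw$retweet_text[4], copied from the output above
retweet_tail <- "(Lots more info in comments thread)\n#FolkloreThursday https://t.co/jjb6ngo0MQ"

# Assumed shape of a t.co short link: letters and digits after the slash
link_pattern <- "https?://t\\.co/[A-Za-z0-9]+"

# Hypothetical helper: return every t.co link found in each string
extract_links <- function(text) {
  regmatches(text, gregexpr(link_pattern, text))
}

links <- extract_links(retweet_tail)[[1]]
no_links <- extract_links("no link in this text")[[1]]
```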

This seems to happen quite a bit. One of the tweets in tw is a retweet of this one:

As before, the text column within the retweet itself is truncated:

tw$text[20]
## [1] "RT @beakheads: Recent Green Man-themed writing for @elementumjournl reminds me of the many amazing carvings worth seeing @ExeterCathedral (…"

There is no mention of an image in any of the “URL related” columns of tw, but as before the retweet_text column contains the link to the original tweet:

tw$retweet_text[20]
## [1] "Recent Green Man-themed writing for @elementumjournl reminds me of the many amazing carvings worth seeing @ExeterCathedral (photos by @Mark_Ware) #FolkloreThursday https://t.co/OqrL5c8CIc"

So I think there is something going on involving truncation with retweets, but that’s maybe not the whole story? I’m not quite sure if there’s an easy fix – is there some argument I can pass to the Twitter API that would return the image links within the original tweet, or would that require a different query? I have a suspicion that Twitter might not let me pull this information? In any case, I’ve run short on time, so I’ll leave that to another time.
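One possible next step, sketched here under several assumptions, would be to collect the IDs of the original tweets from the retweet_status_id column and request those directly. The toy data frame below stands in for tw: the first ID is the one from the FART-RUNES row, while the other values are invented. The final call to `lookup_statuses()` is left commented out because it needs a valid token, and whether it would return populated media_url fields is exactly the open question.

```r
# Toy stand-in for two columns of tw; only the first ID is real, the rest
# of the values are invented for illustration
tw_toy <- data.frame(
  retweet_status_id = c("992032329416798209", "992032329416798209", NA, "111"),
  media_url = c(NA, NA, NA, "http://example.com/a.jpg"),
  stringsAsFactors = FALSE
)

# Retweets whose own row has no media URL recorded
needs_lookup <- !is.na(tw_toy$retweet_status_id) & is.na(tw_toy$media_url)

# Each original tweet only needs to be requested once
original_ids <- unique(tw_toy$retweet_status_id[needs_lookup])

# With a valid token, the originals could then be requested by ID, e.g.
# originals <- rtweet::lookup_statuses(original_ids)
# (untested here; it may or may not bring back the image links)
```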

Wrapping up

This was fun! I guess I have a lot to learn about the Twitter API (note to self: here) but it’s definitely enjoyable!

Postscript

After doing a little more reading, it’s worth noting that the Twitter terms of service (as of May 6th!) do allow you to crawl the site, so long as the queries respect the robots.txt file; for creating derivative works (which, arguably, includes any data set compiled in the manner I did here), you must adhere to the Developer Agreement and Developer Policy. I think that what I’ve posted here is consistent with these - I try to take ToS provisions seriously 😄

Danielle Navarro
Associate Professor of Cognitive Science
