I have several colleagues and friends who use social media as a data source, and I’ve always wondered how they get the data and if I could do the same using R. The Twitter API, for instance, allows you some (limited) access to tweets, and in the past I have played around with the twitteR package to set up an R-based twitter client. The rtweet package seems to be a slightly more recent version of the same thing? So let’s try it out!
First off, I created a twitter app. There are good instructions on how to do this in the package vignette, or alternatively there’s this post that provides a quick setup guide. It did require me to enter my mobile phone number into twitter, and I found that I had to disable callback locking (in the check boxes) in order to get an access token. But after I’d done that, this command worked perfectly:
library(rtweet)

token <- create_token(
  app = "my-app-name", # not my real app name!
  consumer_key = "xxxxxxx", # or my real consumer key
  consumer_secret = "xxxxxxx", # and obviously, not my consumer secret!!!
  set_renv = TRUE
)
Once you have the token you can save it locally so that you don’t have to keep looking up your consumer key and secret in every session. I didn’t bother because I’m just playing around with this and not likely to use it again in the near future.
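For anyone who does want to keep the token around, here is a minimal sketch of the save-and-reload idea using base R. The `fake_token` object is a stand-in for whatever `create_token()` returns, and the file location is an arbitrary choice for illustration, so treat both as assumptions rather than the package's own mechanism.

```r
# Sketch: cache a credential object on disk so later sessions can reuse it.
# `fake_token` is a hypothetical stand-in for the object returned by
# create_token(); the path below is just an example location.
fake_token <- list(app = "my-app-name", consumer_key = "xxxxxxx")

token_path <- file.path(tempdir(), "twitter_token.rds")

# Save once...
saveRDS(fake_token, token_path)

# ...and in a later session, read it back instead of re-entering the keys.
token <- readRDS(token_path)
```

In real use the file would hold secrets, so it probably belongs somewhere outside any folder that gets shared or committed.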
Searching twitter for folklore
Besides #Rstats, my favourite hashtag on twitter is #FolkloreThursday. So I used the search_tweets function to find 1000 tweets that reference the hashtag.
tw <- search_tweets(
  q = "#FolkloreThursday",
  n = 1000
)
The result is a tibble with 1000 rows and 42 columns summarising the tweets:
tw
## # A tibble: 1,000 x 42
## status_id created_at user_id screen_name text source reply_to_status…
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 992697057… 2018-05-0… 284982… Alpha_Anne… RT @… Twitt… <NA>
## 2 992693571… 2018-05-0… 421788… maicching8… RT @… Twitt… <NA>
## 3 992692691… 2018-05-0… 117650… randomwalk Illu… Twitt… <NA>
## 4 992688768… 2018-05-0… 299963… DreadfulRed "RT … Twitt… <NA>
## 5 992687644… 2018-05-0… 284155… Kolekcjone… RT @… Twitt… <NA>
## 6 992685834… 2018-05-0… 235173… 1Atsuhimer… "RT … Twitt… <NA>
## 7 992685178… 2018-05-0… 110359… hellohisto… "RT … Twitt… <NA>
## 8 992685006… 2018-05-0… 189835… AndreaCCon… RT @… Twitt… <NA>
## 9 992684008… 2018-05-0… 922038… AlaynasMot… "RT … Twitt… <NA>
## 10 992681680… 2018-05-0… 749348… ATHE1STP0W… RT @… Twitt… <NA>
## # ... with 990 more rows, and 35 more variables: reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, hashtags <chr>,
## # symbols <lgl>, urls_url <chr>, urls_t.co <chr>,
## # urls_expanded_url <chr>, media_url <chr>, media_t.co <chr>,
## # media_expanded_url <chr>, media_type <chr>, ext_media_url <chr>,
## # ext_media_t.co <chr>, ext_media_expanded_url <chr>,
## # ext_media_type <lgl>, mentions_user_id <chr>,
## # mentions_screen_name <chr>, lang <chr>, quoted_status_id <chr>,
## # quoted_text <chr>, retweet_status_id <chr>, retweet_text <chr>,
## # place_url <lgl>, place_name <lgl>, place_full_name <lgl>,
## # place_type <lgl>, country <lgl>, country_code <lgl>, geo_coords <lgl>,
## # coords_coords <lgl>, bbox_coords <lgl>
For instance, one row in the table corresponds to a twitter user retweeting this Folklore Thursday tweet from 2017:
The Mermaid and the Dragon.
— Lynne (@ghirlandia) June 29, 2017
Illustration by Warwick Goble.#FolkloreThursday pic.twitter.com/Dw1ursIr5z
The media_url variable in the tw tibble contains links to any images included in the tweets. So if I wanted to pull all of the images from the tweets, I could do something like this. First find all of the image URLs:
urls <- unlist(tw$media_url)
urls <- urls[!is.na(urls)]
Next, I created a vector of locations to save the images inside a convenient “tweets” folder:
fnames <- gsub(".*/","tweets/",urls)
Finally, I downloaded the images!
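dir.create("tweets", showWarnings = FALSE) # make sure the target folder exists
for (i in seq_along(urls)) { # seq_along() is safe even when urls is empty
  download.file(url = urls[i], destfile = fnames[i], mode = "wb") # binary mode for images
}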
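A slightly more defensive variant of the download step might look like the sketch below. The helper names `image_paths` and `download_images` are made up for this example, and so are the two URLs, which only illustrate how the file paths get built.

```r
# Build local file paths from image URLs: keep only the file name and
# put it inside `dir`.
image_paths <- function(urls, dir = "tweets") {
  file.path(dir, basename(urls))
}

# Download each URL, recording failures instead of stopping the loop.
# Returns a logical vector: TRUE where the download succeeded.
download_images <- function(urls, dir = "tweets") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  paths <- image_paths(urls, dir)
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    ok[i] <- tryCatch(
      download.file(urls[i], destfile = paths[i], mode = "wb", quiet = TRUE) == 0,
      warning = function(w) FALSE,
      error = function(e) FALSE
    )
  }
  ok
}

# Hypothetical URLs, used here only to show the path construction.
example_urls <- c(
  "http://pbs.twimg.com/media/example1.jpg",
  "http://pbs.twimg.com/media/example2.jpg"
)
image_paths(example_urls)
```

Returning a success flag per URL makes it easy to see afterwards which images went missing, which matters when a tweet has been deleted since the search was run.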
This produces a bunch of images, some prosaic, some wonderful, but – as is always the case with Folklore Thursday – all kind of interesting. To give the original tweeters credit for their posts, instead of showing the automatically downloaded images, here are two lovely tweets by Alexandra Epps that I discovered by browsing the images…
Rapunzel, Rapunzel, let down your hair…#BrothersGrimm #FlorenceHarrison #FolkloreThursday pic.twitter.com/Epmk8e0hOk
— Alexandra Epps (@ArtGuideAlex) April 19, 2018
But grandmother! What big teeth you have..
— Alexandra Epps (@ArtGuideAlex) May 3, 2018
Little Red Riding Hood #ArthurRackham #FolkloreThursday pic.twitter.com/qzzzXQfjYv
To explore the hashtags, I used the wordcloud package (which was also new to me so yay! 🥂) to draw a pretty wordcloud:
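library(magrittr) # provides the %>% pipe
library(wordcloud)

# split the space-separated hashtags, lowercase them, and count
tags <-
  tw$hashtags %>%
  strsplit(split = " ") %>%
  unlist %>%
  tolower %>%
  table %>%
  sort(decreasing = TRUE)

# tags[-1] drops the most frequent tag (presumably the search hashtag itself)
wordcloud(
  words = names(tags[-1]),
  freq = tags[-1]
)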
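To make the counting step easier to follow, here is the same idea on a tiny made-up vector, written without the pipe. The toy values assume the format the code above relies on (one string per tweet, tags separated by spaces, NA when there are none), and it removes the search hashtag by name rather than by position.

```r
# Toy stand-in for tw$hashtags (made-up values, not real data).
hashtags <- c("FolkloreThursday folklore", "folklorethursday", NA, "FolkloreThursday myth")

# Split into individual tags, drop missing values, normalise case.
tags <- unlist(strsplit(hashtags, split = " "))
tags <- tolower(tags[!is.na(tags)])

# Frequency table, most common tag first.
counts <- sort(table(tags), decreasing = TRUE)

# Drop the search hashtag by name instead of assuming it is in first place.
counts <- counts[names(counts) != "folklorethursday"]
counts
```

Dropping by name is a little more robust than `tags[-1]`, which quietly removes whichever tag happens to be most frequent.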
The tw tibble also contains information about what Twitter client was used to post the tweet, in the source column. Here’s what we get:
tab <- table(tw$source)
# lump rare clients together; <= so that a client with exactly 25 tweets is not dropped
tab <- c(tab[tab > 25], "Other" = sum(tab[tab <= 25]))
pie(x = tab)
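The "keep the big categories, lump the rest" step can be wrapped in a small function. This is only a sketch: `lump_rare` is a name I made up, the client names are invented, and the threshold here means "at least `min_count`" rather than the "more than 25" used above.

```r
# Collapse categories with fewer than `min_count` observations into "Other".
lump_rare <- function(x, min_count) {
  tab <- table(x)
  keep <- tab >= min_count
  c(tab[keep], Other = sum(tab[!keep]))
}

# Made-up client names, just to exercise the function.
clients <- c(rep("Web", 5), rep("iPhone", 3), "Bot A", "Bot B")
res <- lump_rare(clients, min_count = 3)
res
```

Using `keep` and `!keep` guarantees that every observation lands in exactly one group, so the slices of the pie always add up to the full data set.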
Here’s the distribution of times at which the tweets were posted:
library(lubridate) # for ymd_hms()

tw$created_at %>%
  ymd_hms %>%
  hist(breaks = 10, main = "", col = "#88398A", border = "white", xlab = "Time")
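If the histogram bins feel too coarse, one alternative is to count tweets per hour of the day. Here is a rough base R sketch on three timestamps: the first is copied from the output further down, the other two are invented, and treating them as UTC is my assumption.

```r
# Timestamps in the same "YYYY-MM-DD HH:MM:SS" format as tw$created_at.
# The first value appears in the data; the other two are made up.
created_at <- c("2018-05-05 08:53:40", "2018-05-05 08:10:02", "2018-05-04 21:30:00")

# Parse as date-times (assuming UTC) and pull out the hour of day.
times <- as.POSIXct(created_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
hours <- as.integer(format(times, "%H"))

# Count per hour, keeping empty hours so the result always has 24 entries.
by_hour <- table(factor(hours, levels = 0:23))
by_hour
```

Passing `by_hour` to `barplot()` would then give one bar per hour.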
A puzzle
I feel like I’m still missing something important about how to pull information from the tweets though. One of the more endearing tweets that I found in my data set was this one:
Fretrúnir - FART-RUNES
— G. H. Finn (@GanferHaarFinn) May 3, 2018
One of the strangest (& most horrible) curses recorded in Icelandic sorcery involves casting a #FartRunes spell.
In 1654 one man was burnt at the stake after admitting casting Fretrúnir on a local girl. (Lots more info in comments thread)#FolkloreThursday pic.twitter.com/jjb6ngo0MQ
There are obviously some images here – I mean, FART-RUNES!! – but here’s the relevant row from the tibble:
print(tw[4,],width = Inf)
## # A tibble: 1 x 42
## status_id created_at user_id screen_name
## <chr> <chr> <chr> <chr>
## 1 992688768862818305 2018-05-05 08:53:40 2999633231 DreadfulRed
## text
## <chr>
## 1 "RT @GanferHaarFinn: Fretrúnir - FART-RUNES\nOne of the strangest (&…
## source reply_to_status_id reply_to_user_id
## <chr> <chr> <chr>
## 1 Twitter Web Client <NA> <NA>
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## <chr> <lgl> <lgl> <int> <int>
## 1 <NA> F T 0 91
## hashtags symbols urls_url urls_t.co urls_expanded_url media_url
## <chr> <lgl> <chr> <chr> <chr> <chr>
## 1 <NA> NA <NA> <NA> <NA> <NA>
## media_t.co media_expanded_url media_type ext_media_url ext_media_t.co
## <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA>
## ext_media_expanded_url ext_media_type mentions_user_id
## <chr> <lgl> <chr>
## 1 <NA> NA 2151784741
## mentions_screen_name lang quoted_status_id quoted_text
## <chr> <chr> <chr> <chr>
## 1 GanferHaarFinn en <NA> <NA>
## retweet_status_id
## <chr>
## 1 992032329416798209
## retweet_text
## <chr>
## 1 "Fretrúnir - FART-RUNES\nOne of the strangest (& most horrible) cur…
## place_url place_name place_full_name place_type country country_code
## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 NA NA NA NA NA NA
## geo_coords coords_coords bbox_coords
## <lgl> <lgl> <lgl>
## 1 NA NA NA
None of the URL fields refer to the image at all. My first hypothesis was that the retweet (which is what my client actually found) doesn’t record the URL of the image? But that doesn’t make a lot of sense - the tweets for which the tw tibble did record the image URL were retweets too. My next thought was to scan the text of the tweets and compare that to whether there was a URL listed anywhere:
library(stringr) # for str_detect() and fixed()

linkInText <- str_detect(tw$text, fixed("http"))
hasMediaURL <- !is.na(tw$media_url)
table(hasMediaURL, linkInText)
## linkInText
## hasMediaURL FALSE TRUE
## FALSE 896 55
## TRUE 0 49
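That seems informative: every tweet with a media URL also has a link in its text, but 55 tweets have a link in the text and no media URL. Okay, a closer look. Here’s the text of the tweet that I downloaded: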
tw$text[4]
## [1] "RT @GanferHaarFinn: Fretrúnir - FART-RUNES\nOne of the strangest (& most horrible) curses recorded in Icelandic sorcery involves casting a #…"
But there’s also a retweet_text field:
tw$retweet_text[4]
## [1] "Fretrúnir - FART-RUNES\nOne of the strangest (& most horrible) curses recorded in Icelandic sorcery involves casting a #FartRunes spell.\nIn 1654 one man was burnt at the stake after admitting casting Fretrúnir on a local girl. (Lots more info in comments thread)\n#FolkloreThursday https://t.co/jjb6ngo0MQ"
The URL at the end takes you to the original tweet (i.e., the one I embedded above). Presumably if I then used the rtweet client to download the original tweet, it would have the links to the images?
This seems to happen quite a bit. One of the tweets in tw is a retweet of this one:
Shānguǐ (山鬼). Literally “mountain ghost.” a jilted lover in the classical Nine Songs, she developed over time into a woman living in the wilderness, married to a red leopard or a tiger. Considered the goddess of Mt. Wū (巫山神女). Painted by Hwa San-chiuen.#FolkloreThursday pic.twitter.com/nh77sOsZi7
— Laurie Lei (@laurielei) February 8, 2018
As before, the text column within the retweet itself is truncated:
tw$text[20]
## [1] "RT @beakheads: Recent Green Man-themed writing for @elementumjournl reminds me of the many amazing carvings worth seeing @ExeterCathedral (…"
There is no mention of an image in any of the “URL related” columns of tw, but as before the retweet_text column contains the link to the original tweet:
tw$retweet_text[20]
## [1] "Recent Green Man-themed writing for @elementumjournl reminds me of the many amazing carvings worth seeing @ExeterCathedral (photos by @Mark_Ware) #FolkloreThursday https://t.co/OqrL5c8CIc"
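Since the link back to the original tweet survives in retweet_text, one way to collect those links is a regular expression. This is a rough sketch in base R: the helper name `extract_tco` is made up, and the example strings are abbreviated versions of the two retweet_text values shown above plus one invented string with no link.

```r
# Pull every t.co short link out of a character vector of tweet text.
# Returns a list with one character vector of links per input string.
extract_tco <- function(text) {
  regmatches(text, gregexpr("https?://t\\.co/[A-Za-z0-9]+", text))
}

# Abbreviated copies of the retweet_text values above, plus a made-up
# string without any link.
retweet_text <- c(
  "(Lots more info in comments thread)\n#FolkloreThursday https://t.co/jjb6ngo0MQ",
  "(photos by @Mark_Ware) #FolkloreThursday https://t.co/OqrL5c8CIc",
  "a tweet with no link"
)

links <- extract_tco(retweet_text)
links
```

This only recovers the short links themselves. Finding out whether a link points to an image would still mean looking up the original tweet.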
So I think there is something going on involving truncation with retweets, but that’s maybe not the whole story? I’m not quite sure if there’s an easy fix – is there some argument I can pass to the Twitter API that would return the image links within the original tweet, or would that require a different query? I have a suspicion that Twitter might not let me pull this information? In any case, I’ve run short on time, so I’ll leave that to another time.
Wrapping up
This was fun! I guess I have a lot to learn about the Twitter API (note to self: here) but it’s definitely enjoyable!
Postscript
After doing a little more reading, it’s worth noting that the Twitter terms of service (as of May 6th!) do allow you to crawl the site, so long as the queries respect the robots.txt file; for creating derivative works (which, arguably, includes any data set compiled in the manner I did here), you must adhere to the Developer Agreement and Developer Policy. I think that what I’ve posted here is consistent with these - I try to take ToS provisions seriously 😄