I first experimented with word clouds several years ago and used them to visualise the speeches of Kevin Rudd and Malcolm Turnbull. I have now learned from the Fell Stats blog (via R-Bloggers) that there is an R package for generating word clouds. The package makes use of tm, a text mining package for R, which I have been meaning to look into for some time. So, it seemed only appropriate to explore the speeches of Tony Abbott.
This word cloud shows the 150 most-used words in Tony’s speeches over the last 18 years. Perhaps disappointingly, “cant” is not what it seems: my efforts to strip punctuation also stripped apostrophes, so it simply shows the frequency of the word “can’t”.
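To illustrate how the apostrophe goes missing, here is a minimal sketch in Python. It is not the R code behind the cloud: the tokenize function and the sample sentence are made up for illustration. It strips punctuation either with or without apostrophes and counts the resulting words.

```python
import re
from collections import Counter

def tokenize(text, keep_apostrophes=True):
    """Lower-case the text, normalise curly apostrophes, drop punctuation, split into words."""
    text = text.lower().replace("\u2019", "'")
    # Keep letters, digits and whitespace; optionally keep apostrophes as well.
    pattern = r"[^a-z0-9\s']" if keep_apostrophes else r"[^a-z0-9\s]"
    return re.sub(pattern, "", text).split()

# Made-up sample sentence, not taken from any speech.
sample = "We can't stop. We can\u2019t, and we won't!"

with_apostrophes = Counter(tokenize(sample))
without_apostrophes = Counter(tokenize(sample, keep_apostrophes=False))

print(with_apostrophes["can't"])    # counted as "can't"
print(without_apostrophes["cant"])  # the same word, now showing up as "cant"
```

With apostrophes removed, every “can’t” is counted under “cant”, which is the effect visible in the cloud.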
Pretty though the word cloud is, a little more can be gleaned from the word usage patterns through time. The correlation in recent years between “carbon” and “tax” is clearly due to Abbott’s attacks on Labor’s imposition of a price on carbon. His stint as health minister is also evident. I did expect to see more of an impact from his “stop the boats” campaign (here the count for “boat” includes “boats”).
Abbott word count through time
Admittedly, there are no particularly deep insights here, but it was a fun way to learn about the tm and wordcloud packages.
UPDATE: In response to the comment from Dan, I have added a chart showing word frequency rather than count. This accounts for distortions arising from the larger number of Abbott speeches in recent years.
Abbott word frequency through time
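The difference between the two charts can be sketched in a few lines of Python. This is only an illustration under assumptions, not the R code used for the charts: the (year, words) data and the function names are hypothetical. A count is the raw number of times a word appears in a year, while a frequency divides that count by the total number of words spoken in the same year.

```python
from collections import Counter

# Hypothetical (year, words) pairs, one per speech; not real speech data.
speeches = [
    (2008, ["health", "tax", "health"]),
    (2011, ["carbon", "tax", "carbon", "tax", "boats", "carbon"]),
    (2011, ["tax", "carbon", "health", "tax"]),
]

def yearly_counts(speeches):
    """Raw word counts, aggregated over all speeches in each year."""
    counts = {}
    for year, words in speeches:
        counts.setdefault(year, Counter()).update(words)
    return counts

def yearly_frequencies(counts):
    """Each word's count as a proportion of all words in that year."""
    freqs = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        freqs[year] = {word: n / total for word, n in counter.items()}
    return freqs

counts = yearly_counts(speeches)
freqs = yearly_frequencies(counts)

for year in sorted(counts):
    print(year, counts[year]["tax"], round(freqs[year]["tax"], 3))
```

In this toy data the count for “tax” quadruples between the two years simply because there are more words in the later year, while its share of all words barely moves.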
For those who are interested, I have uploaded the (Python) code for downloading the speeches and the (R) code for generating the charts to GitHub.
Sean, haven’t looked at the underlying data, but to what extent are the spikes to the right of the graphs (increases in mentions) simply a function of Abbott saying more since becoming leader?
I think a more interesting graph would be one which shows the changing frequency of a word as a proportion of all words (say for a year – because then you won’t have to seasonally adjust for the quiet periods over Xmas etc).
You might find no change. Or – I suspect – you might find less change. Obviously things like “Carbon” or this week “Muslim” will peak and trough on the news cycle, which is not really in Tony’s control.
It is a very fair point: speeches in the earlier years are fewer and further between. I will roll up my sleeves and turn it into frequencies.
Dan, I have added a frequency chart. There is a pickup in earlier years, but the broad pattern is similar. I also checked “muslim” but the frequencies were very low.
For some reason, 2008 was a quiet year for Tony with only five speeches.
As far as the apostrophe problem…
If you’re willing to download my beta package qdap: https://github.com/trinker/qdap you can use strip to remove punctuation except apostrophes.
library(devtools)
install_github("trinker/qdap")  # current devtools expects "user/repo"
library(qdap)
strip(scrubber(x), apostrophe.remove = FALSE)  # x is your character vector of text
Don’t bother with the package. Here are three ways to do it with regex:
gsub("[^[:alnum:][:space:]'\"]", "", x) # METHOD 1
gsub(".*?($|'|[^[:punct:]]).*?", "\\1", x) # METHOD 2
gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x) # METHOD 3
Thanks Tyler: I will give it a try with this regex magic.
Next Mulepost: “What is Tyler talking about?!?”