Tag Archives: R

ngramr – an R package for Google Ngrams

The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.

Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!

The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.

The Ngram Viewer will display an n-gram chart, but does not provide the underlying data for your own analysis. But all is not lost. The chart is produced using JavaScript and so the n-gram data is buried in the source of the web page in the code. It looks something like this:

// Add column headings, with escaping for JS strings.

data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');

// Add graph data, without autoescaping.

data.addRows(
[[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
[1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
...
[2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]
)

With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Here is how I did it:

ngram_parse <- function(html){
  if (any(grepl("No valid ngrams to plot!",
                html))) stop("No valid ngrams.") 
    
  cols <- lapply(strsplit(grep("addColumn", html,
                               value=TRUE), ","),
                getElement, 2)
  
  cols <- gsub(".*'(.*)'.*", "\\1", cols)

I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.

The core functions are ngram, which queries the Ngram viewer and returns a dataframe of frequencies, ngrami which does the same thing in a somewhat case insensitive manner (by which I mean that, for example, the results for "mouse", "Mouse" and "MOUSE" are all combined) and ggram which retrieves the data and plots the results using ggplot2. All of these functions allow you to specify various options, including the date range and the language corpus (Google can provide results for US English, British English or a number of other languages including German and Chinese).

The package is easy to install from GitHub and I may also post it on CRAN.

I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.

UPDATE: ngramr is now available on CRAN, making it much easier to install.

What is Tony talking about?

8 Replies

I first experimented with word clouds several years ago and used them to visualise the speeches of Kevin Rudd and Malcolm Turnbull. I have now learned from the Fell Stats blog (via R-Bloggers) that there is an R package for generating word clouds. The package makes use of tm, a text mining package for R, which I have been meaning to look into for some time. So, it seemed only appropriate to explore the speeches of Tony Abbott.

This word cloud shows the 150 most-used words in Tony’s speeches over the last 18 years. Perhaps disappointingly, since my efforts to strip punctuation also stripped apostrophes, “cant” actually only shows the frequency of the word “can’t”.

Pretty though the word cloud is, a little more can be gleaned from the word usage patterns through time. The correlation in recent years between “carbon” and “tax”, is clearly due to Abbott’s attacks on Labor’s imposition of a price on carbon. His stint as health minister is also evident. I did expect to see more of an impact from his “stop the boats” campaign (here the count for “boat” includes “boats”).

Abbott word count through time

Admittedly, there are no particularly deep insights here, but it was a fun way to learn about the tm and wordcloud packages.

UPDATE: In response to the comment from Dan, I have added a chart showing word frequency rather than count. This accounts for distortions arising from the larger number of Abbott speeches in recent years.

Abbott word frequency through time

For those who are interested, I have uploaded the (python) code for downloading the speeches and the (R) code for generating the charts to github.

Benford’s Law

24 Replies

Here is a quick quiz. If you visit the Wikipedia page List of countries by GDP, you will find three lists ranking the countries of the world in terms of their Gross Domestic Product (GDP), each list corresponding to a different source of the data. If you pick the list according to the CIA (let’s face it, the CIA just sounds more exciting than the IMF or the World Bank), you should have a list of figures (denominated in US dollars) for 216 countries. Ignore the fact that the European Union is in the list along with the individual counties, and think about the first digit of each of the GDP values. What proportion of the data points start with 1? How about 2? Or 3 through to 9?

If you think they would all be about the same, you have not come across Benford’s Law. In fact, far more of the national GDP figures start with 1 than any other digit and fewer start with 9 than any other digit. The columns in the chart below shows the distribution of the leading digits (I will explain the dots and bars in a moment).

Distribution of leading digits of GDP for 216 countries (in US$)

This phenomenon is not unique to GDP. Indeed a 1937 paper described a similar pattern of leading digit frequencies across a baffling array of measurements, including areas of rivers, street addresses of “American men of Science” and numbers appearing in front-page newspaper stories. The paper was titled “The Law of Anomalous Numbers” and was written by Frank Benford, who thereby gave his name to the phenomenon.

Benford’s Law of Anomalous Numbers states that that for many datasets, the proportion of data points with leading digit n will be approximated by

log₁₀(n+1) – log₁₀(n).

So, around 30.1% of the data should start with a 1, while only around 4.6% should start with a 9. The horizontal lines in the chart above show these theoretical proportions. It would appear that the GDP data features more leading 2s and fewer leading 3s than Benford’s Law would predict, but it is a relatively small sample of data, so some variation from the theoretical distribution should be expected.

As a variation of the usual tests of Benford’s Law, I thought I would choose a rather modern data set to test it on: Twitter follower numbers. Fortunately, there is an R package perfectly suited to this task: twitteR. With twitteR installed, I looked at all of the twitter users who follow @stubbornmule and recorded how many users follow each of them. With only a relatively small follower base, this gave me a set of 342 data points which follows Benford’s Law remarkably well.

;

Distribution of leading digits of follower counts

As a measure of how well the data follows Benford’s Law, I have adopted the approach described by Rachel Fewster in her excellent paper A Simple Explanation of Benford’s Law. For the statistically-minded, this involves defining a chi-squared statistic which measures “badness” of Benford fit. This statistic provides a “p value” which you can think of as the probability that Benford’s Law could produce a distribution that looks like your data set. The follower-count for @stubbornmule is a very high 0.97, which shows a very good fit to the law. By way of contrast, if those 342 data points had a uniform distribution of leading digits, the p value would be less than 10^-15, which would be a convincing violation of Benford’s Law.

Since so many data sets do follow Benford’s Law, this kind of statistical analysis has been used to detect fraud. If you were a budding Enron-style accountant set on falsifying your company’s accounts, you may not be aware of Benford’s Law. As a result, you may end up inventing too many figures starting with 9 and not enough starting with 1. Exactly this style of analysis is described in the 2004 paper The Effective Use of Benford’s Law to Assist in Detecting Fraud in Accounting Data by Durtshi, Hillison and Pacini.

By this point, you are probably asking one question: why does it work? It is an excellent question, and a surprisingly difficult and somewhat controversial one. At current count, an online bibliography of papers on Benford Law lists 657 papers on the subject. For me, the best explanation is Fewster’s “simple explanation” which is based her “Law of the Stripey Hat”. However simple it may be, it warrants a blog post of its own, so I will be keeping you in suspense a little longer. In the process, I will also explain some circumstances in which you should not expect Benford’s Law to hold (as an example, think about phone numbers in a telephone book).

In the meantime, having gone to the trouble of adapting Fewster’s R Code to produce charts testing how closely twitter follower counts fit Benford’s Law, I feel I should share a few more examples. My personal twitter account, @seancarmody, has more followers than @stubbornmule and the pattern of leading digits in my followers’ follower counts also provides a good illustration of Benford’s Law.

One of my twitter friends, @stilgherrian, has even more followers than I do and so provides an even larger data set.

Even though the bars seem to follow the Benford pattern quite well here, the p value is a rather low 5.5%. This reflects the fact that the larger the sample, the closer the fit should be to the theoretical frequencies if the data set really follows Benford’s Law. This result appears to be largely due to more leading 1s than expected and fewer leading 2s. To get a better idea of what is happening to the follower counts of stilgherrian’s followers, below is a density* histogram of the follower counts on a log₁₀ scale.

There are a few things we can glean from this chart. First, the spike at zero represents accounts with only a single follower, accounting around 1% of stilgherrian’s followers (since we are working on a log scale, the followers with no followers of their own do not appear on the chart at all). Most of the data is in the range 2 (accounts with 100 followers) to 3 (accounts with 1000 followers). Between 3 and 4 (10,000 followers), the distribution falls of rapidly. This suggests that the deviation from Benford’s Law is due to a fair number users with a follower count in the 1000-1999 range (I am one of those myself), but a shortage in the 2000-2999 range. Beyond that, the number of data points becomes too small to have much of an effect.

Histogram of follower counts of @stilgherrian’s followers

Of course, the point of this analysis is not to suggest that there is anything particularly meaningful about the follower counts of twitter users, but to highlight the fact that even the most peculiar of data sets found “in nature” is likely to yield to the power of Benford’s Law.

* A density histogram scales the vertical axis to ensure that the histogram covers a region of area one rather than the frequency of occurrences in each bin.

Hottest 100 for 2011

3 Replies

Another year, another Australia Day. Another Australia Day, another Triple J Hottest 100. And that, of course, means an excellent excuse to set R to work on the chart data.

For those outside Australia, the Hottest 100 is a chart of the most popular songs of the previous year, as voted by the listeners of the radio station Triple J. The tradition began in 1991, but initially people voted for their favourite song of all time. From 1993 onwards, the poll took its current form* and was restricted to tracks released in the year in question.

Since the Hottest 100 Wikipedia pages include country of origin**, I thought I would see whether there is any pattern in whose music Australians like best. Since it is Australia Day, it is only appropriate that we are partial to Australian artists and they typically make up close to half of the 100 entries. Interestingly, in the early 90s, Australian artists did not do so well. The United Kingdom has put in a good showing over the last two years, pulling ahead of the United States. Beyond the big three, Australia, UK and US, the pickings get slim very quickly, so I have only included Canada and New Zealand in the chart below.

Number of Hottest 100 tracks by Country

If you have excellent eyesight, you may notice that 2010 is missing from the chart. For some reason, this is the only year which does not include the full chart listing on the Wikipedia page. There is a link to a list on the ABC website, but unfortunately it does not include the country of origin. Maybe a keen Wikipedian reading this post will help by updating the page.

I make no great claims for the sophistication or the insight of this analysis: it was really an excuse to learn about using the XML package for R to pull data from tables in web pages.

require(XML)
require(ggplot2)
require(reshape2)

results <- data.frame()
col.names <- c("year", "rank", "title", "artist", "country")

# Skip 2010: full list is missing from Wikipedia page
years <- c(1993:2009, 2011)

for (year in years) {
    base.url <- "http://en.wikipedia.org/wiki/Triple_J_Hottest_100,"
    year.url <- paste(base.url, year, sep="_")
    tables <- readHTMLTable(year.url, stringsAsFactor=FALSE)
    table.len <- sapply(tables, length)
    hot <- cbind(year=year, tables[[which(table.len==4)]])
    names(hot) <- col.names
    results <- rbind(results, hot)
}

# Remap a few countries
results$country[results$country=="Australia [1]"] <- "Australia"
results$country[results$country=="England"] <- "United Kingdom"
results$country[results$country=="Scotland"] <- "United Kingdom"
results$country[results$country=="Wales"] <- "United Kingdom"
results$country[results$country=="England, Wales"] <-"United Kingdom"

# Countries to plot
top5 <- c("Australia", "United States", "United Kingdom",
  "Canada", "New Zealand")

# Create a colourful ggplot chart
plt <- ggplot(subset(results, country %in% top5),
    aes(factor(year), fill=factor(country)))
plt <- plt + geom_bar() + facet_grid(country ~ .)
plt <- plt + labs(x="", y="") + opts(legend.position = "none")

Created by Pretty R at inside-R.org

UPDATE: there is a little bit more analysis in this follow-up post.

* Since the shift to single year charts, there have been two all-time Hottest 100s: 1998 and 2009.

** There are some country combinations, such as “Australia/England”, but the numbers are so small I have simply excluded them from the analysis.

More colour wheels

2 Replies

In response to my post about colour wheels, I received a suggested enhancement from Drew. The idea is to first match colours based on the text provided and then add nearby colours. This can be done by ordering colours in terms of hue, saturation, and value. The result is a significant improvement and it will capture all of those colours with more obscure names.

Here is my variant of Drew’s function:

col.wheel <- function(str, nearby=3, cex=0.75) {
	cols <- colors()
	hsvs <- rgb2hsv(col2rgb(cols))
	srt <- order(hsvs[1,], hsvs[2,], hsvs[3,])
	cols <- cols[srt]
	ind <- grep(str, cols)
	if (length(ind) <1) stop("no colour matches found",
		call.=FALSE)
	s.ind <- ind
	if (nearby>1) for (i in 1:nearby) {
		s.ind <- c(s.ind, ind+i, ind-i)
	}
	ind <- sort(unique(s.ind))
	ind <- ind[ind <= length(cols)]
	cols <- cols[ind]
	pie(rep(1, length(cols)), labels=cols, col=cols, cex=cex)
	cols
}

I have included an additional parameter, nearby, which specifies the range of additional colours to include. A setting of 1 will include colours matching the specified string and also one colour on either side of each of these. By default, nearby is set to 3.

The wheel below shows the results for col.wheel(“thistle”, nearby=5). As well as the various shades of “thistle”, this also uncovers “plum” and “orchid”.

This is far more powerful than the original function: thanks Drew.

Colour wheels in R

11 Replies

Regular readers will know I use the R package to produce most of the charts that appear here on the blog. Being more quantitative than artistic, I find choosing colours for the charts to be one of the trickiest tasks when designing a chart, particularly as R has so many colours to choose from.

In R, colours are specified by name, with names ranging from the obvious “red”, “green” and “blue” to the obscure “mintycream”, “moccasin” and “goldenrod”. The full list of 657 named colours can be found using the colors() function, but that is a long list to wade through if you just want to get exactly the right shade of green, so I have come up with a shortcut which I thought I would share here*.

Below is a simple function called col.wheel which will display a colour wheel of all the colours matching a specified keyword.

col.wheel <- function(str, cex=0.75) {
	cols <- colors()[grep(str, colors())]
	pie(rep(1, length(cols)), labels=cols, col=cols, cex=cex)
	cols
	}

To use the function, simply pass it a string:

col.wheel("rod")

As well as displaying the colour wheel below, this will return a list of all of the colour names which include the specified string.

 [1] "darkgoldenrod"        "darkgoldenrod1"       "darkgoldenrod2"
 [4] "darkgoldenrod3"       "darkgoldenrod4"       "goldenrod"
 [7] "goldenrod1"           "goldenrod2"           "goldenrod3"
[10] "goldenrod4"           "lightgoldenrod"       "lightgoldenrod1"
[13] "lightgoldenrod2"      "lightgoldenrod3"      "lightgoldenrod4"
[16] "lightgoldenrodyellow" "palegoldenrod"

In fact, col.wheel will accept a regular expression, so you could get fancy and ask for colours matching “r.d” which will give you all the reds and the goldenrods.

This trick does have its limitations: you are not likely to find “thistle”, “orchid” or “cornsilk” this way, but I have found it quite handy and others may too.

*My tip was inspired by this page about R colours.

A gentle introduction to R

2 Replies

Whenever a post on this blog requires some data analysis and perhaps a chart or two, my tool of choice is the versatile statistical programming package R. Developed as an open-source implementation of an engine for the S programming language, R is therefore free. Since commercial mathematical packages can costs thousands of dollars, this alone makes R worth investigating. But what makes R particularly powerful is the large and growing array of specialised packages. For any statistical problem you come across, the chances are that someone has written a package that will make the problem much easier to get to grips with.

If it was not already clear, I am something of an R evangelist and I am not the only one. The growing membership of the Sydney Users of R Forum (SURF) suggests that we are getting some traction and there are a lot of people interested in learning more about R.

Sooner or later, every R beginner will come across An Introduction to R, which appears as the first link under Manuals on the R website. If you work your way through this introduction, you will get a good grounding in the essentials for using R. Unfortunately, it is very dry and it can be a challenge to get through. I certainly never managed to read it from start to finish in one sitting, but having used R for more than 10 years, I regularly return to read bits and pieces, so by now I have read and re-read it all many times. So, useful though this introduction is, it is not always a great place to start for R beginners.

There are many books available about R, including books focusing on the language itself, books on graphics in R, books on implementing particular statistical techniques in R and more than one introduction to R. A few weeks ago I was offered an electronic review copy of Statistical Analysis With R, a new beginner’s introduction to R by John M. Quick. Curious to see whether it could offer a good springboard into R, I decided to take up the offer.

At around 300 pages and covering a little less ground, it certainly takes a more leisurely pace than An Introduction to R. It also attempts a more engaging style by building a narrative around the premise that you have become a strategist for the Shu army in 3rd century China. The worked examples are all built around the challenge of looking at past battle statistics to determine the best strategy for a campaign against the rival Wei kingdom. Given how hard it can be to make an introduction to a statistical programming language exciting, it is certainly worth trying a novel approach. Still, some readers may find the Shu theme a little corny.

The book begins with instructions for downloading and installing R and goes on to explore the basics of importing and manipulating data, statistical exploration of the data (means, standard deviations and correlations), linear regression and finishes with a couple of chapters on producing and customising charts. This is a good selection of topics: mastery of these will provide beginners good grounding in the core capabilities of R. Readers with limited experience with statistics may be reassured that no assumptions are made about mathematical knowledge. The exploration of the battle data is used to provide a simple explanation of what linear regression is as well as the techniques available in R to perform the computations. While this approach certainly makes the book accessible to a broader audience, it is not without risks. Statistical tools are notorious for being abused by people who do not understand them properly. As a friend of mine likes to say, “drive-by regressions” can do a lot of damage!

Each chapter adopts the same structure: a brief introduction advancing the Shu story; a list of the topics covered in the chapter; a series of worked examples with sample commands to be entered into the R console followed by an explanatory “What just happened?” section and a “Pop quiz”; suggestions for further tasks for the readers to try; and finally a chapter summary. At times this approach feels a little repetitive (and the recurring heading “Have a go hero” for the suggested further tasks section may sound a little sarcastic to Australian readers at least), but it is thorough.

If I were to write my own introduction to R (one day perhaps?), I would do some things a little differently. I would try to explain a bit more about the semantics of the language, particularly the difference amongst the various data types (vectors, lists, data frames and so on). But perhaps that would just end up being as dry as An Introduction to R. Also, though I certainly agree with Quick that commenting your code is a very important discipline (even if no-one else ever reads it, you might have to read it again yourself!), I do think that he takes this principle too far in expecting readers to type all of the comments in the worked examples into the console!

Statistical Analysis With R is a very gentle introduction to R. If you have no prior experience of R, reading this book will certainly get you started. On the other hand, if you have already started experimenting with R, the pace may just be a little too slow.

Generate your own Risk Characterization Theatre

1 Reply

In the recent posts Visualizing Smoking Risk and Shades of grey I wrote about the use of “Risk Characterization Theatres” (RCTs) to communicate probabilities. I found the idea in the book The Illusion of Certainty, by Eric Rifkin and Edward Bouwer. Here is how they explain the RCTs:

Most of us are familiar with the crowd in a typical theater as a graphic illustration of a population grouping. It occurred to us that a theater seating chart would be useful for illustrating health benefit and risk information. With a seating capacity of 1,000, our Risk Characterization Theater (RCT) makes it easy to illustrate a number of important values: the number of individuals who would benefit from screening tests, the number of individuals contracting a disease due to a specific cause (e.g., HIV and AIDS), and the merits of published risk factors (e.g., elevated cholesterol, exposure to low levels of environmental contaminants).

As regular readers would know, most of the charts here on the blog are produced using the statistics and graphics tool called R. The RCT graphics were no exception. Writing the code involved painstakingly reproducing Rifkin and Bouwer’s theatre floor plan (as well as a few of my own design, including the stadium). For the benefit of anyone who would like to try generating their own RCTs, I have published the code on github.

Using the code is straightforward (once you have installed R). Copy the two files plans.Rdata and RCT.R onto your computer. Fire up R and switch to the directory containing the downloaded files. Load the code using the following command:

source("RCT.R")

You will then have a function available called rct which will generate the RCTs. Try the following examples:

rct(18)
rct(18, type="theatre")
rct(18, type="stadium")
rct(c(10, 8, 5))

The rct function has quite a few optional parameters to tweak the appearance of the theatre:

rct(cases, type=”square”, border=”grey”, fill=NULL, xlab=NULL, ylab=””, lab.cex=1, seed=NULL, label=FALSE, lab.col=”grey”, draw.plot=TRUE)

cases: single number or vector giving the number of seats to shade. If a vector is supplied, the values indicate how many seats of each colour to shade. The sum of this vector gives the total number of seats shaded
type: the floor plan to be used. Current options are “square”, “theatre” (the original Rifkin and Bouwer floor plan), “stadium” and “bigsquare”
border: the color for the outlines of the floor plan
fill: vector of colours for shading seats. If no value is supplied, the default is a sequence of shades of grey
xlab: text label to appear below floor plan. Default is “x cases in n”
lab.cex: character expansion factor (see ‘par’) to specify size of text labels (if any) on the floor plan
seed: specify the starting seed value for the random number generator. Setting this makes it possible to reproduce exactly the same shaded seats on successive calls of rct
label: if TRUE, any text labels for the specified floor plan will be displayed
lab.col: colour used for any text labels
draw.plot: if this is FALSE, the RCT is not drawn and instead a data frame is returned showing the seats that would have been shaded and the colours that would have been used

The Mule goes SURFing

3 Replies

A month ago I posted about “SURF”, the newly-established Sydney R user forum (R being an excellent open-source statistics tool). Shortly after publishing that post, I attended the inaugural forum meeting.

While we waited for attendees to arrive, a few people introduced themselves, explaining why they were interested in R and how much experience they had with the system. I was surprised at the diversity of backgrounds represented: there was someone from the department of immigration, a few from various areas within the health-care industry, a group from the Australian Copyright Council (I think I’ve got that right—it was certainly something to do with copyright), a few from finance, some academics and even someone from the office of state revenue.

Of the 30 or so people who came to the meeting, many classed themselves as beginners when it came to R (although most had experience with other systems, such as SAS). So if there’s anyone out there who was toying with the idea of signing up but hesitated out of concern that they know nothing about R, do not fear. You will not be alone.

The forum organizer, Eugene Dubossarsky, proceeded to give an overview of the recent growth in R’s popularity and also gave a live demo of how quickly and easily you can get R installed and running. Since there were so many beginners, Eugene suggested that a few of the more experienced users could act as mentors to those interested in learning more about R. As someone who has used R for over 10 years, I volunteered my services. So feel free to ask me any and all of your R questions!

As well as being a volunteer mentor, I will have the pleasure of being the presenter at the next forum meeting on the 18th of August. Regular readers of the Stubborn Mule will not be surprised to learn that the topic I have chosen is The Power of Graphics in R. Here’s the overview of what I will be talking about:

In addition to its statistical computing prowess, R is one of the most sophisticated and flexible tools around for visualizing quantitative data. It can produce a wide variety of chart types, including scatter plots, box plots, dot plots, mosaic plots, 3D charts and more. Tweaking chart settings and adding customized annotations is a breeze and the charts can readily be output to a range of formats including images (jpeg or png), PDF and metafile formats.

Topics covered in this talk include:

Getting started with graphing in R

The basic charting types available

Customising charts (labels, axes, colour, annotations and more)

Managing different output formats

A look at the more advanced charting packages: lattice and ggplot2

Anyone who ever has a need to visualize their data, whether simply for exploration or for producing slick graphics for reports and presentations can benefit from learning to use R’s graphics features. The material presented here will get you well on your way. If you have ever been frustrated when trying to get charts in Excel to behave themselves, you will never look back once you switch to R.

For those of you in Sydney who are interested in a glimpse of how I use R to produce the charts you see here on the blog, feel free to come along. I hope to see you there!

Stubborn Mule

Obstinately objective