ngramr – an R package for Google Ngrams

The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.

Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!

The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.

[Figure: Freud vs Marx ngram chart]

The Ngram Viewer will display an n-gram chart, but it does not provide the underlying data for your own analysis. All is not lost, though: the chart is produced using JavaScript, so the n-gram data is buried in the source of the web page. It looks something like this:

// Add column headings, with escaping for JS strings.

data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');

// Add graph data, without autoescaping.

data.addRows(
[[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
[1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
...
[2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]
)

With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Here is how I did it:

ngram_parse <- function(html){
  # Bail out if Google reports that no n-grams were found
  if (any(grepl("No valid ngrams to plot!", html))) stop("No valid ngrams.")

  # Pull the phrase names out of the data.addColumn() calls
  cols <- lapply(strsplit(grep("addColumn", html, value=TRUE), ","),
                 getElement, 2)
  cols <- gsub(".*'(.*)'.*", "\\1", cols)

  # ... the function goes on to locate the data.addRows() call and parse
  # the array with RJSONIO::fromJSON() into a dataframe
}
I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.

The core functions are ngram, which queries the Ngram Viewer and returns a dataframe of frequencies; ngrami, which does the same thing in a somewhat case-insensitive manner (by which I mean that, for example, the results for "mouse", "Mouse" and "MOUSE" are all combined); and ggram, which retrieves the data and plots the results using ggplot2. All of these functions allow you to specify various options, including the date range and the language corpus (Google can provide results for US English, British English or a number of other languages, including German and Chinese).
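
Typical usage looks something like this (a quick illustration rather than anything lifted from the package documentation, using the argument names that appear in the comments further down):

library(ngramr)
library(ggplot2)

# Frequencies for "Marx" and "Freud" in the US English corpus since 1900
freud_marx <- ngram(c("Marx", "Freud"), corpus="eng_us_2012", year_start=1900)
head(freud_marx)

# Case-insensitive version: combines "mouse", "Mouse", "MOUSE", ...
mouse <- ngrami("mouse", year_start=1900)

# Retrieve and plot in one step
ggram(c("Marx", "Freud"), year_start=1900)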

The package is easy to install from GitHub and I may also post it on CRAN.
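
Installation from GitHub goes via the devtools package, along these lines (the exact repository path is assumed here, so check the GitHub page if it fails):

# install.packages("devtools")  # if devtools is not already installed
library(devtools)
install_github("seancarmody/ngramr")  # repository path assumed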

I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.

UPDATE: ngramr is now available on CRAN, making it much easier to install.
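
That means the standard install now works:

install.packages("ngramr")
library(ngramr)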


36 thoughts on “ngramr – an R package for Google Ngrams”

  1. Fr.

    Nice job! I am not sure I understand the smoothing parameter, though, and the aggregate argument has failed a few quick tests. I have made a few suggestions to make the function a bit more robust and to provide more flexibility with geoms. I’m also suggesting the GGally package as a possible candidate to publish the function in.

  2. Stubborn Mule Post author

    @Fr thanks for the suggested edits over on github: I have incorporated your suggestions. Could you give me some examples of the errors you got with your aggregate tests? I suspect that there is scope for further error trapping!

  3. Stephen Peplow

    Hi—I can’t seem to get ngramr working. I have tried both the devtools method and installing from a local zip file. Here is the error message:

    Error in .install_package_code_files(".", instdir) :
    files in ‘Collate’ field missing from ‘C:/Users/Stephen/AppData/Local/Temp/RtmpEBIzzL/ngramr-master/R’:
    themes.R
    ERROR: unable to collate and parse R files for package ‘ngramr’
    * removing ‘C:/Users/Stephen/Documents/R/win-library/2.15/ngramr’
    Error: Command failed (1)

    This is a really neat application and I’d love to get it going.
    Thanks for any help
    Stephen

  4. Stubborn Mule Post author

    @Stephen: sorry about that. I’ve been tweaking some new functionality and seem to have broken it! I will let you know as soon as it is fixed.

  5. Stubborn Mule Post author

    Yes indeed: I was part way through some changes and must have pushed them prematurely up to GitHub. I will fix them when I get home and make sure I adopt better practice and establish a development branch!

  6. Stephen Peplow

    Thanks — I got it working, but thought you should know: downloading from the ZIP file didn’t work. It just stops. Downloading from GitHub worked, except users should be aware that they’ll need to update their version of R. Small thing: the example code you give at the top for hacker etc. doesn’t include require(ggplot2). I am going to write up my own example and will send you a link. Thanks for all this!

  7. Stubborn Mule Post author

    @Stephen: what problem did you have with the ZIP install? My testing looked like this:

    > library(devtools)
    > install_local("~/Downloads/ngramr-master.zip")
    Installing package from ~/Downloads/ngramr-master.zip
    Installing ngramr
    '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  
      '/private/var/folders/0h/3l97r8gd48jbm5mlhj_2pmf80000gn/T/RtmpkReS0d/ngramr-master'  
      --library='/Library/Frameworks/R.framework/Versions/3.0/Resources/library' --with-keep.source  
      --install-tests 
    
    * installing *source* package 'ngramr' ...
    ** R
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded
    * DONE (ngramr)
    
  8. Pingback: Words and culture | GIS and Statistics at KPU

  9. Maxine

    This is fantastic! I was just wondering how R goes about dealing with accents? This seems to be somewhat of a barrier for working with the non-English corpora. Thanks again for such a useful bit of code!

  10. Stubborn Mule Post author

    While I have not tested this extensively, R, Google ngrams and ngramr seem to behave ok with accents. For example, this seems to work fine:

    ggram("soufflé", corpus="fre_2012", year_start=1800)

  11. Maxine

    Thanks for replying! That’s funny, I can see that works fine but when I look at certain words I get an error. For example:
    ggram("fécondité", corpus="fre_2012", year_start=1800)
    gives me:
    Error in data.frame(..., check.names = FALSE) :
    arguments imply differing number of rows: 209, 0
    Any ideas?

  12. Stubborn Mule Post author

    @Maxine. I tried

    ggram("fécondité", corpus="fre_2012", year_start=1800)

    and it worked. Here is the result. Have you got the latest version of ngramr (and other packages)? If you are using RStudio (which I would highly recommend!) this can be done via Tools -> Check for Package Updates.

  13. Maxine

    Thanks, at least I know it’s on my end! I’ve just started using RStudio and have the same issue. For some reason my R is just ignoring the accents, as when I try “soufflé” for example I get the results for “souffl”, and words with an accent within the word (like fécondité) are returning no results, hence the error. Very strange.

  14. Stubborn Mule Post author

    @Maxine I think I have got to the bottom of the problem: a different approach to character encoding on Windows. I think I have sorted out a fix, so will submit it to CRAN. I will keep you posted.
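
    For anyone running into this in the meantime, the general idea is to normalise the phrases to UTF-8 before they are sent off to Google; roughly along these lines (illustrative only, not necessarily the fix that has gone into the package):

    # Convert phrases from the native (e.g. Windows) encoding to UTF-8
    phrases <- c("fécondité", "soufflé")
    phrases <- iconv(phrases, from="", to="UTF-8")  # "" means the current locale encoding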

  15. Matt

    Thanks for writing such a useful package! While working with more obscure words, I’ve encountered a potential bug with the ngrami function. The line
    >ngram("pulser", corpus="fre_2012")
    returns the expected full dataset while the case insensitive function,
    >ngrami("pulser", corpus="fre_2012")
    returns an error. I believe this is because it is trying to combine the results from “pulser” and “Pulser”, the latter of which is empty.
    (When I enter >ngram("Pulser", corpus="fre_2012"), it returns an error because there aren’t any instances of it in the corpus.) I’m using some workarounds, but I figure a fix is possible in the code itself.
    Thanks for your help,
    Matt Blackshaw

  16. will

    Have you have any experience with this error message?

    Error in fromJSON(sub(".*=", "", html[data_line])) :
    CHAR() can only be applied to a ‘CHARSXP’, not a ‘pairlist’

    It seems to be caused by repeated ngram calls — I encountered it in a loop to build a matrix of more than 12 ngrams. Is there a capacity constraint built in by Google?

  17. Stubborn Mule Post author

    I have not seen that error, but it is certainly possible that there is a capacity constraint: Google changes the way its pages work quite often! Can you post some sample code?

  18. suz

    G’day! Thanks for the great work with the package, I love it. Here are two questions from a noob, who wants to query a few dozen words at once:

    1. I tried to build a for-loop (not experienced with looping, I admit), but as you need to quote the phrase, I’m unsure on how you’d call ngram() with indexes. I couldn’t find any discussion apart from will’s comment above, so I guess it is possible to loop over a vector with strings. Mine’s not a capacity problem (not there yet!). I just get the error “is.character(phrases) is not TRUE”. I tried ngram('"'cat[i]'"'), corpus="eng_us_2012"), but I guess my fault lies elsewhere; like not understanding for-loops. (I only have a few dozen words I’d like to query.) Any suggestions on this issue?

    2. What am I doing wrong with the ngrami() function? It returns a line “Browse[1]> ” that expects user input?

  19. suz

    Thanks! Q1 seems somewhat solved; it works fine with sapply(). So my problem has more to do with a misunderstanding of for-loops, I guess.
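
    E.g. something like this (word list invented for the example):

    words <- c("cat", "dog", "mouse")  # made-up example list
    results <- sapply(words, ngram, corpus="eng_us_2012", simplify=FALSE)
    results <- do.call(rbind, results)  # stack into a single dataframe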

  20. Stubborn Mule Post author

    There is a problem with the ngrami code. I’ve submitted an update to CRAN, so check for an update in 24 hours or so. Thanks for picking it up!

  21. catphish

    Great script! Quick question: I can’t get any data back for phrases/words that include apostrophes. To process these, Google adds a space (e.g. “X ‘s Y”), but even when I do this the script skips the phrase. Any ideas?
