ngramr – an R package for Google Ngrams

The recent post How common are common words? made use of unusually explicit language for the Stubborn Mule. As expected, a number of email subscribers reported that the post fell foul of their email filters. Here I will return to the topic of n-grams, while keeping the language cleaner, and describe the R package I developed to generate n-gram charts.

Rather than an explicit language warning, this post carries a technical language warning: regular readers of the blog who are not familiar with the R statistical computing system may want to stop reading now!

The Google Ngram Viewer is a tool for tracking the frequency of words or phrases across the vast collection of scanned texts in Google Books. As an example, the chart below shows the frequency of the words “Marx” and “Freud”. It appears that Marx peaked in popularity in the late 1970s and has been in decline ever since. Freud persisted for a decade longer but has likewise been in decline.

[Figure: Freud vs Marx ngram chart]

The Ngram Viewer will display an n-gram chart, but it does not provide the underlying data for your own analysis. All is not lost, though: the chart is produced using JavaScript, so the n-gram data is buried in the source of the web page. It looks something like this:

// Add column headings, with escaping for JS strings.

data.addColumn('number', 'Year');
data.addColumn('number', 'Marx');
data.addColumn('number', 'Freud');

// Add graph data, without autoescaping.

data.addRows(
[[1900, 2.0528437403299904e-06, 1.2246303970897543e-07],
[1901, 1.9467918036752963e-06, 1.1974195999187031e-07],
...
[2008, 1.1858645848406013e-05, 1.3913611155658145e-05]]
)

With the help of the RJSONIO package, it is easy enough to parse this data into an R dataframe. Here is how I did it:

ngram_parse <- function(html){
  # Bail out if Google reports that no n-grams were found
  if (any(grepl("No valid ngrams to plot!", html))) stop("No valid ngrams.")

  # Pull the phrase names out of the data.addColumn() calls
  cols <- lapply(strsplit(grep("addColumn", html, value=TRUE), ","),
                 getElement, 2)
  cols <- gsub(".*'(.*)'.*", "\\1", cols)

  # ... the function goes on to locate the data.addRows() call and parse
  # the array with RJSONIO::fromJSON() into a dataframe
}
I realise that is not particularly beautiful, so to make life easier I have bundled everything up neatly into an R package which I have called ngramr, hosted on GitHub.

The core functions are ngram, which queries the Ngram Viewer and returns a dataframe of frequencies; ngrami, which does the same thing in a somewhat case-insensitive manner (by which I mean that, for example, the results for "mouse", "Mouse" and "MOUSE" are all combined); and ggram, which retrieves the data and plots the results using ggplot2. All of these functions allow you to specify various options, including the date range and the language corpus (Google can provide results for US English, British English or a number of other languages, including German and Chinese).
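
Typical usage looks something like this (a quick illustration rather than anything lifted from the package documentation, using the argument names that appear in the comments further down):

library(ngramr)
library(ggplot2)

# Frequencies for "Marx" and "Freud" in the US English corpus since 1900
freud_marx <- ngram(c("Marx", "Freud"), corpus="eng_us_2012", year_start=1900)
head(freud_marx)

# Case-insensitive version: combines "mouse", "Mouse", "MOUSE", ...
mouse <- ngrami("mouse", year_start=1900)

# Retrieve and plot in one step
ggram(c("Marx", "Freud"), year_start=1900)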

The package is easy to install from GitHub and I may also post it on CRAN.
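
Installation from GitHub goes via the devtools package, along these lines (the exact repository path is assumed here, so check the GitHub page if it fails):

# install.packages("devtools")  # if devtools is not already installed
library(devtools)
install_github("seancarmody/ngramr")  # repository path assumed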

I would be very interested in feedback from anyone who tries out this package and will happily consider implementing any suggested enhancements.

UPDATE: ngramr is now available on CRAN, making it much easier to install.
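
That means the standard install now works:

install.packages("ngramr")
library(ngramr)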


36 thoughts on “ngramr – an R package for Google Ngrams”

  1. Fr.

    Nice job! I am not sure I understand the smoothing parameter, though, and the aggregate argument has failed a few quick tests. I have made a few suggestions to make the function a bit more robust and to provide more flexibility with geoms. I’m also suggesting the GGally package as a possible candidate to publish the function in.

  2. Stubborn Mule Post author

    @Fr thanks for the suggested edits over on github: I have incorporated your suggestions. Could you give me some examples of the errors you got with your aggregate tests? I suspect that there is scope for further error trapping!

  3. Stephen Peplow

    Hi—I can’t seem to get ngramr working. I have tried both the devtools method and installing from a local zip file. Here is the error message:

    Error in .install_package_code_files(".", instdir) :
    files in ‘Collate’ field missing from ‘C:/Users/Stephen/AppData/Local/Temp/RtmpEBIzzL/ngramr-master/R’:
    themes.R
    ERROR: unable to collate and parse R files for package ‘ngramr’
    * removing ‘C:/Users/Stephen/Documents/R/win-library/2.15/ngramr’
    Error: Command failed (1)

    This is a really neat application and I’d love to get it going.
    Thanks for any help
    Stephen

  4. Stubborn Mule Post author

    @Stephen: sorry about that. I’ve been tweaking some new functionality and seem to have broken it! I will let you know as soon as it is fixed.

  5. Stubborn Mule Post author

    Yes indeed: I was part way through some changes and must have pushed them prematurely up to GitHub. I will fix them when I get home and make sure I adopt better practice and establish a development branch!

  6. Stephen Peplow

    Thanks — I got it working, but thought you should know: downloading from the ZIP file didn’t work. It just stops. Downloading from GitHub worked, except users should be aware that they’ll need to update their version of R. Small thing: the example code you give at the top for hacker etc. doesn’t include require(ggplot2). I am going to write up my own example and will send you a link. Thanks for all this!

  7. Stubborn Mule Post author

    @Stephen: what problem did you have with the ZIP install? My testing looked like this:

    > library(devtools)
    > install_local("~/Downloads/ngramr-master.zip")
    Installing package from ~/Downloads/ngramr-master.zip
    Installing ngramr
    '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  
      '/private/var/folders/0h/3l97r8gd48jbm5mlhj_2pmf80000gn/T/RtmpkReS0d/ngramr-master'  
      --library='/Library/Frameworks/R.framework/Versions/3.0/Resources/library' --with-keep.source  
      --install-tests 
    
    * installing *source* package 'ngramr' ...
    ** R
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded
    * DONE (ngramr)
    
  8. Pingback: Words and culture | GIS and Statistics at KPU

  9. Maxine

    This is fantastic! I was just wondering how R goes about dealing with accents? This seems to be somewhat of a barrier for working with the non-English corpora. Thanks again for such a useful bit of code!

  10. Stubborn Mule Post author

    While I have not tested this extensively, R, Google ngrams and ngramr seem to behave ok with accents. For example, this seems to work fine:

    ggram("soufflé", corpus="fre_2012", year_start=1800)

  11. Maxine

    Thanks for replying! That’s funny, I can see that works fine but when I look at certain words I get an error. For example:
    ggram("fécondité", corpus="fre_2012", year_start=1800)
    gives me:
    Error in data.frame(..., check.names = FALSE) :
    arguments imply differing number of rows: 209, 0
    Any ideas?

  12. Stubborn Mule Post author

    @Maxine. I tried

    ggram("fécondité", corpus="fre_2012", year_start=1800)

    and it worked. Here is the result. Have you got the latest version of ngramr (and other packages)? If you are using RStudio (which I would highly recommend!) this can be done via Tools -> Check for Package Updates.

  13. Maxine

    Thanks, at least I know it’s on my end! I’ve just started using RStudio and have the same issue. For some reason my R is just ignoring the accents, as when I try “soufflé” for example I get the results for “souffl”, and words with an accent within the word (like fécondité) are returning no results, hence the error. Very strange.

  14. Stubborn Mule Post author

    @Maxine I think I have got to the bottom of the problem: a different approach to character encoding on Windows. I think I have sorted out a fix, so will submit it to CRAN. I will keep you posted.
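
    For anyone running into this in the meantime, the general idea is to normalise the phrases to UTF-8 before they are sent off to Google; roughly along these lines (illustrative only, not necessarily the fix that has gone into the package):

    # Convert phrases from the native (e.g. Windows) encoding to UTF-8
    phrases <- c("fécondité", "soufflé")
    phrases <- iconv(phrases, from="", to="UTF-8")  # "" means the current locale encoding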

  15. Matt

    Thanks for writing such a useful package! While working with more obscure words, I’ve encountered a potential bug with the ngrami function. The line
    >ngram("pulser", corpus="fre_2012")
    returns the expected full dataset while the case insensitive function,
    >ngrami("pulser", corpus="fre_2012")
    returns an error. I believe this is because it is trying to combine the results from “pulser” and “Pulser”, the latter of which is empty.
    (When I enter >ngram("Pulser", corpus="fre_2012"), it returns an error because there aren’t any instances of it in the corpus.) I’m using some workarounds, but I figure a fix is possible in the code itself.
    Thanks for your help,
    Matt Blackshaw

  16. will

    Have you have any experience with this error message?

    Error in fromJSON(sub(".*=", "", html[data_line])) :
    CHAR() can only be applied to a ‘CHARSXP’, not a ‘pairlist’

    It seems to be caused by repeated ngram calls — I encountered it in a loop to build a matrix of more than 12 ngrams. Is there a capacity constraint built in by Google?

  17. Stubborn Mule Post author

    I have not seen that error, but it is certainly possible that there is a capacity constraint: Google changes the way its pages work quite often! Can you post some sample code?

  18. suz

    G’day! Thanks for the great work with the package, I love it. Here are two questions from a noob, who wants to query a few dozen words at once:

    1. I tried to build a for-loop (not experienced with looping, I admit), but as you need to quote the phrase, I’m unsure on how you’d call ngram() with indexes. I couldn’t find any discussion apart from will’s comment above, so I guess it is possible to loop over a vector with strings. Mine’s not a capacity problem (not there yet!). I just get the error “is.character(phrases) is not TRUE”. I tried ngram('"'cat[i]'"'), corpus="eng_us_2012"), but I guess my fault lies elsewhere; like not understanding for-loops. (I only have a few dozen words I’d like to query.) Any suggestions on this issue?

    2. What am I doing wrong with the ngrami() function? It returns a line “Browse[1]> ” that expects user input?

  19. suz

    Thanks! Q1 seems somewhat solved; it works fine with sapply(). So my problem has more to do with a misunderstanding of for-loops, I guess.
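
    E.g. something like this (word list invented for the example):

    words <- c("cat", "dog", "mouse")  # made-up example list
    results <- sapply(words, ngram, corpus="eng_us_2012", simplify=FALSE)
    results <- do.call(rbind, results)  # stack into a single dataframe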

  20. Stubborn Mule Post author

    There is a problem with the ngrami code. I’ve submitted an update to CRAN, so check for an update in 24 hours or so. Thanks for picking it up!

  21. catphish

    Great script! Quick question: I can’t get any data back for phrases/words that include apostrophes. To process these, Google adds a space (e.g. “X ‘s Y”), but even when I do this the script skips the phrase. Any ideas?
