How common are common words?

One of my favourite podcasts is Slate’s Lexicon Valley. All about language, it is rigorous and detailed in its approach to the subject, which appeals to the closet academic in me, but also extremely entertaining. It is a sign of a good podcast to find yourself bursting out laughing while walking down a busy city street. Lexicon Valley is to blame for numerous moments of alarm for my fellow commuters.

In September last year, hosts Mike Vuolo (the knowledgeable one) and Bob Garfield (the funny one) interviewed linguist Geoffrey Nunberg, talking to him about his recent book, Ascent of the A-Word: Assholism, the First Sixty Years. A half-hour discussion of the evolution of the word “asshole” helps earn this podcast an “Explicit” tag in the iTunes store and, as a result, this will be the first Stubborn Mule post that may fall victim to email filters. Apologies in advance to anyone of a sensitive disposition and to any email subscribers this post fails to reach.

Nunberg traces the evolution of “asshole” from its origins among US soldiers in the Second World War through to its current role as a near-universal term of abuse for arrogant boors lacking self-awareness. Along the way, he explores the differences between profanity (swearing drawing on religion), obscenity (swearing drawing on body parts and sexual activity) and plain old vulgarity (any of the above).

The historical perspective of the book is supported by charts using Google “n-grams”. An n-gram is any word or phrase found in a book, and one type of quantitative analysis used by linguists is to track the frequency of n-grams in a “corpus” of books. After working for years with libraries around the world, Google has amassed a particularly large corpus: Google Books. Conveniently for researchers like Nunberg, with the help of the Google n-gram Viewer, anyone can analyse n-gram frequencies across the Google Books corpus. For example, the chart below shows that “asshole” is far more prevalent in books published in the US than in the UK. No surprises there.

"Asshole" frequency US vs UK

Use of “asshole” in US and UK Books

If “asshole” is the American term, the Australian and British equivalent should be “arsehole”, but surprisingly arsehole is less common than asshole in the British Google Books corpus. This suggests that, while being a literal equivalent to asshole, arsehole really does not perform the same function. If anything, it would appear that the US usage of asshole bleeds over to Australia and the UK.

Asshole/Arsehole frequencies

“asshole” versus “arsehole”
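The frequencies plotted in charts like these boil down to a simple ratio: how often the n-gram appears in books from a given year, divided by the total number of words in books from that year. The following is only a rough sketch of that arithmetic in Python, not Google's actual pipeline. The function name and the tiny corpus are made up for illustration, and it handles single words only.

```python
from collections import Counter


def yearly_frequency(books, word):
    """Return {year: share of all words published that year that equal `word`}.

    `books` is an iterable of (year, text) pairs. Tokenisation here is a
    naive lower-case split on whitespace, which is an assumption made for
    illustration only.
    """
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # total words per year
    for year, text in books:
        words = text.lower().split()
        hits[year] += words.count(word)
        totals[year] += len(words)
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


# Made-up miniature "corpus": two books from 1950, one from 2000.
toy_books = [
    (1950, "the old mule was stubborn"),
    (1950, "a mule is a mule"),
    (2000, "the mule rested"),
]

freq = yearly_frequency(toy_books, "mule")
for year, share in freq.items():
    print(year, round(share, 3))
```

Because each year's count is divided by that year's total, a word can appear more often in absolute terms and still fall in frequency if publishing as a whole grows faster.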

Intriguing though these n-gram charts are, they should be interpreted with caution, as I learned when I first tried to replicate some of Nunberg’s charts.

The chart below is taken from Ascent of the A-Word and compares growth in the use of the words “asshole” and “empathetic”. The frequencies are scaled relative to the frequency of “asshole” in 1972*. At first, try as I might, I could not reproduce Nunberg’s results. Convinced that I must have misunderstood the book’s explanation of the scaling, I wrote to Nunberg. His more detailed explanation confirmed my original interpretation, but meant that I still could not reproduce the chart.

Nunberg's chart: asshole versus empathy

Relative growth of “empathetic” and “asshole”
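As I understand the scaling, every value in both series is divided by a single number: the frequency of “asshole” in the reference year. Here is a minimal sketch in Python under that assumption. The function name is my own invention and the frequencies are made-up numbers, not figures from Google or from the book.

```python
def scale_to_reference(series, reference_series, reference_year):
    """Divide every value in `series` by the reference series' value in
    `reference_year`, so the reference series equals 1.0 in that year."""
    base = reference_series[reference_year]
    return {year: value / base for year, value in series.items()}


# Made-up frequencies, for illustration only.
asshole = {1971: 2.0e-7, 1972: 2.5e-7, 2000: 1.0e-6}
empathetic = {1971: 1.0e-7, 1972: 1.25e-7, 2000: 7.5e-7}

asshole_scaled = scale_to_reference(asshole, asshole, 1972)
empathetic_scaled = scale_to_reference(empathetic, asshole, 1972)

for year in sorted(asshole):
    print(year, round(asshole_scaled[year], 2), round(empathetic_scaled[year], 2))
```

Switching the reference year from 1972 to 1971 multiplies every point by the same constant, so it stretches the vertical axis but leaves the shape of both curves unchanged.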

Then I had an epiphany. It turns out that Google has published two sets of n-gram data. The first release of the data was based on an analysis of the Google Books collection in July 2009, described in the paper Michel, Jean-Baptiste, et al. “Quantitative analysis of culture using millions of digitized books” Science 331, No. 6014 (2011): 176-182. As time passed, Google continued to build the Google Books collection and in July 2012 a second n-gram data set was assembled. As the charts below show, the growth of “asshole” and “empathetic” is somewhat different depending on which edition of the n-gram data set is used. I had been using the more recent 2012 data set and, evidently, Nunberg used the 2009 data set. While either chart would support the same broad conclusions, the differences show that smaller movements in these charts are likely to be meaningless and not too much should be read into anything other than large-scale trends.

Empathy frequency: 2009 versus 2012

Comparison of the 2009 and 2012 Google Books corpuses
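One rough way to put a number on “smaller movements” is to measure how far the two editions disagree about the same word in the same year, and to treat any wiggle smaller than that as suspect. The sketch below is just my own back-of-the-envelope heuristic in Python, not an established method. The function name and the figures are invented for illustration.

```python
def relative_differences(old, new):
    """Per-year relative gap between two editions of the same series,
    computed only for years present in both and with a non-zero old value."""
    shared = sorted(set(old) & set(new))
    return {year: abs(new[year] - old[year]) / old[year]
            for year in shared if old[year] > 0}


# Made-up frequencies for one word in two hypothetical editions.
edition_2009 = {1980: 4.0e-7, 1990: 5.0e-7, 2000: 8.0e-7}
edition_2012 = {1980: 5.0e-7, 1990: 5.5e-7, 2000: 8.0e-7}

gaps = relative_differences(edition_2009, edition_2012)
# Largest disagreement between editions: a crude "noise floor".
noise_floor = max(gaps.values())

for year, gap in gaps.items():
    print(year, f"{gap:.0%}")
print("largest gap between editions:", f"{noise_floor:.0%}")
```

With these invented numbers the editions differ by as much as a quarter in a single year, so a year-to-year change of a few percent in either chart would not mean much.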

So far I have not done very much to challenge anyone’s email filters. I can now fix that by moving on to a more recent Lexicon Valley episode, A Brief History of Swearing. This episode featured an interview with Melissa Mohr, the author of Holy Shit: A Brief History of Swearing. In this book Mohr goes all the way back to Roman times in her study of bad language. Well-preserved graffiti in Pompeii is one of the best sources of evidence we have of how to swear in Latin. Some Latin swear words were very much like our own, others were very different.

Of the “big six” swear words in English, namely ass, cock, cunt, fuck, prick and piss (clearly not all as bad as each other!), five had equivalents in Latin. The only one missing was “piss”. It was common practice to urinate in jars left in the street by fullers who used diluted urine to wash clothing. As a result, urination was not particularly taboo and so not worthy of being the basis for vulgarity. Mohr goes on to enumerate another five Latin swear words to arrive at a list of the Roman “big ten” obscenities. One of these was the Latin word for “clitoris”, which was a far more offensive word than “clit” is today. I also learned that our relatively polite, clinical terms “penis”, “vulva” and “vagina” all derive from obscene Latin words. It was the use of these words by the upper class during the Renaissance, speaking in Latin to avoid corrupting the young, that caused these words to become gentrified.

Unlike Nunberg, Mohr does not make use of n-grams in her book, which provides a perfect opportunity for me to track the frequency of the big six English swear words.

Big 6 Swearwords

Frequency of the “Big Six” swear words

The problem with this chart is that the high frequency of “ass” and “cock”, particularly in centuries gone by, is likely augmented by their use to refer to animals. Taking a closer look at the remaining four shows just how popular the use of “fuck” became in the second half of the twentieth century, although “cunt” and “piss” have seen modest (or should I say immodest) growth. Does this mean we are all getting a little more accepting of bad language? Maybe I need to finish reading Holy Shit to find out.

Big 4 Swear Words

Frequency of four of the “Big Six” swear words

* The label on the chart indicates that the reference year is 1972, but by my calculations the reference year is in fact 1971.

24 thoughts on “How common are common words?”

  1. apj

    Mule, never have I read a blog and experienced such an array of reactions. Mildly amused to start, lol at the ‘bleeding arse’ before lmfao at the arsehole/asshole charts, before settling into a serious and analytical pose as I tried to stick with your discussion of 2009-12 data sets, before going ‘off again’ with the ‘big 6’ charts. Just awesome.

  2. Senexx

    Best. Post. Ever.

    And surprisingly not for the vulgarity. I’m somewhat of an amateur etymologist as well and love this stuff.

    I must ask though as there is an ongoing debate between my mates whether being called a “cunt” or a “dick” is more offensive. What say you humble readers of Stubborn Mule?

  3. apj

    Well I even describe myself as a ‘d’ sometimes, but never a ‘c’ – not much room for humour with the latter …

  4. James

    Interesting that the writers of Battlestar Galactica got away with people saying “fuck” all the time by changing it to “frack”, which became a mild swear word in its own right (at least amongst Big Bang Theory fans). Similarly on “Father Ted” it is amusing rather than shocking to hear Ted use the Gaelic “feck”. Or you can get away with the whimsical Irish “shite”. Following the success of “frack” the new SyFy series “Defiance” made up the word “shtako”. They risked reviewers describing it as “shtako”, which some did.

    On the female pudenda I have 2 cats which are both called Minoo (which in cat speak means “food”). “Minou, Minou!” is what French housewives yell out to call their cats for dinner. It is the equivalent of the English “puss, puss” and incidentally has the same alternative meaning as “pussy” (and occasionally rendered in English as such as “minnie”)

  5. Stubborn Mule Post author

    @James along with cock/penis and cunt/vagina your examples highlight the curious fact that, apart from amongst the most prudish, it is the word itself that has the power to offend not the meaning.

  6. Senexx

    And thanks I’ve been trying to look up how shtako is spelled, I went more with shtucco as in stucco. It is still no gosa.

  7. Ken Oath

    Sean, as fuck/feck/frak and shit/shite or even piss/pee show, it is even okay to use a word which sounds almost identical. They are only rude words because we decide they are rude words. It’s therefore only partly the designation of taboo bodily functions and parts. I think its origin is a primate urge to express anger or angst in a gesture which, being a verbal species, expresses itself in our case in the language of the toilet and toilette (and the occasional finger or two when you can’t be heard). My guess is other primates also have “fuck you shit head” gestures. Perhaps that is what monkeys are doing, in a more literal way, when they fling feces at each other?

  8. James

    Sennex – I think if you google “shtako defiance” it is correct. And I apologise, it’s of course “frak” not “frack”, by gods. Which of course the anti-fracking movement have appropriated as the slogan “frack off”.

  9. Senexx

    James, indeed I did after I saw your post. You are correct. Other derivatives never paid off, not even with a tentative link to defiance.

    I’m a latecomer to Defiance but if someone had said to me it had Grant Bowler, Julie Benz, Rockne S O’Bannon I would’ve been there from day 1.

    Either way it’s said frak/frack – so have been dismayed with both sides of the fracking movement for reasons you clearly outline

  10. James

    Sean, question: which of the following is worse:
    (i) calling someone an “arsehole” or “cunt”;
    (ii) calling someone a “nigger” or a “faggot”.
    I suspect that, even in your temporary relaxation of rules of using offensive language, you would draw the line at the latter.

  11. Spinksy

    Interesting stuff – unlike the other three of your big four, ‘prick’ can be used outside of a profane context, eg to prick your finger, reflected in its refusal to flatline during the 1800s. Armed with this information, why did it fall and rise twice between 1800 and 2000? Something related to a resurgence in needlepoint?

  12. Senexx

    OT as well.

    I’m watching Continuum – think it has gone downhill though. Was a latecomer to Revolution as well but now I’ve watched it the season final was a doozy.

    Spinsky, nice question

  13. Stubborn Mule Post author

    Elsewhere she writes “I had surprisingly little problem in writing fuck over and over and over, but I balked at thinking about and discussing the n-word.” Despite this post, you are correct. I am similarly coy.

  14. Stubborn Mule Post author

    @Spinksy very interesting question. Making use of more advanced features of the Google Ngram Viewer you can see the usage of “prick” as a noun as a proportion of all uses of “prick”. In earlier times, it appears to have been used more as a verb than it is now. Also, earlier usage of prick in slang may have been broader. According to the online dictionary of etymology, “My prick was used 16c.-17c. as a term of endearment by “immodest maids” for their boyfriends”.

  15. James

    One problem I can see with the whole Ngram thing is that they don’t weight it by the number of copies sold or read? So the bible, koran or, indeed, Harry Potter, get the same weight as an unread Patrick White novel? If the frequency of use in published novels is a proxy for general usage I think it is a pretty poor indicator. It will be biased towards words used in academic/intellectual novels but which are rarely used in general speech.

    For example take the words “efficacy/efficacious” and “efficient/efficiency”. The latter currently has an ngram frequency of about 6 times the former (combining scores from both). They aren’t totally identical in meaning but are frequently used to mean the same thing. I use “efficacy” as the prime example of a word which, while occasionally seen in print, is almost never used in conversation (except by me, purely in order to be perverse).

  16. Stubborn Mule Post author

    James you make good points here, although there is a partial remedy. Google provides multiple corpuses (more will be revealed in a follow-up post for the more technically inclined) including one restricting to works of fiction only.

  17. Ken

    One interesting detail is that the use of “fuck” prior to 1800 is actually due to the frequent typesetting of “s” as “f”, so these are actually “suck”. This is something I would have expected Google to fix, as there must be a large number of problems due to this and similar typesetting issues.

    Out of interest, the phrase “anal sex” is almost unknown before 1970. By this stage I’m wondering how many people can’t read your blog due to internet filters.

  18. Stubborn Mule Post author

    @Ken interesting s/f insight! Also, I have been told by a number of email subscribers that the post didn’t make it through their email filters.

  19. Pingback: ngramr – an R package for Google Ngrams

  20. Pingback: Profanity, Obscenity, and Vulgarity | Be Rational, Get Real
