Benford’s Law

16 April 2012

Here is a quick quiz. If you visit the Wikipedia page List of countries by GDP, you will find three lists ranking the countries of the world in terms of their Gross Domestic Product (GDP), each list corresponding to a different source of the data. If you pick the list according to the CIA (let’s face it, the CIA just sounds more exciting than the IMF or the World Bank), you should have a list of figures (denominated in US dollars) for 216 countries. Ignore the fact that the European Union is in the list along with the individual countries, and think about the first digit of each of the GDP values. What proportion of the data points start with 1? How about 2? Or 3 through to 9?

If you think they would all be about the same, you have not come across Benford’s Law. In fact, far more of the national GDP figures start with 1 than any other digit and fewer start with 9 than any other digit. The columns in the chart below show the distribution of the leading digits (I will explain the dots and bars in a moment).

Distribution of leading digits of GDP for 216 countries (in US$)

This phenomenon is not unique to GDP. Indeed a 1937 paper described a similar pattern of leading digit frequencies across a baffling array of measurements, including areas of rivers, street addresses of “American men of Science” and numbers appearing in front-page newspaper stories. The paper was titled “The Law of Anomalous Numbers” and was written by Frank Benford, who thereby gave his name to the phenomenon.

Benford’s Law of Anomalous Numbers states that for many datasets, the proportion of data points with leading digit n will be approximated by

log10(n+1) – log10(n).

So, around 30.1% of the data should start with a 1, while only around 4.6% should start with a 9. The horizontal lines in the chart above show these theoretical proportions. It would appear that the GDP data features more leading 2s and fewer leading 3s than Benford’s Law would predict, but it is a relatively small sample of data, so some variation from the theoretical distribution should be expected.
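The formula is simple enough to check directly. Here is a minimal sketch in Python (the analysis in this post was done in R, so this is just an illustration of the arithmetic):

```python
import math

# Benford proportion for leading digit n: log10(n + 1) - log10(n)
def benford_p(n):
    return math.log10(n + 1) - math.log10(n)

# proportions for each possible leading digit 1-9
proportions = {d: benford_p(d) for d in range(1, 10)}
```

The nine proportions sum to one, since the logs telescope to log10(10) − log10(1) = 1.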

As a variation of the usual tests of Benford’s Law, I thought I would choose a rather modern data set to test it on: Twitter follower numbers. Fortunately, there is an R package perfectly suited to this task: twitteR. With twitteR installed, I looked at all of the twitter users who follow @stubbornmule and recorded how many users follow each of them. With only a relatively small follower base, this gave me a set of 342 data points which follows Benford’s Law remarkably well.
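Tallying the leading digits is the easy part of the exercise. The post used the twitteR package in R to fetch the counts; as a sketch of the tallying step, assuming the follower counts are already in a list (the numbers below are illustrative only, not real data), it might look like this in Python:

```python
from collections import Counter

def leading_digit(n):
    # first character of the decimal representation of a positive integer
    return int(str(n)[0])

follower_counts = [342, 57, 1203, 18, 905]  # illustrative numbers only
digit_freq = Counter(leading_digit(c) for c in follower_counts if c > 0)
```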


Distribution of leading digits of follower counts

As a measure of how well the data follows Benford’s Law, I have adopted the approach described by Rachel Fewster in her excellent paper A Simple Explanation of Benford’s Law. For the statistically-minded, this involves defining a chi-squared statistic which measures the “badness” of the Benford fit. This statistic provides a “p value”, which you can think of as the probability that Benford’s Law could produce a distribution that looks like your data set. The p value for the @stubbornmule follower counts is a very high 0.97, which indicates a very good fit to the law. By way of contrast, if those 342 data points had a uniform distribution of leading digits, the p value would be less than 10^-15, which would be a convincing violation of Benford’s Law.
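For the curious, the test can be sketched in a few lines. This is not Fewster’s R code, just an illustration of the idea: a chi-squared goodness-of-fit statistic against the Benford proportions, turned into a p value using the closed-form survival function that exists for an even number of degrees of freedom (eight here, for nine digit categories):

```python
import math

def benford_chisq_p(counts):
    """p value for a chi-squared test of leading-digit counts against
    the Benford proportions. counts maps digits 1-9 to observed
    frequencies. Illustrative sketch only."""
    n = sum(counts.values())
    chisq = sum(
        (counts.get(d, 0) - n * math.log10((d + 1) / d)) ** 2
        / (n * math.log10((d + 1) / d))
        for d in range(1, 10)
    )
    # survival function of the chi-squared distribution with 8 degrees
    # of freedom: exp(-x/2) * sum_{i=0}^{3} (x/2)^i / i!
    h = chisq / 2.0
    return math.exp(-h) * sum(h ** i / math.factorial(i) for i in range(4))

# a perfectly uniform spread of leading digits over 342 points
uniform = {d: 38 for d in range(1, 10)}
```

Feeding in the uniform counts gives a vanishingly small p value, consistent with the figure quoted above.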

Since so many data sets do follow Benford’s Law, this kind of statistical analysis has been used to detect fraud. If you were a budding Enron-style accountant set on falsifying your company’s accounts, you may not be aware of Benford’s Law. As a result, you may end up inventing too many figures starting with 9 and not enough starting with 1. Exactly this style of analysis is described in the 2004 paper The Effective Use of Benford’s Law to Assist in Detecting Fraud in Accounting Data by Durtschi, Hillison and Pacini.

By this point, you are probably asking one question: why does it work? It is an excellent question, and a surprisingly difficult and somewhat controversial one. At current count, an online bibliography of research on the subject lists 657 papers on Benford’s Law. For me, the best explanation is Fewster’s “simple explanation”, which is based on her “Law of the Stripey Hat”. However simple it may be, it warrants a blog post of its own, so I will be keeping you in suspense a little longer. In the process, I will also explain some circumstances in which you should not expect Benford’s Law to hold (as an example, think about phone numbers in a telephone book).

In the meantime, having gone to the trouble of adapting Fewster’s R Code to produce charts testing how closely twitter follower counts fit Benford’s Law, I feel I should share a few more examples. My personal twitter account, @seancarmody, has more followers than @stubbornmule and the pattern of leading digits in my followers’ follower counts also provides a good illustration of Benford’s Law.

One of my twitter friends, @stilgherrian, has even more followers than I do and so provides an even larger data set.

Even though the bars seem to follow the Benford pattern quite well here, the p value is a rather low 5.5%. This reflects the fact that the larger the sample, the closer the fit should be to the theoretical frequencies if the data set really follows Benford’s Law. This result appears to be largely due to more leading 1s than expected and fewer leading 2s. To get a better idea of what is happening to the follower counts of stilgherrian’s followers, below is a density* histogram of the follower counts on a log10 scale.

There are a few things we can glean from this chart. First, the spike at zero represents accounts with only a single follower, accounting for around 1% of stilgherrian’s followers (since we are working on a log scale, the followers with no followers of their own do not appear on the chart at all). Most of the data is in the range 2 (accounts with 100 followers) to 3 (accounts with 1,000 followers). Between 3 and 4 (up to 10,000 followers), the distribution falls off rapidly. This suggests that the deviation from Benford’s Law is due to a fair number of users with a follower count in the 1000-1999 range (I am one of those myself), but a shortage in the 2000-2999 range. Beyond that, the number of data points becomes too small to have much of an effect.
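The reason a log10-scale histogram is the natural picture here is that the leading digit of a value depends only on the fractional part of its log10: fractional parts in [0, log10 2) correspond to a leading 1, [log10 2, log10 3) to a leading 2, and so on. A small illustrative sketch:

```python
import math

# The leading digit of x > 0 is determined by the fractional part of
# log10(x): for example, values from 1000 to 1999 have log10 between
# 3 and about 3.301, so their fractional parts all lie in [0, log10 2)
def leading_digit_via_log(x):
    frac = math.log10(x) % 1
    return int(10 ** frac)
```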

Histogram of follower counts of @stilgherrian’s followers

Of course, the point of this analysis is not to suggest that there is anything particularly meaningful about the follower counts of twitter users, but to highlight the fact that even the most peculiar of data sets found “in nature” is likely to yield to the power of Benford’s Law.

* A density histogram scales the vertical axis so that the total area of the histogram is one, rather than showing the raw frequency of occurrences in each bin.



1 Ramanan April 16, 2012 at 10:28 pm

Good post!

So the next time I commit a fraud, I will try to verify whether my cooked-up numbers follow Benford’s Law!

From Wikipedia:

“Benford’s law has been invoked as evidence of fraud in the 2009 Iranian elections. However, other experts consider Benford’s law essentially useless as a statistical indicator of election fraud in general. Similarly, the macroeconomic data the Greek government reported to the European Union before entering the Euro Zone was shown to be probably fraudulent using Benford’s law, albeit years after the country joined.”

2 Stubborn Mule April 16, 2012 at 10:36 pm

@Ramanan: I suppose that the more informed a fraudster is, the more effective their frauds. There may well be a lot of excellent undetected frauds going on out there!

As for the use of Benford’s Law in detecting election fraud, I am skeptical. I read an article using Benford’s Law to analyse the recent Russian election. However, a couple of problems were identified. One, the range of data values is probably too narrow for Benford’s Law to apply and, two, more importantly, the mechanism of possible fraud is probably not simply making up numbers:

bear in mind that all these tests are only to check if people have made up numbers after the votes have been counted, and won’t detect people stuffing dodgy voting slips into ballot boxes

3 Danny Yee April 17, 2012 at 12:08 am

Your twitter data have been fabricated. You actually only have a few dozen followers, the rest are artificial creations of twitter designed to make the company seem more important than it actually is…

4 Danny Yee April 17, 2012 at 12:10 am

P.S. How did you get so many followers? I’m stuck at about 200 and – more surprisingly to me – the twitter feed for my book reviews, @DannyReviews – has barely 80 followers.

5 Tyler Rinker April 17, 2012 at 3:12 am

Thanks for the informative post and simple visual display of the theory. I definitely learned something today.

6 Ramanan April 17, 2012 at 4:20 am

SM,

Good point from the Bad Science Blog. Also, I guess figuring out how to hide frauds makes committing frauds attractive :).

Btw, completely unrelated to this post: the 13th of the month is more likely to fall on a Friday than on any other day.

7 Stilgherrian April 17, 2012 at 7:51 am

I’m pondering that spike of followers I have with just one follower each. It doesn’t seem like the sort of behaviour a human would experience. Maybe it’s something to do with spam?

8 Stubborn Mule April 17, 2012 at 8:47 am

@Stilgherrian I think that there’s a bit of a mixture of cases in the single follower category. Certainly some look like spammers (one in your list I just checked has been removed from twitter sometime since I ran the code), but others look legitimate, if a little lonely. Interestingly, there seem to be a lot of Indians in the list. But you can see for yourself.

9 Stubborn Mule April 17, 2012 at 8:52 am

@Danny of course, not all fabrications would fail Benford’s Law. The canny accounting fraudster could make up numbers by randomly selecting facts from Wikipedia (e.g. pick a river and look up its length in kilometres, pick a country and look up its GDP or population) and use the figures as bogus financial data. This would avoid the human tendency to spread leading digits too evenly.

As for twitter followers, I think that the main reason I have so many is that I have been on twitter for over 5 years. In fact, on that basis, I should have far more! While I don’t always follow my own advice, the best approach to getting more followers is to:

• engage (constructively) in conversation with a lot of users
• post interesting links (getting retweeted helps)
• post interesting tweets
• be very active on twitter

10 Zebra April 17, 2012 at 1:03 pm

Based on the stripey hat hypothesis I would not be surprised that the distribution looks okay but fails the significance test where the null hypothesis is that the underlying distribution is Benford’s Law. It will be close but as the distribution will invariably have a tail at each end this will ensure statistically significant, though small, departures from the strict Benford’s Law distribution. In fact based on the skewness of the distribution it might be possible to determine if Benford’s Law over or under-estimates the frequency of 1s. My guess is if the true distribution is skewed to the left then Benford’s Law will underestimate and if to the right it will overestimate. The stilgherrian distribution looks like it is skewed to the left and indeed Benford’s Law slightly underestimates the frequency of 1s.

11 Stubborn Mule April 17, 2012 at 3:30 pm

@Zebra interestingly the high 1s low 2s looks like a broad-based phenomenon, not specific to @stilgherrian.

12 Magpie April 18, 2012 at 6:37 pm

@Stubby,

Interesting post. But this:

“The canny accounting fraudster could make up numbers by randomly selecting facts from Wikipedia (e.g. pick a river and look up its length in kilometres, pick a country and look up its GDP or population) and use the figures as bogus financial data. This would avoid the human tendency to spread leading digits too evenly.”

Couldn’t one just build a random number generator? The first digit would need to be generated by a Benford distribution. Perhaps the remaining digits could be generated by a uniform distribution?

13 Stubborn Mule April 18, 2012 at 7:40 pm

@Magpie: there are generalisations of Benford’s Law for the 2nd, 3rd, 4th digit etc. So, you could use, say, a Poisson random number generator to determine how many digits your number should have and then use the various Benford distributions to generate each digit. By that point, though, it starts to seem like quite a bit of work! The beauty of Benford is that it occurs so widely in “nature” (which I interpret broadly to include twitter accounts!) and so you can sample from “real” distributions rather than using mathematical generators.
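For what it’s worth, there is also a neat shortcut for the leading digit itself: if U is uniform on [0, 1), then the integer part of 10**U lands on digit d with probability log10((d+1)/d), i.e. exactly the Benford proportions. A sketch (Python rather than R, illustrative only):

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# int(10 ** U) for U uniform on [0, 1) gives digit d with
# probability log10((d + 1) / d), the Benford proportions
def benford_digit():
    return int(10 ** rng.random())

counts = Counter(benford_digit() for _ in range(100_000))
```

Around 30% of the sampled digits come out as 1, matching the Benford proportion for a leading 1.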

14 Ramanan April 18, 2012 at 7:51 pm

SM,

Interesting thing:

Imagine we redefine units (by a factor of 2) so that numbers with 1 as the leading digit start with 2. Does the result change?

No!

Because numbers with leading digits of 5, 6, 7, 8 and 9 will now have a leading digit of 1 after this rescaling.

It can be verified. The probabilities for 5 to 9 add up to:

P(5) + P(6) + P(7) + P(8) + P(9) =
0.079 + 0.067 + 0.058 + 0.051 + 0.046 = 0.301

which is exactly P(1) !
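The sum telescopes exactly, as a quick numerical check confirms:

```python
import math

# P(d) = log10(d + 1) - log10(d); summing d = 5..9 telescopes to
# log10(10) - log10(5) = log10(2), which is exactly P(1)
p = lambda d: math.log10(d + 1) - math.log10(d)
total = sum(p(d) for d in range(5, 10))
```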

15 Stubborn Mule April 18, 2012 at 7:58 pm

@Ramanan that is in fact a very important observation. A lot has been written about this fact, namely that Benford’s Law is scale invariant (i.e. still holds when you scale the data, change units, say from miles to kilometres, etc.). In fact, the reverse is also true. Theodore Hill published a paper in 1995 entitled Base Invariance Implies Benford’s Law which shows that any distribution that is scale invariant must satisfy Benford’s Law for the leading digit.

16 Magpie April 18, 2012 at 7:59 pm

“there are generalisations of Benford’s Law for the 2nd, 3rd, 4th digit etc”

So, you mean these digits are not distributed independently from the preceding one? Say the second is dependent upon the first, the third upon the second and so on?

17 Stubborn Mule April 18, 2012 at 8:15 pm

@Magpie: yes indeed, that is the case, so you’d have to generate the digits in the right order. If your first digit is n then the probability that the second digit is m is

(log(10n + m + 1) – log(10n + m)) / (log(n + 1) – log(n)).

For example, if your first digit is 5, the probability of having a second digit 2 is

(log 53 – log 52) / (log 6 – log 5).

Gets worse for 3rd digit, and so on! But, the further out you go, the flatter the distribution gets.
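The conditional distribution above is easy to evaluate; as a sketch in Python (the function name is mine, not from any library):

```python
import math

# P(second digit = m | first digit = n) for the generalised Benford
# distribution quoted above
def second_digit_p(n, m):
    num = math.log10(10 * n + m + 1) - math.log10(10 * n + m)
    den = math.log10(n + 1) - math.log10(n)
    return num / den
```

For a first digit of 5, the probability of a second digit of 2 comes out at a little over 10%, and the ten conditional probabilities for any first digit sum to one because the numerators telescope.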

18 Magpie April 18, 2012 at 9:37 pm

I’ll be damned…

19 Stubborn Mule April 18, 2012 at 9:46 pm

@Magpie: it does all take a bit of getting used to!

20 Ramanan April 19, 2012 at 7:17 am

Stubby @April 18, 2012 at 7:58 pm,

Yes, scale invariance very important to the whole discussion.

I realized that instead of saying 0.079 + 0.067 + … = 0.301, I could better say it as

P(5) + P(6) + … + P(9)
= log (6/5) + log (7/6) + … + log (10/9)
= log (6/5 * 7/6 * … * 10/9)
= log (10/5) = log (2)
= 0.3010
= P(1) !

(all logs to the base 10).

21 dan April 20, 2012 at 1:27 pm

some weekend listening for stubbsters:

22 Magpie June 11, 2012 at 9:22 pm

Stubby,

Tim Harford (09-09-2011), Benford’s law and cooked Eurostatistics:

23 Stubborn Mule June 11, 2012 at 9:46 pm

Thanks for the link Magpie!
