How do numbers begin?
In many data series a surprising number of entries begin with the number 1, and the number 2 is also more common than a random distribution might suggest. This is called Benford's Law. For instance about one third of all house numbers start with one. That may be a quirk of bureaucratic numbering psychology, but the principle also applies to the Dow Jones index history, size of files stored on a PC, the length of the world’s rivers, and the numbers in newspapers’ front page headlines. It does not apply to lottery-winning numbers, see the graph at the above link. Here is an exact statement of the law:
Besides the number 1 consistently appearing about 1/3 of the time, number 2 appears with a frequency of 17.6%, number 3 at 12.5%, on down to number 9 at 4.6%. In mathematical terms, this logarithmic law is written as F(d) = log[1 + (1/d)], where F is the frequency and d is the digit in question.
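To check the quoted percentages, here is a minimal sketch that evaluates the formula for each leading digit. It assumes the log is base 10, which the quote does not say outright but which the quoted figures require. The function name benford_frequency is just an illustrative choice.

```python
import math

def benford_frequency(d: int) -> float:
    """Expected share of numbers whose leading digit is d, using a base-10 log."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Print the expected share for each possible leading digit.
for d in range(1, 10):
    print(d, f"{benford_frequency(d):.1%}")
```

The nine shares add up to exactly 1, because the terms telescope to log10(10).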
I feel as if someone is pulling my leg. And I keep thinking of nominal interest rates being bounded from below at zero. Yes this has practical implications:
...because a year’s accounting data of a company should fulfill the law, economists can detect falsified data, which is very hard to manipulate to follow the law. (Interestingly, scientists found that numbers 5 and 6, rather than 1, are the most prevalent, suggesting that forgers try to “hide” data in the middle.)
The law was first discovered by an economist (and astronomer), Simon Newcomb. Here is Wikipedia on the law. Here is more startling data on where the law applies. From a completely orthogonal but I suspect not totally irrelevant direction, here is Tim Harford on price stickiness.
This whole topic makes me feel like an idiot for even bringing it up, with apologies to Pythagoras.
Posted by Tyler Cowen on May 13, 2007 at 01:34 AM in Science | Permalink
Comments
Hey Tyler,
It is a really interesting phenomenon. Check out Walter Mebane's work on using Benford's Law to detect fraud in elections.
Charlie
Posted by: Charlie at May 13, 2007 1:30:28 AM
You can also use Benford's Law to detect problems in survey data:
http://www.aae.wisc.edu/schechter/benford.pdf
Posted by: TimB at May 13, 2007 1:41:59 AM
Sadly though, my country's lottery numbers do not occur according to Benford's Law.
I checked through a book which claimed to contain the jackpot numbers for the last 30 years, and sadly, it looks like a random number table.
Ah, I'd have been so psyched if it did follow Benford's Law.
Posted by: Chewxy at May 13, 2007 2:25:37 AM
Ach. You mentioned lotteries already XD
Posted by: Chewxy at May 13, 2007 2:26:59 AM
I rediscovered this after reading Mario Livio's book about PHI, where it is nestled away in an appendix if I recall correctly. Amazing how much of everything in the world seems to be governed by PI, PHI, e, or some other transcendental number.
Posted by: Zach B. at May 13, 2007 2:54:56 AM
Um, Zach B., Phi isn't transcendental. Irrational, yes, but it's an algebraic number.
Posted by: Grant Gould at May 13, 2007 7:10:29 AM
Wow:
“In 1961, Pinkham discovered the first general relevant result, demonstrating that Benford’s law is scale invariant and is also the only law referring to digits which can have this scale invariance,” the scientists wrote in their letter. “That is to say, as the length of the rivers of the world in kilometers fulfill Benford’s law, it is certain that these same data expressed in miles, light years, microns or in any other length units will also fulfill it.”
Should I start believing in God?
Posted by: Phoebe at May 13, 2007 8:55:13 AM
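The scale-invariance claim quoted by Phoebe can be tried out numerically. The sketch below uses made-up data, not real river lengths: it assumes values spread evenly on a log scale over six orders of magnitude, which is one idealized way to get Benford-distributed first digits. The helper names and the sample size are illustrative choices.

```python
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

def digit_shares(values: list[float]) -> list[float]:
    """Share of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic stand-in for river lengths: evenly spread on a log scale (an assumption).
lengths_km = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
lengths_miles = [km * 0.621371 for km in lengths_km]  # change of units
lengths_scaled = [km * 7.3 for km in lengths_km]      # arbitrary rescaling

for name, data in [("km", lengths_km), ("miles", lengths_miles), ("x7.3", lengths_scaled)]:
    print(name, [f"{share:.3f}" for share in digit_shares(data)])
```

All three rows should come out close to the same logarithmic shares, since multiplying by a constant only shifts the values along the log scale.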
A friend of mine once claimed that touristy stores selling ceramic digits (perhaps to put on the front of your house) tend to be out of 1's.
Posted by: fmb at May 13, 2007 9:34:17 AM
Also, I think this is closely related to the problem about envelopes with money in them where one has half the money of the other. Pick one, say it has $x -- should you switch?
Posted by: fmb at May 13, 2007 9:35:56 AM
"That may be a quirk of bureaucratic numbering psychology"
Or, in the case of house numbering, it may be the result of the convention of numbering homes in order combined with the average street being far more likely to contain between 10 and 20 homes than between 50 and 100 homes. In such a case, almost every street, court, etc. is likely to have homes numbered in the '10s', most also in the '20s', fewer in the '30s', etc.
Benford's law is surprising in many other contexts, such as lengths of rivers, but not in something that is the result of ordinal numbering, such as addresses.
Posted by: M. Hodak at May 13, 2007 10:13:03 AM
It's actually enough if the street is equally likely to contain 10-20 homes and 50-100 homes. Then, the probability of the first digit being 1 (10-20 homes) is equal to the sum of probabilities of the first digit being 5, 6, 7, 8, 9 (50-100 homes) and larger than any of those probabilities alone.
This explains the Dow Jones index case very well. Over long periods of time, an index growing at a steady percentage rate should spend the same amount of time between 1000 and 1999 (first digit 1) as between 2000 and 3999 (first digit 2 or 3) or between 5000 and 9999 (first digit 5, 6, 7, 8 or 9), because each of those ranges is one doubling. As a result, the first digit 1 is much more likely.
Posted by: Adrian at May 13, 2007 10:35:30 AM
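Adrian's Dow Jones argument can be sketched as a small simulation. It assumes an index that grows by a constant percentage every day, which the real Dow certainly does not. The starting value, end value and growth rate are arbitrary illustrative numbers.

```python
def days_by_leading_digit(start: float = 1000.0,
                          end: float = 1_000_000.0,
                          daily_growth: float = 0.0003) -> dict[int, int]:
    """Count the days a steadily growing index spends on each leading digit."""
    counts = {d: 0 for d in range(1, 10)}
    value = start
    while value < end:
        first_digit = int(f"{value:.15e}"[0])  # first significant digit
        counts[first_digit] += 1
        value *= 1 + daily_growth              # constant percentage growth (assumption)
    return counts

counts = days_by_leading_digit()
total_days = sum(counts.values())
for d in range(1, 10):
    print(d, f"{counts[d] / total_days:.1%}")
```

Under this assumption the days with first digit 1 roughly match the days with first digit 2 or 3 combined, and also the days with first digit 5 through 9 combined.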
Does the length of rivers follow this rule whether measured in miles, km, or feet?
Has anyone checked this formula?
If F(d)=log(1+1/d) then F(1)=0.69, not 1/3, and the sum for d=1,...,9 is ~2.3. Also, F(0)=log(\infty)=\infty.
Posted by: o at May 13, 2007 1:25:34 PM
Grant Gould: Yes I realized that this morning, and said a big "D'oh" for me. This is the problem with typing anything or in fact trying to think at 3AM with half a bottle of grape juice in you that seems to be turning to vinegar. Or I suppose I could have argued that the square root of five is transcendental in some other logic system, the derivation of which is left as an exercise for the reader.
But I have often wondered about the prevalence of certain types of numbers. I know rationally it doesn't really mean anything, and asking questions like "Why is the fine structure constant so close to 1/137?" will yield no reason. There are very good combinatorial reasons for Benford's law, but if there weren't that sense of wonder, I doubt Tyler would have bothered with the post in the first place.
Phoebe: Only if you already did believe, in which case all such phenomena would generally profit from being ascribed to some extrasensory force, rather than chance or mathematics, which would tend to make one not believe in such things.
Posted by: Wreck of the Zach B. at May 13, 2007 1:36:01 PM
o, obviously you divide .69 by 2.3 to get "about a third," and leading digits by definition are non-zero.
Posted by: Barbar at May 13, 2007 2:23:13 PM
o: F(1) = log(2) = 0.301.
F(1) /= ln(2) = 0.69.
Base 10 logs, not natural logs.
Posted by: bartman at May 13, 2007 3:31:44 PM
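Barbar and bartman are both right, and a few lines of arithmetic show why. With natural logs the nine values sum to ln(10), about 2.3, so dividing each by that sum gives exactly the base-10 values. The variable names below are illustrative.

```python
import math

# F(d) computed with natural logs, as in o's comment.
natural = [math.log(1 + 1 / d) for d in range(1, 10)]
total = sum(natural)                       # telescopes to ln(10), about 2.303
normalized = [v / total for v in natural]  # Barbar's "divide .69 by 2.3"

# F(d) computed with base-10 logs, as in bartman's comment.
base10 = [math.log10(1 + 1 / d) for d in range(1, 10)]

print(f"ln(2) = {natural[0]:.3f}, sum = {total:.3f}")
print(f"normalized F(1) = {normalized[0]:.3f}, log10(2) = {base10[0]:.3f}")
```

Changing the base of a logarithm only rescales it by a constant, which is why the normalization and the base-10 reading agree.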
Wikipedia mentions the famous observation that, from looking at the edges of the pages, one concludes that the first pages of tables of logarithms get much more use than the back pages. Since the obvious explanation, that the first pages have some pornographic content, is apparently not true, one must seek other explanations.
Posted by: MattF at May 13, 2007 5:12:37 PM
Phoebe,
I believe in God, but I don't think that the physical applications of Benford's law support His existence.
The end conclusion from Benford's law merely says that nearly all sets of numbers follow a logarithmic distribution. According to the articles posted, it has been rigorously proven that the "mixing" of many different distributions will approach a logarithmic distribution. Therefore, a logarithmic distribution of, say, the lengths of rivers only says that the lengths of rivers depend on many different variables which, in turn, have many different types of distributions.
As far as I can see, Benford's law only has application in detecting the application of non-logarithmic distribution on data which should depend on many different variables with different distributions. For example, numbers in a company's annual report depend on many different factors and data with a uniform distribution indicates unnatural fraud.
Posted by: Matthew at May 13, 2007 10:23:03 PM
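The kind of check Matthew describes can be sketched with a plain chi-square statistic on first digits. This is only an illustration, not how any auditor or election analyst actually works. The data sets are synthetic, the function name is made up, and 15.507 is the standard 5% critical value for 8 degrees of freedom.

```python
import math
import random
from collections import Counter

CRITICAL_5_PERCENT = 15.507  # chi-square critical value, 8 degrees of freedom

def benford_chi_square(values: list[float]) -> float:
    """Pearson chi-square statistic of observed first digits against Benford's law."""
    digits = [int(f"{abs(v):.15e}"[0]) for v in values if v != 0]
    counts = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = len(digits) * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic

rng = random.Random(1)  # fixed seed so the run is repeatable
# Synthetic "natural" figures: evenly spread on a log scale (an assumption).
natural_like = [10 ** rng.uniform(2, 6) for _ in range(5000)]
# Synthetic "made-up" figures: every first digit equally likely.
uniform_like = [float(rng.randint(100, 999)) for _ in range(5000)]

print("natural-like:", round(benford_chi_square(natural_like), 1))
print("uniform-like:", round(benford_chi_square(uniform_like), 1))
print("5% critical value:", CRITICAL_5_PERCENT)
```

A statistic far above the critical value suggests the first digits do not follow the logarithmic pattern. It does not by itself say why.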
Please tell me someone has considered this explanation:
Benford's Law arises as an artifact of how, as numbers increase, they must *first* "go through 1". If the values are evenly distributed up to an arbitrary number, they will, on average, be heavily skewed toward lower digits because the remaining digits (before it must start over again) are not reached.
This would explain why it doesn't affect lotto numbers: they are defined to have a fixed number of entries, and the randomization occurs across each entry (which is evenly distributed in terms of digits) rather than for one number across a continuous part of the number line.
Posted by: Person at May 13, 2007 10:35:45 PM
This is because we write in base ten. It's really that simple.
Person's post is correct. I can't understand if you all really thought an economist found a previously unknown law that is a definition about how counting works.
Posted by: anonymous at May 14, 2007 12:30:14 AM
The law is intriguing when it applies to data that *doesn't* involve ordered sets. It is easy enough to understand for house numbers, but why should birth rates, area of countries and size of files stored on a PC first go through 1? And why should they follow the law even if you multiply all the values by 7.3? It is easy to state that something is obvious after you know it to be true, but this is not trivial at all.
Posted by: Phoebe at May 14, 2007 2:14:56 AM
From Wikipedia: "and similarly for [longer numbers] without leading zeros"
Just as I suspected - zeros would be in the lead, but for their being actively suppressed - no doubt at the behest of the digit that likes to think of itself as "the One."
Nothingness is everywhere!
Posted by: Notes from the existentialist underground at May 14, 2007 7:27:05 AM
Benford's law is base invariant--it applies just as well in base 5, 8, 12 or 60 as base ten. Of course, the base for your logarithm must be adjusted...
Posted by: Nathan Zook at May 14, 2007 8:47:52 AM
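Nathan Zook's point about adjusting the base of the logarithm can be written down directly. A minimal sketch, assuming the natural generalization F(d) = log_b(1 + 1/d) for leading digits d from 1 to b-1. The function name and the list of bases are illustrative.

```python
import math

def benford_frequency(d: int, base: int = 10) -> float:
    """Expected share of leading digit d when numbers are written in the given base."""
    if base < 2 or not 1 <= d < base:
        raise ValueError("need base >= 2 and 1 <= d < base")
    return math.log(1 + 1 / d, base)

for base in (2, 5, 8, 10, 12, 60):
    shares = [benford_frequency(d, base) for d in range(1, base)]
    print(f"base {base}: F(1) = {shares[0]:.3f}, total = {sum(shares):.3f}")
```

In base 2 the formula gives F(1) = 1, since every binary number starts with 1, and in every base the shares add up to 1.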
Note that made-up numbers are going to fail to follow Benford's law, but that numbers created by a crooked process that follows the same sort of pattern as a real process, as far as scaling and such, will follow it just fine. That is, if you just make up the number of votes for each candidate in your county, you can be caught, but not (by Benford's law, anyway) if you just randomly switch every tenth Gore vote to be a Bush vote instead.
Posted by: albatross at May 14, 2007 10:57:16 AM
Well, if you're going to bring that up, better be prepared for what you find...
http://www.rense.com/general5/fraud.htm
Posted by: Nathan Zook at May 14, 2007 1:09:06 PM
This is because we write in base ten. It's really that simple.
Note that, in base two, every number starts with 1. In base infinity, exactly one number starts with 1.
This effect pretty much mirrors the fact that the numbering system we use is itself logarithmic.
As suggested, it is very much like the two-envelope puzzle mentioned above. Every random number comes from a distribution. And every random distribution comes from a distribution of distributions. The net result is logarithmic.
Posted by: MikeP at May 14, 2007 1:27:29 PM
Matthew nails the explanation. Random numbers don't follow Benford's law because they have a uniform distribution. Most data sets will have a logarithmic distribution or something very close to that.
Whether Benford's law applies depends on how one is supposed to have gotten the numbers. Financial or natural (e.g. lengths of rivers, populations of species etc.) data tends to have a lot of logarithmic components (growth is logarithmic, for instance), and/or be the result of many different components with many different distributions, which with certain givens will always converge on a logarithmic distribution.
I wanted to see that proof for myself, so I just did a little fooling around and found the relevant paper by Ted Hill if anyone is interested.
Posted by: Michael Sullivan at May 14, 2007 4:10:33 PM
Thanks for the link, Michael Sullivan.
Posted by: Bernard Guerrero at May 14, 2007 4:45:38 PM
Oh great, a VB vs. C enumeration debate.
1-based indexing is just stupid, and has caused a whole raft of bugs called off-by-one errors.
Granted some of that is due to confusion between the >= and > operators, but still.
Enumerated collections should be indexed at zero, and counted by 1. The confusion between these two principles is the fundamental issue.
One query asks for how many elements there are, and the other is using an offset to do set based math. The confusion between these two concepts has led to a lot of bugs.
Posted by: bago at May 15, 2007 7:13:12 AM
Anyone noticed Microsoft's assertions of patents violated show a preponderance of 5s and 6s?
Posted by: Lord at May 15, 2007 4:45:16 PM
Heh.
Related: Microsoft Patents Ones, Zeroes
Posted by: MikeP at May 15, 2007 6:10:45 PM
Lord: made me guffaw.
Posted by: fustercluck at May 15, 2007 8:15:47 PM
As a side note and only peripherally related to this topic, read the article in the New Yorker on the Piraha people in the Amazonian rain basin. They have 1, 2 and many. They literally cannot count above 2. There are a number of other language issues in the article, but common crows do better in counting tests than they do.
Posted by: Murphy at May 16, 2007 2:21:45 PM
Which is an excellent demonstration of the theory (very popular among mathematicians) that vocabulary limits conceptualization.
Posted by: Nathan Zook at May 16, 2007 10:31:37 PM
Bullhockey on the "vocabulary limits conceptualization." It's been shown that a Piraha tribesman can be asked, say, "Give me as many flowers as birds you shot yesterday," and the tribesman will come out with, say, six flowers, and it will equal the number of birds, even though there is no actual word "six" in the language. Saying that Piraha can't count doesn't mean a Piraha woman doesn't know how many children she has if she has more than two, either. Geesh.
Posted by: speedwell at May 20, 2007 12:42:21 AM
It is not vocabulary that the Piraha language lacks. It is recursion. The Piraha have vocabulary to express everything that is important to their culture. English has no word for a muddy river flowing toward the jungle because recursion in the English language makes such a word, describing something that is not important to the cultures of English speaking people, unnecessary.
Posted by: Eve at Jul 12, 2007 12:49:19 PM