In November 2007, a family from Ohio was blessed with the birth of their third child, Kayla, who arrived on October 2nd. Kayla shares a birthday with her two older siblings, who were also born on October 2nd, in 2003 and 2006 respectively. Bill Notz, a statistics professor at Ohio State University, concluded that the odds of a family having three children born on the same date in different years are less than 8 in a million. That said, the Paez family from San Diego, California, also had their third child born on July 30th of this year. Are these events extraordinary, or just plain ordinary?
For some time I have been wanting to start a series of posts on Weapons of Math Instruction for IT Risk. The name comes from a joke I saw a few years ago about the evil Al-gebra group. I will be aiming for less math and more instruction, since the text format of the blog is somewhat formula-challenged. The first post will be on the Birthday Paradox, a perennial favourite, recently recounted to me by a senior manager. There was also a recent furore over the reported odds of DNA matching, which at heart was a misinterpretation of the Birthday Paradox.
The Birthday Paradox (BP) poses the following simple question: how many people need to be gathered in a room so that the chance of any two people sharing a birthday is at least 50%? We assume that each person is equally likely to be born on each of the 365 days of the year, and we ignore leap years. Given these assumptions, the surprising answer is 23. The BP is not a true paradox (a logical contradiction) but is rather deeply counterintuitive, since 23 seems too small a number to produce a common birthday. Keep this in mind next time you are watching a football (soccer!) game, which has 23 people on the field (the two teams of 11 plus the referee). If you want to see the mathematics behind the result, Google leaves you spoilt for choice with over 53,000 hits. As is often the case with mathematical topics, the Wikipedia article has an excellent explanation, graphics and references.
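For readers who would rather check the answer than take it on trust, here is a minimal Python sketch of the calculation under the same assumptions (365 equally likely birthdays, no leap years). The function names are my own, chosen for illustration. The idea is to compute the chance that everyone's birthday is different and subtract it from 1.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday,
    assuming each of `days` birthdays is equally likely."""
    p_all_distinct = 1.0
    for k in range(people):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group(target: float = 0.5, days: int = 365) -> int:
    """Smallest group size whose shared-birthday chance reaches `target`."""
    people = 1
    while shared_birthday_probability(people, days) < target:
        people += 1
    return people


if __name__ == "__main__":
    print(smallest_group())                   # group size for a 50% chance
    print(shared_birthday_probability(22))    # just below 0.5
    print(shared_birthday_probability(23))    # just above 0.5
```

With 22 people the chance is still a little under 50%, and the 23rd person tips it over.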
The BP is an archetypal problem in probability whose solution is less important than the principles it demonstrates. What is the lesson? If we are trying to obtain a match on an attribute with M distinct values, then just under sqrt(2*M) uniform samples are required for better than 50% success. For birthdays, M = 365 and sqrt(2*365) = 27.0, a little bit higher than the correct value of 23. To generalise, if you were looking for a match amongst 1,000,000 attribute values then you would need fewer than 1500 uniform samples, probably lower than you would have estimated or guessed. Where does the sqrt(2*M) term come from?
Well, if we have N objects to compare for a potential match, then the number of possible comparisons is N*(N-1)/2. This number can be approximated as (N^2)/2, which grows in proportion to the square of N, and is therefore called a quadratic function. The source of the surprise in the BP is that we are not thinking quadratically, and therefore underestimate the number of potential matches. When N = sqrt(2*M), then (N^2)/2 yields M possible matches. If each comparison has a probability of 1/M of being a match, then the expected number of matches over these M comparisons is M*(1/M) = 1. A closer analysis shows that there is a 50% chance of a match when N is approximately sqrt(2*M*ln2), that is, about 1.18*sqrt(M), since sqrt(2*ln2) = 1.18. You can find a straightforward 1-page derivation of a slightly higher bound here.
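To see how the two square-root estimates compare with the exact answer, the following Python sketch may help. It is only an illustration under the uniform-sampling assumption used above, and the function names are hypothetical ones of my choosing.

```python
import math


def rule_of_thumb(m: int) -> float:
    """The quick estimate sqrt(2*M)."""
    return math.sqrt(2 * m)


def refined_estimate(m: int) -> float:
    """The closer estimate sqrt(2*M*ln2), about 1.18*sqrt(M)."""
    return math.sqrt(2 * m * math.log(2))


def exact_threshold(m: int, target: float = 0.5) -> int:
    """Smallest N for which N uniform samples from M values
    contain a match with probability at least `target`."""
    p_all_distinct = 1.0
    n = 0
    while 1.0 - p_all_distinct < target:
        # Adding sample n+1: it must avoid the n values already seen.
        p_all_distinct *= (m - n) / m
        n += 1
    return n


if __name__ == "__main__":
    for m in (365, 1_000_000):
        print(m, round(rule_of_thumb(m), 1),
              round(refined_estimate(m), 1), exact_threshold(m))
```

For M = 365 the quick estimate is about 27, the refined one about 22.5, and the exact threshold is 23. For M = 1,000,000 all three land well below 1500.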
Johnny's Fallacy with DNA matching
Another reason why the BP is puzzling is that people often mistake the question to be the following: how many people need to be gathered in a room such that the chance of someone sharing a birthday with me is at least 50%? In this case the person is confusing a 1-to-N matching problem (the person to the group) with the required N-to-N matching problem (the group to itself). Famed US talk show host Johnny Carson made this mistake (as related here, p.79), which we will refer to as Johnny's Fallacy. One night there were about 100 people in his studio audience, and he started to search for someone with the same birthday as him. He was disappointed when no one with his birthday turned up. Among 100 people the probability that someone would have the same birthday as Johnny is just under 1/4, while the probability that some pair of people share a birthday exceeds 0.9999, essentially a certainty.
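The gap between the two questions is easy to check numerically. The Python sketch below, with function names that are purely my own, computes both probabilities for an audience of 100 under the usual assumption of 365 equally likely birthdays, and also how large an audience Johnny would have needed for an even chance of finding his own birthday.

```python
def one_to_n(people: int, days: int = 365) -> float:
    """Chance that at least one of `people` matches one fixed birthday
    (the 1-to-N problem)."""
    return 1.0 - ((days - 1) / days) ** people


def n_to_n(people: int, days: int = 365) -> float:
    """Chance that some pair among `people` shares a birthday
    (the N-to-N problem)."""
    p_all_distinct = 1.0
    for k in range(people):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def audience_needed(target: float = 0.5, days: int = 365) -> int:
    """Smallest audience giving a 1-to-N match chance of at least `target`."""
    people = 1
    while one_to_n(people, days) < target:
        people += 1
    return people


if __name__ == "__main__":
    print(one_to_n(100))      # just under 1/4
    print(n_to_n(100))        # very close to 1
    print(audience_needed())  # far more than 23
```

The 1-to-N audience needed for an even chance runs into the hundreds, which is much closer to what most people expect the answer to the BP to be.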
Recently there was an occurrence of Johnny's Fallacy on the topic of identifying suspects using DNA testing. The episode was reported in the Freakonomics blog under the title "Are the FBI's Probabilities About DNA Matches Crazy?", reporting on a piece from the Los Angeles Times, "How reliable is DNA in identifying suspects?". In 2001, Kathryn Troyer, a crime lab analyst in Arizona, was running some tests on the state's DNA database and found two felons with remarkably similar genetic profiles. Remarkable in the sense that the men matched at nine of the 13 locations on chromosomes (or loci) commonly used to distinguish people, and that the FBI estimated the odds of such a match to be 1 in 113 billion. This is the 1-to-N probability of a DNA match. Since her initial discovery, Troyer has found that among about 65,000 felons there were 122 pairs that matched at nine of 13 loci. The matches here are the result of the N-to-N probabilities. Johnny's Fallacy here would be to conclude that a search of the Arizona DNA database against a given DNA sample would return 122 matches.
Gradually Troyer's findings spread and raised doubts concerning the veracity of DNA testing. The FBI has declared that the Troyer searches (sounds like something from the X Files) are misleading and meaningless, which is not a particularly helpful assessment of Johnny's Fallacy. David Kaye, an expert on science and the law at Arizona State University, remarked that since people's lives are at stake based on DNA evidence, “It [the Troyer matches] has got to be explained.” Steven Levitt in the Freakonomics blog steps up to the plate to provide some numbers. He assumes that the likelihood of a match at a given locus is 7.5%, yielding odds of about 1 in 400 trillion for a 13-loci match, and about 1 in 13 billion for a 9-loci match. A DNA database of 65,000 people yields over 2 billion potential matching pairs, and so about 100 expected matches on at least 9 loci.
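Levitt's arithmetic can be roughly reproduced in a few lines of Python. The sketch below is my own reconstruction, not his actual calculation: it assumes that loci match independently with the same 7.5% probability, and it uses a binomial model for "at least 9 of 13" that is my assumption about how the expected count is obtained. The names are hypothetical.

```python
from math import comb

P_LOCUS = 0.075      # Levitt's illustrative per-locus match probability
LOCI = 13            # loci commonly compared
DATABASE = 65_000    # approximate number of profiles in the database


def odds_all_match(k: int, p: float = P_LOCUS) -> float:
    """'1 in X' odds that k specified loci all match (assumes independence)."""
    return 1.0 / p ** k


def p_at_least(k: int, loci: int = LOCI, p: float = P_LOCUS) -> float:
    """Binomial chance that a pair matches at k or more of `loci` loci.
    This model is an assumption made for illustration."""
    return sum(comb(loci, j) * p ** j * (1 - p) ** (loci - j)
               for j in range(k, loci + 1))


def pair_count(n: int) -> int:
    """Number of distinct pairs among n profiles: N*(N-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    print(f"13 loci: 1 in {odds_all_match(13):.3g}")
    print(f" 9 loci: 1 in {odds_all_match(9):.3g}")
    pairs = pair_count(DATABASE)
    print(f"pairs to compare: {pairs:,}")
    print(f"expected pairs matching at 9+ loci: {pairs * p_at_least(9):.0f}")
```

Under these simplifying assumptions the expected number of matching pairs comes out in the same ballpark as Levitt's figure and as the 122 pairs Troyer found, even though any single 1-to-N comparison remains extremely unlikely to match.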
Levitt's numbers are illustrative, meant to explain the fallacy, which is really a version of the BP in which matching on a single personal trait has been replaced by matching on a number of loci. If you Google "Kathryn Troyer" there are over 1,000 hits, and her results have generated a storm of controversy. Reading a few of the articles returned by Google shows that it will be some time before people fully understand the apparent paradox in this case.
As I mentioned in the introduction to this post, a senior manager recently told me that he used the BP as an example to get people thinking quantitatively in risk assessment workshops. Numbers come with their own logic - calculations can be surprising.