Wednesday, November 19, 2008

On the DNS Birthday Probability

In a recent post I stated that a common reason the Birthday Paradox (BP) seems puzzling is that people often confuse a 1-to-N matching problem (matching their birthday to someone else in a group) with the required N-to-N matching problem (matching a birthday amongst any pair of people in a group, including themselves). I called this the Johnny Fallacy, after famed American TV host Johnny Carson, who mistook these two events. The post also looked at a recent variant of the Johnny Fallacy applied to DNA matching, and here we consider another variant applied to a cache poisoning attack on the Internet Domain Name System (DNS).

DNS is a fundamental Internet service that resolves symbolic names into IP addresses (for example, DNS tells us that www.google.com is currently at IP address 209.85.129.99). While people are quite good at remembering symbolic names, they are lousy at remembering IP addresses, and packets can't be routed using names like www.google.com. So DNS is the bridge between what people can remember and what Internet packets require.
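As a quick aside, here is a minimal Python sketch of such a lookup using the standard library resolver (the address it prints today will of course differ from the 2008 address quoted above):

```python
import socket

# Ask the operating system's configured (local) DNS server to resolve a name.
hostname = "www.google.com"
name, aliases, addresses = socket.gethostbyname_ex(hostname)

print(f"{hostname} resolves to {addresses}")
```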

When you attach your laptop to the Internet you are assigned a local (default) DNS server. As you enter new symbolic references to services, your laptop contacts your local DNS server for resolution, and your laptop stores the returned result in a local cache. If your local DNS server can't resolve your request then it passes it on to a "better informed" DNS server for resolution. A resolution that involves a chain of DNS servers is called recursive. To your laptop the final resolution still looks like it was returned by your local DNS server. The trust between your laptop and your local DNS server is implicit - there is no formal authentication for resolution requests. Whatever is returned by the local DNS server is accepted by your laptop as a correct resolution.

In 2002 it was announced that BIND, a common implementation of the DNS protocol, permitted DNS server resolutions to be falsified, an attack known as cache poisoning. The diagram below gives an overview of the attack. The goal of the attacker is to poison the cache of a DNS server with false resolutions, and then poison the client cache on the next request to that DNS server.
[Diagram: overview of the DNS cache poisoning attack]
So how does the attacker get the DNS server to accept false resolutions? The attacker can make a name request to the targeted (victim) DNS server that requires a recursive resolution, and then also reply to the recursive request with false information! This is entirely possible due to the unauthenticated nature of DNS requests and replies, and a feature of an earlier version of BIND that allows simultaneous requests for the same name. The only piece of information that links a request to its reply is a 16-bit identifier, which takes on at most 65,536 distinct values. This identifier is chosen by the victim DNS server, and the attacker must guess it to have a false response accepted. The attacker therefore sends n recursive requests for the same domain and n spoofed replies with random identifiers. When n is about 700 the birthday paradox shows that one of the name requests generated by the victim DNS server will match one of the responses generated by the attacker with high probability.
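A rough back-of-envelope sketch of why n around 700 is enough, under simplifying assumptions of my own (the victim's n outstanding query identifiers are independent and uniform over the 65,536 values, and the attacker chooses n distinct identifiers for the spoofed replies):

```python
t = 65536  # number of possible 16-bit transaction identifiers

def spoof_match_probability(n, t=t):
    # Probability that at least one of the victim's n query identifiers
    # lands in the attacker's set of n distinct spoofed identifiers.
    # Each query misses the set with probability (1 - n/t), independently
    # in this simplified model.
    return 1 - (1 - n / t) ** n

for n in (100, 300, 700):
    print(n, round(spoof_match_probability(n), 4))
# n = 700 already gives a success probability above 0.999 in this model.
```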

There are many excellent articles that explain the DNS/BIND birthday attack; consider, for example, the ones here and here. Both give details on the success of the birthday attack using the following formula for a collision (match) amongst n uniform samples from t possible values

P(collision) = 1 - (1 - 1/t)^(n*(n-1)/2)

But this collision formula is not quite correct. When n > t a collision occurs with probability one, yet the collision formula above will never equal one, since (1-1/t)^k is never zero for any value of k. What's the glitch?

Well n*(n-1)/2 is the number of possible pairings amongst n objects, each of which could yield a match. But instead of computing the match as an n-to-n problem, the collision formula computes the probability of a match as a 1-to-(n*(n-1)/2) problem. That is, the collision formula assumes that each of the potential matches from pairing n objects is independent. The collision formula therefore provides an underestimate of the correct probability, which is given as

P(collision) = 1 - (1 - 1/t)*(1 - 2/t)*...*(1 - (n-1)/t)

Now when n = t + 1 the term (1 - (n-1)/t) = 0 and the formula yields a collision with probability one. The collision formula is convincing because it looks like the correct function when graphed

[Graph: collision probability versus number of DNS queries, with the conventional spoofing probability shown in yellow]

The graph says that there is a 50% likelihood of a match at around 300 samples (or DNS queries in this case), which seems fine since we expect a repeat at slightly more than sqrt(65536) = 256 samples; the correct 50% point is in fact about 1.18 * 256 = 302. As we mentioned above, the collision formula underestimates the correct probability, but the error compared to the correct value decreases as n increases. The yellow part of the graph shows the probability of the conventional spoofing attack, where the attacker has a likelihood of n/65536 of a match. In a conventional spoofing attack the attacker generates false replies to a single DNS request.
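For the curious, here is a small Python sketch (my own reconstruction, not the original authors' code) comparing the approximate collision formula, the exact product formula and the conventional spoofing probability n/65536:

```python
t = 65536  # number of distinct 16-bit identifiers

def collision_approx(n, t=t):
    # The "collision formula": treats the n*(n-1)/2 pairings as independent trials.
    return 1 - (1 - 1 / t) ** (n * (n - 1) / 2)

def collision_exact(n, t=t):
    # Exact birthday probability: 1 - (1 - 1/t)(1 - 2/t)...(1 - (n-1)/t).
    p_no_match = 1.0
    for k in range(1, n):
        p_no_match *= 1 - k / t
    return 1 - p_no_match

def spoof(n, t=t):
    # Conventional spoofing: n guesses against one fixed identifier.
    return n / t

for n in (256, 302, 700):
    print(n, round(collision_approx(n), 4), round(collision_exact(n), 4), round(spoof(n), 4))
# Both collision formulas pass 50% at roughly n = 300, where the conventional
# spoofing probability is still below 0.005.
```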

Monday, November 17, 2008

Examples of Risk Profile Graphs

I was looking through the current set of risk presentations on slideshare, and found an interesting one called Creating Risk Profile Graphs, adapted from an article on managing project risks by Mike Griffiths over at the Leading Answers blog. The approach taken is to define and rate project risk factors and then plot their development over time. The graph below shows the profile of 10 technology risks over 4 months, charted in Excel as "stacked" area graphs. There is another view on the same risk factors here.

[Image: risk profile graph of 10 technology risks over 4 months, plotted as stacked areas]

The risk factors measured include JDBC driver performance, calling Oracle stored procedures via a web service, and legacy system stability. The project manager rates each risk factor monthly for severity and probability to produce a numerical measure of the risk factor. Plotted over time, we see 4 persistent project risks (the steel blue, light blue, tan and pink plots).
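A minimal sketch of how such a profile might be assembled; the risk names and monthly (probability, severity) ratings below are invented for illustration, not taken from the article:

```python
# Hypothetical monthly ratings on a 1-5 scale: (probability, severity) per month.
# The numerical measure of each risk factor is probability * severity.
ratings = {
    "JDBC driver performance": [(4, 4), (4, 3), (3, 3), (2, 2)],
    "Oracle stored procedures via web service": [(3, 4), (3, 4), (3, 3), (3, 3)],
    "Legacy system stability": [(2, 5), (2, 5), (2, 4), (2, 4)],
}

scores = {risk: [p * s for p, s in months] for risk, months in ratings.items()}
for risk, series in scores.items():
    print(f"{risk}: {series}")
# Each series is one band of the stacked area chart, month by month.
```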

I was not sure if this was a good way to create a risk profile, so I did a Google image search to see what else was out there. The results were very interesting, and reflect the varying approaches risk managers use to present their results.

This first risk graph really leapt off the search page results, if for no other reason than the prominent pink silhouette of Dilbert's manager in the middle. I must confess that the meaning of the graph was not apparent, but luckily it's part of an article called Risk Management for Dummies (click and scroll down) where it is explained to be a depiction of the peaks and troughs in the risk exposure pipeline. The graph is tracking the cost of risks over time.

[Image: graph tracking the cost of risks over time, featuring Dilbert's manager]

The next profile is a common, if not traditional, format for risks, taken from the integrated risk management guide of the Canadian Fisheries and Oceans department. This is also called a risk rating table, risk map or risk matrix. Likelihood is represented on the horizontal axis and impact (severity) on the vertical axis.

[Image: 5 x 5 risk rating table from the Canadian Fisheries and Oceans integrated risk management guide]

In the first two graphs these two dimensions were combined into a single risk measure (vertical) and then plotted over time (horizontal). The graph above has no representation of time; risks are rated against their current status. The red, yellow and green regions in the table correspond to measures of high, medium and low risk respectively, and the exact choice of the boundaries is flexible but normally fixed across a company. The table is 5 x 5, but it is also common to have 4 x 4, 6 x 4, and so on. Below is a 3 x 3 variant of the risk rating table with the high risk region moved to the upper left corner rather than the traditional upper right corner. Note that there is an additional qualitative region and that the regions do not respect the table format. This one is also about fish and oceans, and also from those Canadians, but part of another guide.

[Image: 3 x 3 risk rating table variant with the high risk region in the upper left corner]
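A minimal sketch of how such a rating table assigns a band; the boundary values below are illustrative assumptions only, since (as noted above) each organisation fixes its own:

```python
def risk_band(likelihood, impact):
    # Map 1-5 likelihood and 1-5 impact ratings to a colour band.
    # The thresholds are examples, not a standard.
    score = likelihood * impact
    if score >= 15:
        return "red (high)"
    if score >= 6:
        return "yellow (medium)"
    return "green (low)"

print(risk_band(5, 4))  # red (high)
print(risk_band(3, 3))  # yellow (medium)
print(risk_band(1, 2))  # green (low)
```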

The risk rating or profile table below is quite close to the format used in my company, taken from a recent annual report of a healthcare company. Again there are the two dimensions of likelihood and severity. There are 6 risks in the table, each represented by a current rating and a target rating after mitigation actions have been performed (the arrows show the improvement of each risk). Note that the mitigation of risk 6 yields no improvement, a reasonably common occurrence in practice.

[Image: risk profile table with current and target ratings for 6 risks]

Note also that the red, yellow and green have been replaced by different shades of the single colour blue. The colour red has such a strong association with danger that using it in a risk profile is often counterproductive, since all attention is drawn to the red region. In the graph above, the shades of blue convey the relative profiles of the risks without that distraction.

The next risk profile is more graphic than graph, and communicates that concentrating your portfolio in shares is high risk, while short term deposits are low risk. It is taken from an investment article in a newspaper. This graph(ic) is not very sophisticated, but it does get its main point across.

[Image: investment risk graphic ranking shares as high risk and short term deposits as low risk]

The final profile is more quantitative. The graph rates probability against result (impact), which can be positive (upside), neutral or negative (downside). It is from a post on the risk of global warming. In this case the upside seems rather limited while the negative impact tail extends much further. So the profile of this risk is dominated by its downside. The fat tail of the downside is hard to capture in a rating table for example.

[Image: probability versus impact profile for global warming risk, showing a fat downside tail]


Friday, November 7, 2008

Weapons of Math Instruction: The Birthday Paradox

In November 2007, a family from Ohio was blessed with the birth of their third child, Kayla, who arrived on October 2nd. Kayla will share the same birthday as her two older siblings, who were also born on October 2nd, in 2003 and 2006 respectively. Bill Notz, a statistics professor at Ohio State University, concluded that the odds of a family having three children born on the same date in different years are less than 8 in a million. Meanwhile, the Paez family from San Diego, California, also had their third child born on July 30th of this year. Are these events extraordinary, or just plain ordinary?

For some time I have been wanting to start a series of posts on Weapons of Math Instruction for IT Risk. The name comes from a joke I saw a few years ago about the evil Al-gebra group. I will be aiming for less math and more instruction since the text format of the blog is somewhat formula-challenged. The first post will be on the Birthday Paradox, a perennial favourite, recently recounted to me by a senior manager. Also, there was a recent furore over the reported odds of DNA matching which at heart was a misinterpretation of the Birthday Paradox.

Quadratic Football

The Birthday Paradox (BP) poses the following simple question: how many people need be gathered in a room so that the chance of any two people sharing a birthday is at least 50%? We assume that each person is equally likely to be born on each of the 365 days of the year, and we ignore leap years. Given these assumptions, the surprising answer is 23. The BP is not a true paradox (a logical contradiction) but is rather deeply counterintuitive, since 23 seems too small a number to produce a common birthday. Keep this in mind next time you are watching a football (soccer!) game, which has 23 people on the field (the two teams of 11 plus the referee). If you want to see the mathematics behind the result, Google leaves you spoilt for choice with over 53,000 hits. As is often the case with mathematical topics, the Wikipedia article has an excellent explanation, graphics and references.

The BP is an archetypal problem in probability whose solution is less important than the principles it demonstrates. What is the lesson? If we are trying to obtain a match on an attribute with M distinct values, then just under sqrt(2*M) uniform samples are required for better than 50% success. For birthdays, M = 365 and sqrt(2*365) = 27.0, a little higher than the correct value of 23. To generalise, if you were looking for a match amongst 1,000,000 attributes then you would need fewer than 1,500 uniform samples - probably fewer than you would have estimated/guessed. Where does the sqrt(2*M) term come from?

Well if we have N objects to compare for a potential match, then the number of possible comparisons is N*(N-1)/2. This number can be approximated as (N^2)/2, which grows in proportion to the square of N, and is therefore called a quadratic function. The source of the surprise in the BP is that we are not thinking quadratically, and therefore underestimate the number of potential matches. When N = sqrt(2*M) then (N^2)/2 yields M possible matches. If we assume that the probability of a match is 1/M, then the average number of matches over the M possibilities is M*(1/M) = 1. A closer analysis shows that there is a 50% chance of a match when N is approximately sqrt(2*M*ln2), where sqrt(2*ln2) = 1.18. You can find a straightforward 1-page derivation of a slightly higher bound here.
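A short Python check of these numbers (a sketch of the standard calculation, nothing specific to the derivation linked above):

```python
import math

def p_shared_birthday(n, M=365):
    # Probability that at least two of n people share a birthday,
    # assuming all M days are equally likely.
    p_all_distinct = 1.0
    for k in range(1, n):
        p_all_distinct *= 1 - k / M
    return 1 - p_all_distinct

print(round(p_shared_birthday(22), 3))              # 0.476
print(round(p_shared_birthday(23), 3))              # 0.507 - the famous answer
print(round(math.sqrt(2 * 365 * math.log(2)), 1))   # 22.5, the 1.18*sqrt(M) estimate
```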

Johnny's Fallacy with DNA matching

Another reason why the BP is puzzling is that people often mistake the question to be the following: how many people need to be gathered in a room such that the chance of someone sharing a birthday with me is at least 50%? In this case the person is confusing a 1-to-N matching problem (the person to the group) with the required N-to-N matching problem (the group to itself). Famed US talk show host Johnny Carson made this mistake (as related here, p.79), which we will refer to as Johnny's Fallacy. One night there were about 100 people in his studio audience, and he started to search for someone with the same birthday as him. He was disappointed when no one with his birthday turned up. In 100 people the probability that someone would have the same birthday as Johnny is just under 1/4, while the probability that some pair of people share a birthday is over 0.9999, essentially a certainty.
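The two calculations for an audience of 100, as a quick sketch under the same uniform-birthday assumption:

```python
def p_match_with_johnny(n, M=365):
    # 1-to-N problem: at least one of n people shares Johnny's particular birthday.
    return 1 - (1 - 1 / M) ** n

def p_any_shared_birthday(n, M=365):
    # N-to-N problem: at least one pair among n people shares a birthday.
    p_all_distinct = 1.0
    for k in range(1, n):
        p_all_distinct *= 1 - k / M
    return 1 - p_all_distinct

print(round(p_match_with_johnny(100), 3))    # 0.24 - just under 1/4
print(round(p_any_shared_birthday(100), 7))  # 0.9999997 - essentially certain
```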

Recently there was an occurrence of Johnny's Fallacy on the topic of identifying suspects using DNA testing. The episode was reported in the Freakonomics blog under the title Are the FBI's Probabilities About DNA Matches Crazy?, reporting on a piece from the Los Angeles Times, How reliable is DNA in identifying suspects?. In 2001, Kathryn Troyer, a crime lab analyst in Arizona, was running some tests on the state's DNA database and found two felons with remarkably similar genetic profiles. Remarkable in the sense that the men matched at nine of the 13 locations on chromosomes (or loci) commonly used to distinguish people, and that the FBI estimated the odds of such a match to be 1 in 113 billion. This is the 1-in-N probability of a DNA match. Since her initial discovery, Troyer has found that among about 65,000 felons there were 122 pairs that matched at nine of 13 loci. These matches are the result of the N-to-N probabilities. Johnny's Fallacy here would be to conclude that a search of the Arizona DNA database against a given DNA sample would return 122 matches.

Gradually Troyer's findings spread and raised doubts concerning the veracity of DNA testing. The FBI has declared that the Troyer searches (sounds like something from the X Files) are misleading and meaningless, which is not a particularly helpful assessment of Johnny's Fallacy. David Kaye, an expert on science and the law at Arizona State University, remarked that since people's lives are at stake based on DNA evidence, “It [the Troyer matches] has got to be explained.” Steven Levitt in the Freakonomics blog steps up to the plate to provide some numbers. He assumes that the likelihood of a match at a given locus is 7.5%, yielding odds of a 13-loci match of about 1 in 400 trillion, and of a 9-loci match of about 1 in 13 billion. So a DNA database of 65,000 people yields over 2 billion potential matches, and about 100 expected matches on at least 9 loci.
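A back-of-envelope reconstruction of those figures in Python; this is my reading of the numbers quoted above under the stated 7.5% per-locus assumption, not Levitt's actual calculation:

```python
from math import comb

p_locus = 0.075                 # assumed chance two people match at one locus
felons = 65000

p_13_loci = p_locus ** 13       # match at all 13 loci
p_9_specific = p_locus ** 9     # match at one particular set of 9 loci
pairs = felons * (felons - 1) // 2

print(f"13-loci match: 1 in {1 / p_13_loci:.1e}")       # about 1 in 400 trillion
print(f"9 specific loci: 1 in {1 / p_9_specific:.1e}")  # about 1 in 13 billion
print(f"pairs in the database: {pairs:,}")              # just over 2 billion
expected = pairs * comb(13, 9) * p_9_specific           # any 9 of the 13 loci
print(f"expected 9-loci matching pairs: about {expected:.0f}")
```

This crude count ignores the condition that the remaining four loci differ, which would bring the figure down somewhat, but the order of magnitude - roughly a hundred matching pairs among 65,000 felons - is the point.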

Levitt's numbers are examples to explain the fallacy, or actually a version of the BP where matching on a single personal trait has been replaced by matching on a number of loci. If you google "Kathryn Troyer" there are over 1,000 hits, and her results have generated a storm of controversy. Reading a few of the articles returned by Google shows that it will be some time before people fully understand the apparent paradox in this case.

As I mentioned in the introduction to this post, a senior manager recently told me that he used the BP as an example to get people thinking quantitatively in risk assessment workshops. Numbers come with their own logic - calculations can be surprising.
