There was a recent excellent post on the RiskAnalysis.Is blog reviewing a debate between security gurus Bruce Schneier and Marcus Ranum on the topic of risk management. The post summarized the debate as something of a stalemate, ending in agreement that the lack of data is the root cause of the unsatisfactory state of IT risk management. The post goes on to make some excellent points about risk and data, which deserves a post of its own to describe and ponder. Here I will make a few points about data, models and analysis.
Data is a representation of the real world, observable but in general difficult to understand. The model is a simplification of reality that can be bootstrapped from the data. Finally, analysis is the tools and techniques that extract meaning from the model, which hopefully allows us to make material statements about the real world. Data, Model, Analysis, Meaning.
Let's take a look at how the famous PageRank algorithm creates meaning from data via analysis of a model. We can all really learn something here.
The Wisdom of a Random Crowd of One
The hero of the PageRank story is an anonymous and robotic random surfer. He selects a random (arbitrary) starting page on the internet, looks at links on that page, and then selects one to follow at random (each link is equally likely to be selected). On the new page, he again looks over the links and surfs along a random link. He happily continues following links in this fashion. However, every now and again, the random surfer decides to jump to a totally random page where he then follows random links once again. If we could stand back and watch our random surfer, we would see him follow a series of random links, then teleport to another part of the internet, follow another series of links, teleport, follow links, teleport, follow links, and so on, ad infinitum.
Let's assume that as our random surfer performs this mix of random linking and teleporting, he also takes the time to cast a vote of importance for each page he visits. So if he visits a page 10 times, then the random surfer allocates 10 votes to that page. Surprisingly, the PageRank metric is directly derived from the relative sizes of the page votes cast by this (infinite) random surfer process.
This seems deeply counterintuitive. Why would we expect the surfing habits of a random process to yield a useful guideline to the importance of pages on the internet? While the surfing habits of people may be time consuming, and sometimes downright wasteful, we probably all think of ourselves as more than random-clicking automatons. However the proof of the pudding is in the searching, and Google has 70% of the search market. So apparently when all of the erratic meanderings of the random surfer are aggregated over a sufficiently long period, they do in fact provide a practical measure of internet page importance. We cannot explain this phenomenon any better than by simply labelling it as the wisdom of a random crowd of one.
Breathing Life into the Random Surfer
The data that can be gathered relatively easily by web crawlers are the set of links on a given page and the set of pages they point to. Let's assume that there are M pages currently on the internet, where M is several billion or so. We can arrange the link information into M x M matrix P = [ Pij ] where Pij is the probability that page Pi links to page Pj (Pij is just the number of links on Pi to Pj divided by the total number of links on Pi).
The matrix P is called stochastic since the sum of each row is 1, which simply means that any page must link to somewhere (if Pi has no links then it links to itself with probability 1). So P represents the probability of surfing (linking) from Pi to Pj in one click by a random surfer. The nice property of P is that P^2 = P*P gives the probability that Pj can be reached from Pi in two clicks by the random surfer. And in general P^N gives the probability that the random surfer ends up on page Pj after N clicks, starting from page Pi.
Can we say anything about P^N as N becomes large? Well if P is ergodic (defined below) then there will exist a probability vector
L = (p1, p2, ..., pM)
such that as N becomes large then
P^N = (L, L, ..., L)^t
This says that for large N, the rows of P^N are all tending to the common distribution L. So no matter what page Pi the random surfer starts surfing from, his long run page visiting behaviour is described by L. We learn quite a bit about the random surfer from L.
As we said above, the long run probability vector L only exists for matrices that are ergodic. Ergodic matrices are described by 3 properties: they are finite, irreducible, and aperiodic. Our matrix P is large but certainly finite. Two pages Pi and Pj are said to communicate if it is possible to reach Pj by following a series of links beginning at page Pi. The matrix P is irreducible if all pairs of pages communicate. But this is clearly not the case, since some pages have no links for example (so-called dangling pages). If our random surfer hits such a page then he gets stuck, and we don't get irreducibility and we don't get L.
To get the random surfer up and surfing again we make the following adjustment to P. Recall that we have M pages and let R be the M x M matrix for which each entry is 1/M . That is, R models the ultimate random surfer who can jump from any page to any page in one click. Let d be a value less than one and create a new matrix G (the Google matrix) where
G = d*P + (1-d)*R
That is, G is a combination of P (real link data) and R (random link data). Our random surfer then follows links in P with probability d or jumps (teleports) to a totally random page with probability (1-d). The value of d will be something like 0.85.
Its should be easy to see that G is irreducible since R enables any two pages to communicate in one clieck. Without going into details, G is also aperiodic since it is possible for a page to link to itself (which is also possible in P as well). So G is ergodic and we can in theory compute the long run page distribution L of the random surfer.
So now that we know L exists for G, it remains to compute it. We have not as yet considered that the number of pages M is several billion and growing. So a direct representation of G as an M x M matrix would require storage on the order of 10^(18), or in the exabyte range (giga, tera, peta, exa). Luckily most pages are likely to have only a few links (say less than 20) and we can represent G using lists which will bring us back into the gigabyte range.
Computing L from G is a large but tractable computation. L is an eigenvector of G and there is an iterative algorithm for computing L from G called the power method. The power method begins with an approximation for L and improves on each iteration. The rate of convergence to the true value of L is geometrically fast in terms of the parameter d. Therefore we can compute the long run behaviour of our random surfer.
The diagram below (available from Wikipedia) shows the PageRank analysis for a simple collection of pages. The arrows represent the link data and the pages are drawn in size relative to their PageRank importance. If we divided the number on each page by 100 this would be our long run probability vector L.
What did we learn?
Recall that at the beginning of the post I stated that we need to get beyond pining for data and start to think in terms of data, models and analysis (then meaning). If we now look at the PageRank algorithm we can break it down into
- Data: Raw page linking represented in P
- Model: The Google matrix G = d*P + (1-d)*R, a random perturbation of P
- Analysis: The long run probabilities L of the random surfer.
It could be said that PageRank is one part brilliance and two parts daring. It is not obvious at all that L would produce informative page rankings. However the point is now moot and the wisdom of a random crowd of one has prevailed.
The absence of data has been a scapegoat for years in IT Risk. The real issue is that we don't know what data we want, we don't know what we would do with the data, and we don't know what results the data should produce. These are 3 serious strikes. It seems that everybody would just feel better if we had more data, but few people seem to know (or say) why. We are suffering from a severe and prolonged case of data envy.
In IT Risk we are stubbornly waiting for a data set that is self-modelling, self-analysing and self-explaining. We wish to bypass the modelling and analysis steps, hoping that meaning can be simply read off from the data itself. As if data were like one of those "Advance to Go" cards in Monopoly where we can skip over most of the board and just collect our $200. The problem is that we keep drawing cards that direct us to "Go Back Three spaces" or "Go to Jail".