Showing posts with label Risk Management.

Thursday, September 22, 2011

Liability for Risk Decisions

I am currently in between positions, somewhat happily, and am casting my net of interest a bit wider than my traditional roles in IT Security and Risk. One position that caught my eye, from a global reinsurer in town, was the role of Earthquake Expert within their Natural Catastrophe department (or Nat Cat in insurance lingo). I really don’t have any specific background in this area, but I sometimes entertain the idea that I can transfer hard-learnt crypto math skills into a numerate role like this one, which calls for extensive modeling and prediction. One might also think that this would be a nice and cozy niche in which to ply your trade as a specialist, holding something of a privileged position.

Well, I was disabused of any such notion this week when I read that six Italian scientists and a former government official are being put on trial for the alleged manslaughter of the 309 people who died in the 2009 L'Aquila earthquake in Italy.

The seven defendants were members of a government panel, called the Serious Risks Commission (seriously), who were asked to give an opinion (or risk statement) on the likelihood that L'Aquila would be struck by a major earthquake, based on an analysis of the smaller tremors the city had been experiencing over the previous few months. The panel verdict delivered in March stated that there was "no reason to believe that a series of low-level tremors was a precursor to a larger event". A week later the city suffered an earthquake of magnitude 6.3 on the Richter Scale, denoting a “strong quake”.

The crux of the case against the scientists is that they failed to predict the strong quake coming to L'Aquila, and so denied its inhabitants a proper evacuation. The defense rebuttal is simply that such a prediction is impossible, and that the scientists cannot be held accountable for this unreasonable expectation: they cannot be expected to function as a reliable advance warning system. The international scientific community has weighed in on the side of the defendants with a one-page letter from the American Association for the Advancement of Science, which noted that there is no reliable scientific process for earthquake prediction, and that the scientists should not be treated as criminals for adhering to the accepted practices of their field.

Recently people were evacuated from New York City as a precaution against the impact of Hurricane Irene. The hurricane passed by New York causing far less damage than expected, and yet there were still complaints from residents about being asked to leave their homes “unnecessarily”. It seems that authorities cannot win in these matters unless they can predict the future accurately.

Monday, May 2, 2011

ISACA Risk Assessment Guidelines

I uploaded a 15-page guideline from ISACA for audit risk assessments to my Scribd collections. The document gives a reasonable overview of how a standard IT audit assessment can be enhanced from a risk perspective, taking into account additional factors beyond controls and their gaps.

Thursday, December 23, 2010

Calculus vs. Probability

I am trying out listening to podcasts on my – yes – iPod, during what was figuratively described to me as my “downtime”. In Zurich for me this means being on trams and trains, and walking between them or to them. So I went looking for captivating podcasts and of course ended up at the TED site, where you can download any number of interesting speakers and topics. I came across a short and poignant talk by mathematician Arthur Benjamin on his formula for changing math education.


His simple approach is to switch the pinnacle of math education from calculus to probability and statistics, because while the former is beautiful yet little used, the latter two topics are very practical and in high demand. In short, we need to better understand risk. Below is the full text of his short talk, where I have highlighted a few phrases in bold.

Now, if President Obama invited me to be the next Czar of Mathematics, then I would have a suggestion

The mathematics curriculum that we have is based on foundation of arithmetic and algebra. And everything we learn after that is building up towards one subject. And at top of that pyramid, it's calculus. And I'm here to say that I think that that is the wrong summit of the pyramid ... that the correct summit -- that all of our students, every high school graduate should know -- should be statistics: probability and statistics. (Applause)

I mean, don't get me wrong. Calculus is an important subject. It's one of the great products of the human mind. The laws of nature are written in the language of calculus. And every student who studies math, science, engineering, economics, they should definitely learn calculus by the end of their freshman year of college. But I'm here to say, as a professor of mathematics, that very few people actually use calculus in a conscious, meaningful way, in their day to day lives. On the other hand, statistics -- that's a subject that you could, and should, use on daily basis. Right? It's risk. It's reward. It's randomness. It's understanding data.

I think if our students, if our high school students -- if all of the American citizens -- knew about probability and statistics, we wouldn't be in the economic mess that we're in today. Not only -- thank you -- not only that ... [but] if it's taught properly, it can be a lot of fun. I mean, probability and statistics, it's the mathematics of games and gambling. It's analyzing trends. It's predicting the future. Look, the world has changed from analog to digital. And it's time for our mathematics curriculum to change from analog to digital. From the more classical, continuous mathematics, to the more modern, discrete mathematics. The mathematics of uncertainty, of randomness, of data -- and that being probability and statistics.

In summary, instead of our students learning about the techniques of calculus, I think it would be far more significant if all of them knew what two standard deviations from the mean means. And I mean it. Thank you very much. (Applause)

I could not agree more. The world is discrete for me, and very few of the problems that I encounter succumb to integration.

Monday, December 6, 2010

Snakes in Suits – the risks from psychopaths in the workplace

A telling presentation from Holly Andrews at a recent IRM meeting on dealing with psychopaths in the workplace (and yes, the boardroom), derived from the 2006 book Snakes in Suits: When Psychopaths Go to Work. The presentation describes how workplace psychopaths burrow into positions of power and, amongst other things, assume more risk than is sensible. There is a wonderful process chart which shows how such people operate.

[image: process chart]

Transitional organisations can be seen as ideal “feeding grounds” for psychopaths since

  • There are fewer constraints, and loose rules allow the psychopath freedom to act out their manipulation
  • The fast-changing environment provides stimulation for the psychopath whilst serving to cover up their failings
  • There is the potential for large rewards in terms of money, power, status and control


Sunday, September 5, 2010

Will there be an IT Risk Management 2.0?

This is the title of a short talk I gave recently at an OWASP chapter meeting in Zurich. The audience was small but engaged, and I went over time by quite a bit.  I need to develop the talk further but it is a decent v1.0.


Thursday, March 18, 2010

The Fabled 25 Sigma Event

Last week I was reading a document published by my company called Dealing with the Unexpected, which gives some lessons learnt from the recent credit crisis. Early in the paper the authors speak about a 10 sigma event, or an event that we only expect to see once in 10,000 years. This piqued my interest because I remembered an infamous statement made by a financial leader during the onset of the crisis, to the effect that we were experiencing repeated 25 sigma events. A 10 sigma event is already quite unlikely, but a 25 sigma event is just absurd – exponentially less likely, but exactly how much less? A one in a million year event? One in a billion years?

First, what is sigma? In statistics and probability, the lower case Greek letter sigma is used to denote the standard deviation of a distribution which, as the name implies, is the accepted unit for measuring how much an outcome can vary from its mean or average. A Wikipedia article lists the following table for the likelihood of sigma deviations for the standard normal distribution

[table: frequency of n-sigma deviations for the standard normal distribution]

So at 6 sigma deviations we are already talking about events that occur once every 1.5 million years. Note that the scale is not linear, and the difference between 2 and 3 sigma events is less than the difference between 3 and 4 sigma events.

A 25 sigma event must be very unlikely indeed – so unlikely that the references I searched don’t even bother listing this value, including the news articles that carried the original quote. But after searching directly for sigma events I found a wonderful paper from researchers at the business school of University College Dublin. The researchers are a little more pessimistic (by a factor of 2) in their calculations, since they are only concerned with positive deviations away from the mean, as shown in the diagram below for the 2 sigma case.

[diagram: one-tailed 2 sigma deviation of the standard normal]

The paper reminds us that the 25 sigma quote came from David Viniar, CFO of Goldman Sachs, who actually said in the Financial Times in August 2007 that

We were seeing things that were 25-standard deviation moves, several days in a row.

Not just one 25 sigma event, but several! The likelihood of such a deviation works out to an expected waiting time of around 1-in-10^{135} years for a single 25 sigma event. For some perspective, this is far less likely than guessing an AES-256 key in one attempt. The researchers offer the following comparison as to how unlikely such an event is

To give a more down to earth comparison, on February 29 2008 the UK National Lottery was offering a prize of £2.5m for a ticket costing £1. Assuming it to be a fair bet, the probability of winning the lottery on any given attempt is therefore 0.0000004. The probability of winning the lottery n times in a row is therefore 0.0000004^n, and the probability of a 25 sigma event is comparable to the probability of winning the lottery 21 or 22 times in a row.
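
Those figures are easy to verify. Here is a minimal sketch in Python, assuming i.i.d. daily moves under a standard normal distribution, the paper's one-tailed convention, and 252 trading days per year:

```python
import math
from scipy.stats import norm

# One-tailed probability of a single daily move of 25 standard deviations,
# and the expected wait for one (assuming 252 trading days per year).
p = norm.sf(25)                        # P(Z > 25) ~ 3e-138
years = 1.0 / (p * 252)
print(f"P(25 sigma) = {p:.2e}, i.e. roughly one in {years:.1e} years")

# The lottery comparison: how many 1-in-2.5-million wins in a row match it?
p_lottery = 1 / 2.5e6                  # the quoted ~0.0000004 per ticket
n = math.log(p) / math.log(p_lottery)
print(f"comparable to winning the lottery {n:.1f} times in a row")
```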

Either Mr. Viniar was very confused or his models were very confused. Or potentially both. A final quote from the paper

However low the probabilities, and however frequently 25-sigma or similar events actually occur, it is always possible that Goldman’s and other institutions that experienced such losses were just unlucky – albeit to an extent that strains credibility.

But if these institutions are really that unlucky, then perhaps they shouldn’t be in the business of minding other people’s money. Of course, those who are more cynical than us might suggest an alternative explanation – namely, that Goldmans and their Ilk are simply not competent at their job. Heaven forbid!

Yes Heaven forbid!

Friday, February 26, 2010

A Short Security Manifesto

From the Falcon's View

Stop talking about traditional "risk management" as some sort of magical rubric or panacea.
Start talking about threat modeling and legal defensibility.

Stop using ad hoc approaches to security architecture and solutions.
Start adopting a holistic, systemic ISMS-like approach.

Stop delegating ownership of security to IT or other non-business leadership.
Start requiring execs and the board to directly own and be responsible for security.

Stop relying on shortcuts to survive audits.
Start demonstrating actual due diligence by adopting a reasonable standard of care.

Stop looking for ROI to "justify" security.
Start thinking of security as a business enabler that facilitates better decisions and helps protect the business during both the good and the bad times.

Thursday, February 18, 2010

Six Myths in Assessing Risk

Great 1-page summary with graphics from business advisory firm Corporate Executive Board

  1. The biggest risk my company faces is financial risk
  2. My company is safe because we review risks and prioritize mitigation efforts annually
  3. We are good at risk-sensing because we have invested in enterprise risk management (ERM) systems
  4. We are well protected because we have a strong quantitative model to measure risk
  5. Our risk assessment is comprehensive because we account for likelihood and impact
  6. We can sense and protect the business better because we manage risks at the business unit (BU) level


Tuesday, January 26, 2010

Slides from my ZISC talk on Black Swans in IT Security

In December I gave a talk at the Zurich Information Security Colloquium, based on a post I made in October 2008. The slides can now be found on Scribd.

Tuesday, October 6, 2009

Risk Analysis Rising

In June I posted on a paper called A Risk Analysis of Risk Analysis, and from that post

The title of this post is taken from a both sobering and sensible paper published last year by Jay Lund, a distinguished professor of civil engineering at the University of California (Davis), who specialises in water management. The paper presents a discussion of the merits of Probabilistic Risk Assessment (PRA), which is a “systematic and comprehensive methodology to evaluate risks associated with a complex engineered technological entity”. PRA is notably used by NASA (see their 320 page guide) as well as essentially being mandated for assessing the operational risks of nuclear power plants during the 80’s and 90’s.

It is a wonderfully insightful paper that I uploaded to Scribd, which recently informed me that the paper is now on their hotlist. You can get to the paper from the link below. Highly recommended!

Sunday, September 20, 2009

My Top 10 Security and Risk Uploads to Scribd

I have been reading and uploading to Scribd for several years now. It is really a vast source of documents, and it seems it has become a victim of its own popularity, since so many varied and inconsequential documents are now finding their way to the site. The search function is not quite as effective as it was, and, as has always been true, the site itself is quite slow.

Over the last couple of years I have slowly uploaded just over 40 documents and presentations, mostly in the area of security and risks. For the last few months I have been getting just over 100 hits per day, and about 12 downloads per day. The total number of hits is now getting close to 20,000, and will reach that mark in the next week. Here is a list of the top 10 visited documents that I have uploaded – the number of reads is in parentheses, and documents in bold type are written by me

  1. A Data Centric Security Model (1529)

  2. ISACA Risk Framework (1498)

  3. How much is enough? A Risk Management Approach to Computer Security (1290)

  4. Does IT Security Matter? (1127)

  5. Entropy Bounds for Traffic Confirmation (886)

  6. Risk Analysis of Power Station survival of Cyber (712)

  7. Password Authentication on Mac OS X from Dave Dribin (704)

  8. An analysis of the Linux Random Number Generator (702)

  9. The Core Components of the Entrust PKI v5 (677)

  10. Canadian Government 1999 Threat and Risk Assessment Guide (628)

Wednesday, June 24, 2009

The Risk of Degradation to GPS

In April the Government Accountability Office (GAO), the audit and investigative arm of the US Congress, announced the results of their study on sustaining the current GPS service. The main finding was that the GPS service is likely to degrade over the next few years, both in terms of coverage and accuracy, due to a decrease in the number of operational satellites. Using data provided by the US Department of Defense (DoD), the GAO ran simulations to determine the likelihood that GPS can be maintained at its agreed performance level of 24 satellites operating at 95% availability. The graph below shows a 24-strong GPS constellation dipping below 95% availability in the 2010 fiscal year, and dropping as low as 80% before recovering in 2014. The jittery sawtooth nature of the graph is derived from the tussle between the failure of existing satellites and the launching of replacements, with the failure rate dominating for the next few years.
[graph: GPS constellation availability projection, from the GAO report]

Needless to say the GAO findings have been widely discussed, and were further publicised in a recent televised congressional hearing. The US Air Force, which runs the GPS program for the DoD, has had to assure its military peers, various congressmen and an anxious public that the GPS service is in fact not on the brink of failure – a scenario not even considered by the GAO report. Articles in the popular press, such as Worldwide GPS may die in 2010, say US gov from the Register, are not helping matters. So how did the GPS service end up in this predicament? According to GAO, the culprit is poor risk management in the execution of the GPS modernisation program.
GPS is a critical service, particularly for the military, as it provides information for the calculation of position, velocity and time. As noted in the GAO report, “GPS has become a ubiquitous infrastructure underpinning major sections of the economy, including telecommunications, electrical power distribution, banking and finance, transportation, environmental and natural resources management, agriculture, and emergency services in addition to the array of military operations it services”. Specifically, GPS is used to guide bombs and missiles to their targets – and we don’t want inaccuracy in those calculations!
There are currently 31 operational satellites, orbiting 12,600 miles (20,200 kilometres) above the Earth, a seemingly safe margin over the required 24. The constellation has grown to this size as the current roster of satellites have performed far beyond their expected operational lifetimes. Even so, according to a DoD report issued last October, 20 satellites are past their design life, and 19 are without redundancy in critical hardware components.
The main threat scenario is that a substantial number of satellites will reach their operational end-of-life before they can be replaced, thus reducing the size of the constellation. Or simply put, the satellite failure rate may exceed the refresh rate. This is not really a question of whether GPS will become extinct (all satellites fail) since GPS will become ineffective long before the number of satellites gets anywhere near zero.
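
The GAO's actual models are not public, but the shape of the exercise is easy to sketch as a Monte Carlo simulation. Every rate below is an invented illustration, not a GAO or DoD figure:

```python
import random

# Toy simulation of constellation size: satellites fail at random each year,
# while a fixed number of replacements is launched. Rates are assumptions.
FAIL_P = 0.12            # assumed per-satellite failure probability per year
LAUNCHES = 2             # assumed replacement launches per year
YEARS, TRIALS = 10, 10_000

met_commitment = [0] * YEARS
for _ in range(TRIALS):
    sats = 31                                 # today's operational count
    for year in range(YEARS):
        sats -= sum(random.random() < FAIL_P for _ in range(sats))
        sats += LAUNCHES
        met_commitment[year] += (sats >= 24)  # the 24-satellite commitment

for year, hits in enumerate(met_commitment, start=1):
    print(f"year {year:2d}: P(constellation >= 24) = {hits / TRIALS:.2f}")
```
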
What is the impact of a degraded GPS service? Well, the first point is that GPS currently delivers a much better service than committed to, due to the additional satellites above the required 24. So the service impact when dropping below 24 satellites will be quite noticeable. The accuracy of GPS-guided missiles and bombs will decrease, thereby increasing the risk of collateral damage. This leads to a vicious circle where even more missiles or bombs will be required to take out a given target.
Since the current generation of satellites has lasted so long, and GPS still remains at risk of dropping below a 24-strong constellation, there must be some problems with the rate at which the constellation is being replenished. And according to the GAO report, there have indeed been severe problems in executing the GPS program as planned. The current program has experienced cost increases and schedule delays: the launch of the first new satellite is almost 3 years late, and the cost to complete the new program will be $870 million over the original estimate.
GAO cites a multitude of reasons for this predicament including multiple contractor mergers, moves and acquisitions, technology over-reach (a common malady for military projects), the short tenure of program leaders, and general “diffuse leadership” (no one group or person is really in charge).
GAO strongly recommends an improved risk management process. In a recent post, A Risk Analysis of Risk Analysis, I reviewed an article on when to apply a sophisticated risk methodology called Probabilistic Risk Assessment (PRA). The conclusion was that the difficulty, expense and potential inaccuracy of PRA can only be justified when projects are on a grand scale, and the multi-billion dollar GPS program certainly qualifies. And here the risk equation is not merely about technicalities and project management (hard as they are). There is also an overarching directive from the US government to be the premier global provider of GPS services. Europe, Russia and China are creating their own constellations, but relying on these “foreign” constellations does not seem to be an option.
Various representatives from the DoD have responded to the GAO report, stating that action must and will be taken to improve the current GPS constellation. It is likely that the service will experience degradation over the next 5 years, but the DoD claims it can be managed and predicted (you can calculate when and where there will be gaps). Let’s hope they’re right.

Thursday, June 4, 2009

A Risk Analysis of Risk Analysis

The title of this post is taken from a both sobering and sensible paper published last year by Jay Lund, a distinguished professor of civil engineering at the University of California (Davis), who specialises in water management. The paper presents a discussion of the merits of Probabilistic Risk Assessment (PRA), which is a “systematic and comprehensive methodology to evaluate risks associated with a complex engineered technological entity”. PRA is notably used by NASA (see their 320 page guide) as well as essentially being mandated for assessing the operational risks of nuclear power plants during the 80’s and 90’s.

Professor Lund’s views are derived from his experiences in applying PRA to decision-making and policy-setting for effective water management, as well as from teaching PRA methods. His paper starts with two propositions: (1) PRA is a venerated collection of mathematically rigorous methods for performing engineering risk assessments, and (2) PRA is rarely used in practice. Given the first proposition, he seeks to provide some insight into the “irrational behaviour” that has led to the second. Why don’t risk assessors use the best tools available?

Discussions on the merits of using modeling and quantitative risk analysis in IT Security flare up quite regularly in the blogosphere. Most of the time the discussions are just storms in HTML teacups – the participants usually make some good points but the thread rapidly peters out since both the detractors and defenders typically have no real experience or evidence to offer either way. So you either believe quant methods would be a good idea to use or you don’t. With Lund we have a more informed subject who understands the benefits and limits of a sophisticated risk methodology, and has experience with its use in practice for both projects and policy-setting.

Know Your Decision-Makers

After a brief introduction to PRA, Lund begins by providing some anecdotal quotes and reasoning for PRA being passed over in practice.

People would rather live with a problem that they cannot solve than accept a solution that they cannot understand.

Decision-makers are more comfortable with what they are already using. As I was once told by a Corps manager, “I don’t trust anything that comes from a computer or from a Ph.D.”

“Dream on! Hardly anyone in decision-making authority will ever be able to understand this stuff.”

PRA is too hard to understand. While in theory PRA is transparent, in practical terms, PRA is not transparent at all to most people, especially lay decision makers, without considerable investments of time and effort.

So the first barrier is the lack of transparency in PRA to the untrained, who will often be the decision-makers. There is an assumption here that risk support for decisions under uncertainty can be provided in the form of concise, transparent and correct recommendations – and PRA is not giving decision makers that type of output. But I think in at least some cases this expectation is unreasonable. For some decisions there will be a certain amount of inherent complexity and uncertainty which cannot be winnowed away for the convenience of presentation. I am not sure, for example, to what extent the risks associated with a major IT infrastructure outsourcing can be made transparent to non-specialists.

The next few comments from Lund are quite telling.

People who achieve decision-making positions typically do so based on intuitive and social skills and not detailed PRA analysis skills.

Most decisions are not driven by objectives included in PRA. Decision-makers are elected or appointed. Being re-elected can be more important than being technically correct on a particular issue. Empirical demonstration of good decisions from PRA is often unavailable during a person’s career.

So decision-makers are usually not made decision-makers based on their analytical skills, and what motivates such people may well be outside the scope of what PRA considers “useful” decision criteria. Developing a methodology tailored to solving risk problems, in isolation from the intended decision-making audience, is counter-productive.

And here is the paradox as I see it

A poorly-presented or poorly-understood PRA can raise public controversy and reduce the transparency and credibility of public decisions. These difficulties are more likely for novel and controversial decisions (the same sorts of problems where PRA should be at its most rigorous).

So for complex decisions that potentially have the greatest impact in terms of costs and/or reputation, in exactly the circumstances where a thorough risk assessment is required, transparency rather than rigour is the order of the day.

Process Reliability

Lund notes that PRA involves a sequence of steps that must each succeed to produce a reliable result. Those steps are problem formulation, accurate solution to the problem, correct interpretation of the results, and then proper communication of the results to stakeholders or decision-makers. In summary then we have four steps: formulation, solution, interpretation and communication. He asks

What is the probability that a typical consultant, agency engineer, lay decision-maker, or even a water resources engineering professor will accurately formulate, calculate, interpret, or understand a PRA problem?

He makes the simple assumption that the probability of each step succeeding is independent, which he justifies by noting that the steps are often segregated in large organizations. In any case, he presents the following graph, which plots step (component) success against overall success.

[graph: overall PRA success as a function of per-step success]

Lund describes this as a sobering plot, since it shows that even with 93% success at each step, the final PRA succeeds only 75% of the time. When the step success is only 80%, the PRA success drops to just 41% (not worth doing). We should not take the graph as an accurate plot, but rather as showing the perhaps non-intuitive relation between step (component) success and overall success.
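
The arithmetic behind the plot is simple enough to check in a couple of lines – a minimal sketch, assuming four equally reliable, independent steps as Lund does:

```python
# Overall PRA reliability when formulation, solution, interpretation and
# communication must all independently succeed: overall = p**4.
for p in [0.99, 0.93, 0.90, 0.80]:
    print(f"per-step success {p:.0%} -> overall PRA success {p**4:.0%}")
```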

A Partial PRA Example

Lund also describes an interesting example of a partial PRA, where deriving a range of solutions likely to contain the optimal solution to support decision-making is just as helpful as finding the exact optimal solution. The problem he considers is straightforward: given an area of land that has a fixed damage potential D, what is the risk-based optimal height of a levee (barrier or dyke) to protect the land which minimizes expected annual costs? The graph below plots the annual cost outcomes across a wide range of options.

[graph: total annual cost, damage cost and recurrence period vs. levee height]

There are three axes to consider – one horizontal (the levee height), a left vertical (annual cost) and a right vertical (recurrence period). Considering the left vertical at a zero height levee (that is, no levee), total annual costs are about $850 million – the best part of a billion dollars in damage if the risk is left unaddressed. Considering the right vertical, for a 20m levee, costs are dominated by maintaining the levee, and water levels exceeding the levee height (called an overtopping event) are expected less than once per thousand years.

The recurrence period states that the water levels reaching a given height H will be a 1-in-T year event, which can also be interpreted as the probability of the water level reaching H in one year is 1/T. For a levee of less than 6m in height there is no material difference between the total cost and the cost of damage, which we can interpret as small levees being cheap and an overtopping event likely.

At 8m - 10m we start to see a separation between the total and damage cost curves, so that the likelihood of an overtopping event is decreasing and the levee cost increasing. At 14m, levee costs are dominant and the expected annual damage from overtopping seems marginal. In fact, the optimal solution is a levee of height 14.5m, yielding a recurrence period for overtopping of 102 years. Varying the levee height by 1m around the optimal value (either up or down) gives a range of $65.6 - $66.8 million for total annual costs. Lund draws some excellent conclusions from this example

a) Identifying the range of promising solutions which are probably robust to most estimation errors,

b) Indicating that within this range a variety of additional non-economic objectives might be economically accommodated, and

c) Providing a basis for policy-making which avoids under-protecting or over-protecting an area, but which can be somewhat flexible.
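
To make the shape of this trade-off concrete, here is a toy recreation of the levee optimisation in Python. The cost functions and all parameter values are invented for illustration – they are not Lund's data – but the grid search exhibits the same behaviour: a clear optimum sitting in a usefully flat neighbourhood.

```python
import math

# Toy levee optimisation: choose the height H minimising
#   total annual cost = levee cost + D * P(overtopping in a year | H)
# where the recurrence period is modelled as T(H) = exp(H / H0).
# All parameter values below are assumptions, not Lund's data.
D = 850e6                # damage potential if overtopped ($)
COST_PER_M = 2e6         # assumed annualised levee cost per metre of height
H0 = 2.9                 # assumed scale of the recurrence period

def total_annual_cost(h):
    p_overtop = math.exp(-h / H0)      # = 1 / T(H)
    return COST_PER_M * h + D * p_overtop

heights = [h / 10 for h in range(0, 201)]   # 0.0m to 20.0m in 0.1m steps
best = min(heights, key=total_annual_cost)
print(f"optimal height ~ {best:.1f}m, "
      f"recurrence ~ {math.exp(best / H0):.0f} years")
for h in (best - 1, best, best + 1):        # note the flat neighbourhood
    print(f"  H = {h:4.1f}m -> total annual cost ${total_annual_cost(h)/1e6:.1f}M")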

I think that this is exactly the type of risk support for decision-making that we should be aiming for in IT Risk management.

Last Remarks

The paper by Professor Lund is required reading at only 8 pages. PRA, he surmises, can be sub-optimal when it has high costs, a potentially low probability of success, or inconclusive results. His final recommendation is to reserve its use for situations involving very large expenditures or very large consequences – large enough to justify the kinds of expenses needed for a PRA to be reliable. Note that he does not doubt that PRA can be reliable, but you really have to pay for it. In IT risk management I think we have more to learn from Cape Canaveral and Chernobyl than from Wall Street.

Sunday, May 3, 2009

The $28,000 Question: Project vs. Production Risk

The average cost of an American wedding in 2007 was $28,000. Jeremiah Grossman recently posted that for the same money you could fix the critical vulnerabilities lurking at your website.

In his experience the average number of serious flaws per website is 7, each of which will take an average of 40 hours to fix – a figure confirmed by a 1,000-strong Twitter poll. Then, assuming a programming cost of $100/hour, you arrive at the figure of

$28,000 = 7 x 40 x $100

in “outstanding insecure software debt” per website. Of course there will be sites that are in much worse shape. As Grossman observes, this figure is not very high, and he asks whether this estimate really supports the implementation of a costly secure software development life cycle?

I think that the key point here is to distinguish between project risks and production risks. A project manager (PM) is concerned naturally with project risks, whose impact can be broadly classified as increased costs, delivery delays and reduced functionality. If we express a risk as a threat, vulnerability and an impact, then for the PM impacts reduce to cost overruns, time overruns and functionality “underruns” (plus combinations thereof). In general, expending time and resources to identify and fix potential security vulnerabilities is not effective in the PM’s risk model, since the vulnerabilities are unlikely to impact required functionality. Software with significant security vulnerabilities may function perfectly well, right up to, and including, the point of exploitation. As such, security vulnerabilities are not high on the risk radar of the PM.

When we move to the production risk model then potential impacts change dramatically, which for web applications, Grossman lists as

… down time, financial fraud, loss of visitor traffic and sales when search engines blacklist the site, recovery efforts, increased support call volume, FTC and payment card industry fines, headlines tarnishing trust in the brand, and so on are typical. Of course this assumes the organization survives at all, which has not always been the case.

The “meaningful” impact costs are therefore situated in the production risk model rather than the project risk model. A source of misunderstanding (and possibly friction) between security and project people is the difference in risk models or outlooks, since most security people assume the view of production risks – it is their role in fact. When Marcus Ranum recently remarked

I don’t know a single senior security practitioner who has not, at some point or other, had to defend an estimated likelihood of a bad thing happening against an estimated business benefit.

I believe that he was talking about the dichotomy between project and production risk. So, returning to Grossman’s original issue, the $28,000 to fix web vulnerabilities does not support the deployment of a secure SDL in the project risk model, but it makes much better sense in the production risk model.


Monday, March 2, 2009

The Wisdom of a Random Crowd of One

(This is a repost as the old link stopped working)

There was an excellent recent post on the RiskAnalysis.Is blog reviewing a debate between security gurus Bruce Schneier and Marcus Ranum on the topic of risk management. The post summarized the debate as something of a stalemate, ending in agreement that the lack of data is the root cause of the unsatisfactory state of IT risk management. The post goes on to make some further good points about risk and data, which deserve a post of their own to describe and ponder. Here I will make a few points about data, models and analysis.

Data is a representation of the real world, observable but in general difficult to understand. The model is a simplification of reality that can be bootstrapped from the data. Finally, analysis is the set of tools and techniques that extract meaning from the model, which hopefully allows us to make material statements about the real world. Data, Model, Analysis, Meaning.

Let's take a look at how the famous PageRank algorithm creates meaning from data via analysis of a model. We can all really learn something here.

The Wisdom of a Random Crowd of One

The hero of the PageRank story is an anonymous and robotic random surfer. He selects a random (arbitrary) starting page on the internet, looks at links on that page, and then selects one to follow at random (each link is equally likely to be selected). On the new page, he again looks over the links and surfs along a random link. He happily continues following links in this fashion. However, every now and again, the random surfer decides to jump to a totally random page where he then follows random links once again. If we could stand back and watch our random surfer, we would see him follow a series of random links, then teleport to another part of the internet, follow another series of links, teleport, follow links, teleport, follow links, and so on, ad infinitum.

Let's assume that as our random surfer performs this mix of random linking and teleporting, he also takes the time to cast a vote of importance for each page he visits. So if he visits a page 10 times, then the random surfer allocates 10 votes to that page. Surprisingly, the PageRank metric is directly derived from the relative sizes of the page votes cast by this (infinite) random surfer process.

This seems deeply counterintuitive. Why would we expect the surfing habits of a random process to yield a useful guideline to the importance of pages on the internet? While the surfing habits of people may be time consuming, and sometimes downright wasteful, we probably all think of ourselves as more than random-clicking automatons. However the proof of the pudding is in the searching, and Google has 70% of the search market. So apparently when all of the erratic meanderings of the random surfer are aggregated over a sufficiently long period, they do in fact provide a practical measure of internet page importance. We cannot explain this phenomenon any better than by simply labelling it as the wisdom of a random crowd of one.
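
Before any matrix machinery, the surfer can simply be simulated and his votes counted. A minimal sketch, on an assumed four-page toy web – the vote fractions it prints approximate the PageRank scores:

```python
import random

# A literal random surfer: follow a random link with probability d,
# otherwise teleport to a random page, and tally votes per page visited.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}   # assumed toy web
pages, d = list(web), 0.85
votes = {p: 0 for p in web}

page = random.choice(pages)
for _ in range(1_000_000):
    votes[page] += 1
    if web[page] and random.random() < d:
        page = random.choice(web[page])       # follow a random link
    else:
        page = random.choice(pages)           # teleport
total = sum(votes.values())
for p in sorted(votes, key=votes.get, reverse=True):
    print(f"{p}: {votes[p] / total:.3f}")
```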

Breathing Life into the Random Surfer

The data that can be gathered relatively easily by web crawlers are the set of links on a given page and the set of pages they point to. Let's assume that there are M pages currently on the internet, where M is several billion or so. We can arrange the link information into an M x M matrix P = [ Pij ] where Pij is the probability that page Pi links to page Pj (Pij is just the number of links on Pi to Pj divided by the total number of links on Pi).

The matrix P is called stochastic since the sum of each row is 1, which simply means that any page must link to somewhere (if Pi has no links then it links to itself with probability 1). So P represents the probability of surfing (linking) from Pi to Pj in one click by a random surfer. The nice property of P is that P^2 = P*P gives the probability that Pj can be reached from Pi in two clicks by the random surfer. And in general P^N gives the probability that the random surfer ends up on page Pj after N clicks, starting from page Pi.

Can we say anything about P^N as N becomes large? Well if P is ergodic (defined below) then there will exist a probability vector

L = (p1, p2, ..., pM)

such that as N becomes large then

P^N = (L, L, ..., L)^t

This says that for large N, the rows of P^N are all tending to the common distribution L. So no matter what page Pi the random surfer starts surfing from, his long run page visiting behaviour is described by L. We learn quite a bit about the random surfer from L.

As we said above, the long run probability vector L only exists for matrices that are ergodic. Ergodic matrices are described by 3 properties: they are finite, irreducible, and aperiodic. Our matrix P is large but certainly finite. Two pages Pi and Pj are said to communicate if it is possible to reach Pj by following a series of links beginning at page Pi. The matrix P is irreducible if all pairs of pages communicate. But this is clearly not the case, since some pages have no links for example (so-called dangling pages). If our random surfer hits such a page then he gets stuck, and we don't get irreducibility and we don't get L.

To get the random surfer up and surfing again we make the following adjustment to P. Recall that we have M pages and let R be the M x M matrix for which each entry is 1/M . That is, R models the ultimate random surfer who can jump from any page to any page in one click. Let d be a value less than one and create a new matrix G (the Google matrix) where

G = d*P + (1-d)*R

That is, G is a combination of P (real link data) and R (random link data). Our random surfer then follows links in P with probability d or jumps (teleports) to a totally random page with probability (1-d). The value of d will be something like 0.85.

It should be easy to see that G is irreducible since R enables any two pages to communicate in one click. Without going into details, G is also aperiodic since it is possible for a page to link to itself (which is possible in P as well). So G is ergodic and we can in theory compute the long run page distribution L of the random surfer.

So now that we know L exists for G, it remains to compute it. We have not as yet considered that the number of pages M is several billion and growing. So a direct representation of G as an M x M matrix would require storage on the order of 10^(18), or in the exabyte range (giga, tera, peta, exa). Luckily most pages are likely to have only a few links (say less than 20) and we can represent G using lists which will bring us back into the gigabyte range.

Computing L from G is a large but tractable computation. L is an eigenvector of G and there is an iterative algorithm for computing L from G called the power method. The power method begins with an approximation for L and improves on each iteration. The rate of convergence to the true value of L is geometrically fast in terms of the parameter d. Therefore we can compute the long run behaviour of our random surfer.
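
Putting the pieces together on the same assumed four-page toy web from the sketch above: build P from the link data, mix in R to form the Google matrix G, and iterate the power method until L stops changing. A sketch, not Google's production code, obviously:

```python
import numpy as np

# Power method for the long run distribution L, where G = d*P + (1-d)*R.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # assumed toy link data
M, d = 4, 0.85

P = np.zeros((M, M))
for i, outs in links.items():
    if outs:
        P[i, outs] = 1.0 / len(outs)           # uniform over outgoing links
    else:
        P[i, i] = 1.0                          # dangling page links to itself
G = d * P + (1 - d) * np.full((M, M), 1.0 / M)

L = np.full(M, 1.0 / M)                        # initial guess
for _ in range(200):
    nxt = L @ G                                # one more click
    if np.abs(nxt - L).sum() < 1e-10:
        break
    L = nxt
print(L)                                       # long run page probabilities
```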

The diagram below (available from Wikipedia) shows the PageRank analysis for a simple collection of pages. The arrows represent the link data and the pages are drawn in size relative to their PageRank importance. If we divided the number on each page by 100 this would be our long run probability vector L.

[diagram: PageRank example from Wikipedia]

What did we learn?

Recall that at the beginning of the post I stated that we need to get beyond pining for data and start to think in terms of data, models and analysis (then meaning). If we now look at the PageRank algorithm we can break it down into

  • Data: Raw page linking represented in P
  • Model: The Google matrix G = d*P + (1-d)*R, a random perturbation of P
  • Analysis: The long run probabilities L of the random surfer.

It could be said that PageRank is one part brilliance and two parts daring. It is not obvious at all that L would produce informative page rankings. However the point is now moot and the wisdom of a random crowd of one has prevailed.

The absence of data has been a scapegoat for years in IT Risk. The real issue is that we don't know what data we want, we don't know what we would do with the data, and we don't know what results the data should produce. These are 3 serious strikes. It seems that everybody would just feel better if we had more data, but few people seem to know (or say) why. We are suffering from a severe and prolonged case of data envy.

In IT Risk we are stubbornly waiting for a data set that is self-modelling, self-analysing and self-explaining. We wish to bypass the modelling and analysis steps, hoping that meaning can be simply read off from the data itself. As if data were like one of those "Advance to Go" cards in Monopoly where we can skip over most of the board and just collect our $200. The problem is that we keep drawing cards that direct us to "Go Back Three spaces" or "Go to Jail".

Wednesday, February 4, 2009

Financial Cyber Risk Guide from ANSI

In October last year ANSI released a new guide addressing the financial impact of cyber risks. From the title you might expect lengthy calculations for costing cyber risks, but in fact the document is largely a set of questions designed to create a dialogue around cyber risks. This is not a consolation prize. I have written a short summary of the document which you can read on Scribd below. You can also read a quick review from the Security4all blog.

ANSI approach to the financial impact of cyber risk

Monday, January 12, 2009

Some books on Scribd

As I mentioned in my last post, there is a lot of very interesting and detailed content of all types being uploaded to Scribd. According to Wikipedia,

Scribd is a document sharing website. It houses 'more than 2 million documents' and 'drew more than 21 million unique visitors in May 2008, little more than a year after launching, and claims 1.5 million registered users.' The site was initially funded with $12,000 funding from Y Combinator, but has since received over $3.7 million from Redpoint Ventures and The Kinsey Hills Group.

You can even find whole books on the site. Here are some interesting documents that I found from an hour or so of searching

I think I will drop my Safari account as I now have enough reading for far more than the foreseeable future. I also uploaded a paper that I co-wrote on Data Centric Security
A Data Centric Security Model


Wednesday, December 3, 2008

Not One in a Million, but a Million and One

The summer Olympics took place in August this year, hosted by both Beijing and Hong Kong. Every event has its own group of dedicated followers who are prepared to miss sleep and discuss their sports heroes endlessly – whether it is a freestyle swimmer, a javelin thrower, or a marathon runner. As always, one of the premier events was the 100 meter sprint, where athletes compete for the title of the fastest man or woman in the world.

Top athletes will cover the 100 meter distance in less than 10 seconds, meaning that they are travelling at an average speed of over 10 meters per second. Unless you have some experience in sprinting you may not appreciate this feat. Imagine what would happen if you and a few of your work colleagues were to race an Olympic athlete. Well, it is likely that the results would be quite embarrassing. The Olympian would probably finish between two and five seconds – that's 20 to 50 meters – ahead of the non-Olympian competitors. And if you raced later that day, the next day, the next week, and the next month, the result would always be the same. If we think of sprinting as a field of expertise, then it is simple to distinguish the experts from the non-experts. Expertise is easily demonstrated and recognized in many fields. Piano playing, ballet and cooking are examples. But there is one field where the track record of many so-called experts is quite dismal, and that is in the area of decision making.

We need look no further than information technology (IT) for predictions and decisions that have turned out to be spectacularly wrong. In 1943 Thomas Watson, the chairman and founder of IBM, thought that there was a world market for about 5 computers (they were much bigger back then). About thirty years later, Ken Olsen, then head of DEC, could not see why anyone would want a computer in his home. And more recently, Tim Berners-Lee spent several years trying to convince managers at CERN (the European Center for Nuclear Research) that his HTTP protocol was a good thing (he later went on to invent the World Wide Web). Apparently there is a shortage of Olympian IT decision makers.

In a recent book, The Wisdom of Crowds, author James Surowiecki examines a collection of problems that seem better suited to solution by many non-experts than by a few experts. His book opens with an anecdote about Francis Galton, a famous British scientist, as he strolled through a country fair in 1906.

Galton came upon a contest where people were asked to guess the weight of an ox on display. Around 800 people tried their luck, paying a sixpence to guess the ox’s weight in return for the chance of winning a prize. After the contest was over, Galton decided to perform an impromptu experiment—take all the submitted guesses and see how close the average of these answers was to the true weight of the ox. Galton thought surely that the average guessed weight must be far from the true weight since so many people of varied backgrounds and abilities (a general crowd) had submitted guesses.

But to Galton’s surprise, the true weight of the ox was 1,198 pounds and the average of the guesses was 1,197 pounds. Thus a crowd of people at a country fair had collectively determined the weight of the ox to within one pound, or less than half a kilogram.
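
The effect is easy to reproduce in a few lines, under an assumed noise model in which each individual guess is wildly off but unbiased:

```python
import random

# 800 fairground guesses around the true ox weight of 1198 lb, each with an
# assumed individual error of around 200 lb.
truth, n = 1198, 800
guesses = [random.gauss(truth, 200) for _ in range(n)]
avg = sum(guesses) / n
print(f"crowd average: {avg:.0f} lb (off by {abs(avg - truth):.0f} lb)")
```

The catch, of course, is that the individual errors must be independent and roughly unbiased – a crowd that errs systematically will average confidently to the wrong answer.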

The book goes on to discuss which types of problems can be effectively solved by crowds, and under what conditions a crowd can be expected to produce a good solution. Risk management is about making decisions today that will protect us from the uncertainty of the future. We are not looking for one in a million (the expert) but rather a million and one (the power of many). The recent subprime debacle has highlighted the shortcomings of quantitative models. Risk scenarios and rankings produced through a consensus process involving many people (a crowd) are likely to produce more meaningful results.

Your involvement is both necessary and critical.

Monday, June 2, 2008

Goodbye Yellow Brick Road


In 2003 the Computer Research Association sponsored a workshop on the Grand Research Challenges in Information Security & Assurance. The by-invitation-only event brought together 50 scientists, educators, business people, futurists, and others who have some vision and understanding of the big challenges (and accompanying advances) that should shape the research agenda in this field over the next few decades. The final report listed 4 main challenges worthy of sustained resourcing and effort:
  1. Eliminate epidemic-style attacks within the next decade
  2. Develop tools and principles that allow construction of secure large-scale systems
  3. Give end-users security controls they can understand and privacy they can control for the dynamic, pervasive computing environments of the future
  4. Develop quantitative information-systems risk management to be at least as good as quantitative financial risk management within the next decade.

In the 4th challenge, security (risk) professionals are being asked to follow the yellow brick road to the emerald city of quantitative financial risk management (QFRM) and the wizards therein. A recent article from a May issue of the Economist examines the state of QFRM in light of the subprime debacle, highlighting the $30 billion write-down at UBS (Used to Be Smart) as the (sub)prime example of flawed risk management. The outlook in the emerald city is professionally gloomy.

One of the main quantitative culprits identified by the Economist is Value-at-Risk, usually written as the contraction VaR (presumably to distinguish it from VAR and Var, long-standing contractions for the variance). VaR is the centrepiece of many QFRM toolboxes, being honoured with inclusion in the Basel II guidelines for calculating reserve capital. But the subprime debacle has highlighted one of the well-known weaknesses of VaR – that it is not strong at predicting the low-probability/high-impact events that are attendant to catastrophe.

VaR is essentially a simple concept supported by arbitrarily complex modelling (see this paper from the upcoming WEIS 2008 conference for VaR applied to information security). Let A be an asset for which we may realise a loss over a defined time period T. Given a level of significance a, the VaR of A is the maximum loss L that will occur over the time period T with probability 1 - a. So if A is a stock of interest over the next T = 100 days, and we fix a to be 0.01, then the VaR of A is the maximum loss L that will occur over the next 100 days with probability 0.99 (or 99% of the time). The interpretation here is that the losses from A will be at most L on 99 out of 100 days of trading.
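
As a minimal sketch of the calculation – with an assumed normal return model and invented parameters, which is precisely the kind of modelling choice at issue:

```python
import numpy as np

# 99% one-day VaR of a $1M position, read off simulated daily returns.
# The return distribution and its parameters are assumptions.
rng = np.random.default_rng(1)
position = 1_000_000
returns = rng.normal(0.0005, 0.02, size=100_000)   # assumed mean, volatility
losses = -position * returns

var_99 = np.percentile(losses, 99)     # the loss exceeded on ~1 day in 100
print(f"99% one-day VaR: ${var_99:,.0f}")
print(f"worst simulated day: ${losses.max():,.0f}")  # outside VaR's remit
```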

But what about that other one day of trading not covered? The magnitude of the loss on that rogue day is outside the scope of the VaR model, and need not be proportional to the loss bound predicted by VaR. In fact, VaR is designed to answer questions of the form "Within reason, how bad can things get?", which seem very sensible until we acknowledge that the subprime debacle was not "within reason". As the Economist observes, after a few years of bootstrapping and estimation, VaR models are transferred onto historical data and settle down to predicting a stable future from a stable past, leading to the conceit that risks can be quantified and regulated.

One pyrrhic benefit from the subprime debacle is that VaR models can now be recalibrated with catastrophic data sets, and should therefore produce better predictions. The Economist notes a new interest in non-statistical models based on enumerating risk scenarios that describe what could go wrong, and then thinking through the consequences of the scenario crystallizing. Scenario generation is typically a people-intensive activity, facilitated through brainstorming and workshops - not the forte of quants. Nonetheless scenario-driven risk analysis (SDRA) has the ability to uncover root causes and dependencies that may be absent or insufficiently weighted in quantitative models. On the other hand, the SDRA may fail to generate an exhaustive set of material scenarios, and more mundanely, poorly facilitated sessions can lead to futile bickering over rating the significance of scenarios.

Regardless of the model used, the Economist notes that risk management is becoming less tractable due to complexities and dependencies. Partitioning risks into those related to credit, markets and liquidity is no longer sufficient, since the risk inherent in some of the subprime financial products did not respect these organisational boundaries. In short, we have entanglement. Further, obtaining aggregate risk positions is becoming more difficult, since some departments still maintain their risk exposure in desktop Excel models, and over-the-counter dealings that are not formally traded also contribute to uncertain aggregate positions. For shareholders, and even regulators, it is very difficult to unwind and assess the risk exposure of a company.

What conclusions might we draw for IT (Security) risk management? Rich Bejtlich has commented on the Economist article, drawing direct comparisons between the difficulties of risk management in financial environments and those in IT environments. The good news is that we in IT Risk should no longer feel compelled to wed our futures to the QFRM yellow brick road, and perhaps we are better served by SDRA. We can also stop beating ourselves up on the point that the weakness of IT Risk is the absence of data – the real weakness is poor modelling, and the decisions based on the output of such models. The Computer Research Association grand challenges of 2003 may be just too grand, and in fact unnecessary.