Finding Patterns in Data Breaches


Luther Martin

October 21, 2010

Overview

Attempt at humor

Getting in the right frame of mind to think about statistics

A reminder of some concepts from statistics

What we can learn from data breaches

What this tells us

Some generalizations that might or might not be accurate

System development lifecycle (SDLC)

Security development lifecycle (SDLC)

Estimating some numbers

What’s the probability of an exploitable vulnerability existing in your web server right now?

What’s the probability of your web server being hacked in the next 12 months?

If you don’t encrypt email, what’s the probability of it being intercepted and read on the Internet?

Too hard?

Some easier questions

What’s the current mortgage foreclosure rate?

What’s the current fraud loss rate in the US for payment (credit and debit) cards?

What’s the current charge-off rate in the US for credit card loans?

The foreclosure rate

Currently about 1 in 381 per month, or about 3 percent per year

http://www.realtytrac.com/trendcenter/
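The step from "1 in 381 per month" to "about 3 percent per year" can be checked with a few lines of arithmetic. This is only a sketch: the compounding version assumes every month carries the same independent chance, which is an assumption made here and not something the source data states.

```python
# Convert a monthly rate of 1 in 381 into an approximate annual rate.
monthly_rate = 1 / 381

# Simple approximation: twelve months at the same monthly rate.
annual_simple = 12 * monthly_rate

# Alternative (assumption): each month is an independent chance,
# so the annual rate is one minus the chance of avoiding it 12 times.
annual_compound = 1 - (1 - monthly_rate) ** 12

print(f"simple:   {annual_simple:.2%}")
print(f"compound: {annual_compound:.2%}")
```

Both versions land close to 3 percent, so the rounding on the slide does not depend on which one is used.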

Payment card fraud loss rate

http://www.kansascityfed.org/Publicat/Econrev/pdf/10q2Sullivan.pdf

The charge-off rate for credit cards

http://www.federalreserve.gov/releases/chargeoff/

More about statistics

We described each of these using only one number: an average

That’s not the whole story

The average person has less than 2 legs! (1.99… < 2)

Most people have an above-average number of legs!

Even more about statistics

It’s often useful to have a second number that tells how much variation we have in our data

Sets of data can both have the same average, but be very different

Same mean, different variance
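A tiny example of two data sets with the same average but very different spread. The numbers in `tight` and `wide` are made up purely for illustration.

```python
import statistics

# Two illustrative (made-up) data sets with the same mean.
tight = [48, 49, 50, 51, 52]
wide = [0, 25, 50, 75, 100]

for name, data in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)  # population variance
    print(f"{name}: mean={mean}, variance={variance}")
```

The mean alone cannot tell these two sets apart; the variance can.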

The normal distribution

The so-called normal distribution (“bell curve”) appears again and again in statistics

Many things end up with a normal distribution when you might not expect it

The Central Limit Theorem

If you add random values together you tend to get a normal distribution

Proof by picture:
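A small simulation can make the same point as a picture. This is a sketch, not a proof: it adds 12 uniform random values many times and checks that the sums behave like a bell curve. The seed, the choice of 12 values, and the sample count are arbitrary choices made here.

```python
import random
import statistics

rng = random.Random(2010)  # fixed seed so the run is repeatable


def sum_of_uniforms(count):
    """Add `count` independent uniform values between 0 and 1."""
    return sum(rng.random() for _ in range(count))


sums = [sum_of_uniforms(12) for _ in range(20_000)]
mean = statistics.fmean(sums)
stdev = statistics.pstdev(sums)

# For a normal distribution, about 68 percent of values fall
# within one standard deviation of the mean.
within_one = sum(abs(s - mean) <= stdev for s in sums) / len(sums)

print(f"mean={mean:.2f} stdev={stdev:.2f} within one stdev={within_one:.1%}")
```

Each individual value is flat (uniform), yet the sums cluster around 6 in roughly the proportions a normal distribution predicts.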

Why a known distribution is useful

If we know that we have data that follows a particular probability distribution we can predict what we’ll see in the future with fairly good accuracy

If you flip a fair coin 100 times, then:

You’ll get about 50 Heads

There’s about a 73 percent chance of getting 45 to 55 Heads

There’s about a 2 percent chance of getting more than 60 Heads

What this doesn’t do is predict how any particular flip of the coin will turn out
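The coin-flip figures can be reproduced from the binomial distribution. The sketch below computes them exactly with the standard library; the helper name `binomial_pmf` is introduced here for illustration.

```python
from math import comb


def binomial_pmf(k, n=100, p=0.5):
    """Probability of exactly k Heads in n flips of a coin with P(Heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# 45 to 55 Heads, inclusive.
p_45_to_55 = sum(binomial_pmf(k) for k in range(45, 56))

# More than 60 Heads means 61 through 100.
p_more_than_60 = sum(binomial_pmf(k) for k in range(61, 101))

print(f"P(45 to 55 Heads)    = {p_45_to_55:.3f}")
print(f"P(more than 60 Heads) = {p_more_than_60:.3f}")
```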

One more review of math: logarithms

Logarithms are exponents

So if we have these numbers: 10, 100, 1,000, 10,000, … or $10^1, 10^2, 10^3, 10^4, \ldots$

Then their logarithms are 1, 2, 3, 4, …

Note that multiplying corresponds to adding exponents (logs): $10^2 \times 10^3 = 10^{2+3} = 10^5$
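The same rule, checked numerically with base-10 logarithms. A minimal sketch using only the standard library:

```python
import math

a, b = 100, 1_000

log_a = math.log10(a)            # exponent of 10 that gives 100
log_b = math.log10(b)            # exponent of 10 that gives 1,000
log_product = math.log10(a * b)  # exponent of 10 that gives 100,000

# Multiplying the numbers corresponds to adding their logarithms.
print(log_a, log_b, log_product)
print(math.isclose(log_a + log_b, log_product))
```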

Logarithms naturally occur in lots of ways

Human perception of sound (or light) is roughly proportional to the logarithm of the sound level rather than the sound level itself

If you double the sound pressure level, it doesn’t double how loud it sounds to us

Instead, you have to double the logarithm of the sound pressure level

That’s why decibels are used to measure sound levels, etc

So logarithms may be annoying but they’re also useful in some cases

Another use for logarithms

Logarithms are also a good way to handle big ranges in numbers

Radio: transmit kilowatts (1,000 Watts), receive milliwatts (0.001 Watts)

Hard to plot big ranges on one graph

Very small numbers look just like zero

Taking logarithms makes a big range easier to handle: 3 to -3 instead of 1,000 to 0.001
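A short sketch of the radio example: the transmit and receive powers differ by a factor of a million, but their base-10 logarithms span only 3 to -3. The variable names are chosen here for illustration.

```python
import math

transmit_watts = 1_000.0  # kilowatt-scale transmitter
receive_watts = 0.001     # milliwatt-scale received signal

log_transmit = math.log10(transmit_watts)
log_receive = math.log10(receive_watts)

print(f"raw range: {receive_watts} to {transmit_watts}")
print(f"log range: {log_receive:.0f} to {log_transmit:.0f}")
```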

What about data breaches?

The most comprehensive data is that maintained by the Open Security Foundation (www.datalossdb.org)

Currently has information on close to 3,000 data breaches

Probably the most useful source of information on data breaches

What patterns can we find in the OSF’s data?

Data breaches since 2006

[Figure: TotalAffected, in millions (axis 0 to 140), by breach date from 1/1/2006 to 5/1/2010. Labeled points: VA, TJX, HMRC, HPS, NARA]

Making the range of values smaller

[Figure: Log(TotalAffected) (axis 0 to 9) by breach date from 1/1/2006 to 4/1/2010]

Sort these values to get…

[Figure: sorted Log(TotalAffected) values (axis 0 to 9) plotted against position in the sorted list (axis 1 to 1613)]

The log of breach size matches a normal distribution very well

Fitted normal distribution for the base-10 log of breach size: mean 3.2, standard deviation 1.2

What does this tell us?

We may be able to understand the process that leads to data breaches

We may be able to predict some things about future data breaches

We may be able to find a good metric for industry-wide efforts to reduce data breaches

We really need comprehensive data to find patterns that might be there

Very small breaches are as important as very big ones

Understanding the process

Just like we get a normal distribution from adding several random values together, we get a lognormal distribution when we multiply several random values together

Multiplying corresponds to adding exponents (logs)

This suggests that what we see for data breaches may be explained by a layered model of security
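The multiplication idea can be illustrated with a simulation. This sketch multiplies 20 random factors together many times; the factor range (0.5 to 2.0), the count of 20, and the seed are arbitrary choices made here, not values from the breach data.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable


def product_of_factors(count):
    """Multiply `count` independent random factors between 0.5 and 2.0."""
    value = 1.0
    for _ in range(count):
        value *= rng.uniform(0.5, 2.0)
    return value


products = [product_of_factors(20) for _ in range(10_000)]
logs = [math.log10(p) for p in products]

# Raw products are heavily skewed: a few huge values pull the mean
# far above the median.
print("raw  mean, median:", statistics.fmean(products), statistics.median(products))

# The logs are roughly symmetric, as a normal distribution would be:
# mean and median nearly coincide.
print("log  mean, median:", statistics.fmean(logs), statistics.median(logs))
```

A few very large values dominate the raw products, much as a handful of very large breaches dominate the raw totals, while the logarithms form a symmetric bell shape.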

Abstract layered model of security

The general case: if we have

1. The security provided by two technologies when they’re both used is greater than or equal to the security of each of the components when they’re used by themselves

2. If two technologies are independent then the security provided by the two technologies when they’re used together is equal to the sum of the security provided by each of the technologies

3. The security provided by any technology is non-negative

It’s more than just data breaches

Note that this model, in which bypassing each layer of security multiplies the hacker’s success, doesn’t just apply to data breaches

It also applies to any other aspect of information security

When we learn how to quantify other types of security incidents we’ll probably find that the damage from them also follows a lognormal distribution

Then we have to have…

A measure of security that works that way has to essentially be a logarithm

Measuring security breaches in terms of logarithms may end up making more sense than measuring security breaches directly

We see it with data breaches

We’ll probably see it for other types of losses once we learn how to quantify those losses
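One way to see why a logarithm fits: if each layer is bypassed with some probability and the layers are independent, the bypass probabilities multiply, so a score defined as the negative logarithm of that probability adds. The sketch below assumes that definition. The function `security` and the probabilities `p_firewall` and `p_encryption` are hypothetical, chosen only for illustration.

```python
import math


def security(p_bypass):
    """Hypothetical security score: negative base-10 log of the bypass probability."""
    return -math.log10(p_bypass)


# Made-up bypass probabilities for two independent layers.
p_firewall = 0.1
p_encryption = 0.01

# Independent layers: an attacker must bypass both, so probabilities multiply.
p_both = p_firewall * p_encryption

s_firewall = security(p_firewall)
s_encryption = security(p_encryption)
s_both = security(p_both)

print(s_firewall, s_encryption, s_both)
# Scores add for independent layers, the combination is at least as
# strong as either layer alone, and no score is negative.
```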

Does this interpretation make sense?

Other places where the lognormal distribution appears:

The concentration of gold or uranium in ore deposits

The latency period of bacterial food poisoning

The age of the onset of Alzheimer's disease

The amount of air pollution in Los Angeles

The abundance of fish species

The size of ice crystals in ice cream

The number of words spoken in a telephone conversation

The length of sentences written by George Bernard Shaw or Gilbert K. Chesterton

http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf

What can we predict?

There’s about a 1 percent chance of any breach exposing 1 million or more records

There’s about a 0.1 percent chance of any breach exposing 10 million or more records

We can expect about 68 percent of breaches to expose between 100 and 25,000 records

We can expect about 95 percent of breaches to expose between 6 and 400,000 records

Etc.
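These figures follow from treating the base-10 log of breach size as normal with mean 3.2 and standard deviation 1.2. The sketch below redoes the arithmetic under that assumption with `statistics.NormalDist` (Python 3.8 or later); the variable names are introduced here.

```python
from statistics import NormalDist

# Assumed model: log10(breach size) is normal with these parameters.
MEAN, STDEV = 3.2, 1.2
log_size = NormalDist(mu=MEAN, sigma=STDEV)

# Chance of at least 1 million (10^6) or 10 million (10^7) records.
p_million = 1 - log_size.cdf(6)
p_ten_million = 1 - log_size.cdf(7)

# About 68 percent of a normal distribution lies within one standard
# deviation of the mean, and about 95 percent within two.
low_68, high_68 = 10 ** (MEAN - STDEV), 10 ** (MEAN + STDEV)
low_95, high_95 = 10 ** (MEAN - 2 * STDEV), 10 ** (MEAN + 2 * STDEV)

print(f"P(>= 1 million)  = {p_million:.2%}")
print(f"P(>= 10 million) = {p_ten_million:.2%}")
print(f"68% range: {low_68:,.0f} to {high_68:,.0f} records")
print(f"95% range: {low_95:,.0f} to {high_95:,.0f} records")
```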

What can we NOT predict?

How many data breaches we should expect to see in the next 12 months

Whether or not any particular business will suffer a data breach in the next 12 months

Whether or not your business will suffer a data breach in the next 12 months

Etc.

Other patterns: Benford’s law

Benford’s law tells us that the leading digits in data tend to not be evenly distributed

Probability of leading digit being n is $P(n) = \log_{10}(1 + 1/n)$

n     1     2     3     4     5     6     7     8     9
P(n)  0.30  0.18  0.12  0.10  0.08  0.07  0.06  0.05  0.05
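The table can be regenerated directly from the formula, taking the logarithm to be base 10. A minimal sketch:

```python
import math

# Benford probability for each possible leading digit 1 through 9.
benford = {n: math.log10(1 + 1 / n) for n in range(1, 10)}

for digit, probability in benford.items():
    print(digit, f"{probability:.3f}")

# The nine probabilities cover every case, so they sum to 1.
total = sum(benford.values())
print(f"total = {total:.3f}")
```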

Why Benford’s law might make sense

Consider what happens with exponential growth

Start with 1 and multiply by 1.1 at each step:

1, 1.10, 1.21, 1.33, 1.46, 1.61, 1.77, 1.95, 2.14, 2.36, 2.59, 2.85, 3.14, 3.45, 3.80, 4.18, 4.59, 5.05, 5.56, 6.12, 6.73, 7.40, 8.14, 8.95, 9.85, 10.83, …

Note that 1 is the most common leading digit, etc.
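Counting leading digits in this sequence makes the pattern concrete. The sketch below tallies the first 25 values (1 up to 9.85, just before the sequence passes 10); the helper `leading_digit` is introduced here for illustration.

```python
from collections import Counter


def leading_digit(x):
    """First significant digit of a positive number."""
    # Scientific notation always puts the first significant digit first.
    return int(f"{x:e}"[0])


# 1, 1.1, 1.21, ... : 25 steps of multiplying by 1.1.
values = [1.1**step for step in range(25)]
counts = Counter(leading_digit(v) for v in values)

for digit in range(1, 10):
    print(digit, counts[digit])
```

Growth by a fixed percentage spends more steps between 1 and 2 than between 8 and 9, so small leading digits show up more often.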

Benford’s law for breaches

[Figure: bar chart of leading-digit frequency (axis 0 to 0.35) for digits 1 to 9, comparing OSF Data with Benford's Law]

Other patterns

There are other patterns that we can find

But they’re really just ways to repackage the exponential growth idea

No real new ideas

Zipf’s law

Pareto’s principle

Zipf’s law

Zipf’s law:

Order the data from biggest to smallest

Then the total contribution from any entry is inversely proportional to its position in the table

Second entry is about 1/2 of the first one

Third entry is about 1/3 of the first one

The nth entry is about 1/n of the first one

$R^2 = 0.873$
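The 1/n rule is easy to turn into a quick estimate. In this sketch the size of the largest entry is a made-up placeholder, and `zipf_estimate` is a hypothetical helper name; real data will only follow the rule approximately.

```python
def zipf_estimate(largest, position):
    """Size Zipf's law predicts for the entry at a 1-based position."""
    return largest / position


largest = 1_000_000  # hypothetical size of the biggest entry

# Predicted sizes for the first five positions in the sorted table.
estimates = [zipf_estimate(largest, n) for n in range(1, 6)]

for position, size in enumerate(estimates, start=1):
    print(position, f"{size:,.0f}")
```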

Pareto’s principle

Sometimes known as the “80-20 rule”

Very similar to the others that we’ve mentioned

In general, k% of the population accounts for (100 - k)% of something, for some k between 50 and 100

For k = 80 we get the 80-20 rule

Empirically, most data seem to cluster around k being in the middle of this range

It’s yet another power law

Bottom line

It certainly looks like it’s possible to find some interesting structure in the data that’s available for data breaches

The size of data breaches seems to follow a very well defined pattern

We may see this same pattern in other parts of information security when we learn how to quantify other types of losses due to security breaches

We need lots of data to see the patterns in it

Data on small breaches is as important as data on the big ones

Practical implications (so what?)

Developing metrics

Developing ROI models

Pricing insurance

Are we winning or are hackers winning?

Any time when quantifying a loss is useful

Etc.

Some useful references

The OSF’s data breach database
http://datalossdb.org/

E. Limpert, W. Stahel and M. Abbt, “Lognormal Distributions across the Sciences: Keys and Clues”
http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf

The Voltage corporate blog
http://superconductor.voltage.com

CSO Magazine article on finding patterns in data breaches
http://www.csoonline.com/article/501584/data-breaches-patterns-and-their-implications
