Chicago Hadoop in Finance - Ted Dunning

DESCRIPTION

Talk about what scalability really means in terms of interacting processes and statistics of growth by Ted Dunning

Page 1: Chicago Hadoop in Finance - Ted Dunning

1©MapR Technologies - Confidential

Scalability in Hadoop and Similar Systems

Big is the next big thing

Big data and Hadoop are exploding

Companies are being funded

Books are being written

Applications sprouting up everywhere

Slow Motion Explosion

Hadoop Explosion

Why Now?

But Moore’s law has applied for a long time

Why is Hadoop exploding now?

Why not 10 years ago?

Why not 20?

9/18/12

Size Matters, but …

If it were just availability of data then existing big companies would adopt big data technology first

They didn’t

Or Maybe Cost

If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte

They didn’t

Backwards adoption

Under almost any threshold argument startups would not adopt big data technology first

They did

Everywhere at Once?

Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small

Why?

The Conventional Answer

More data is being produced more quickly

Data sizes are bigger than even a very large computer can hold

Cost to create and store continues to decrease

BUSTED!

Analytics Scaling Laws

Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns

The key to net value is how costs scale
– Old school: exponential scaling
– Big data: linear scaling, low constant

Cost/performance has changed radically
– IF you can use many commodity boxes
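The net value argument on this slide can be sketched numerically. The curves below are illustrative assumptions, not the ones behind the talk's figures: `value` is a saturating curve standing in for the 80-20 rule, `cost_exponential` stands in for old-school scaling, and `cost_linear` for linear scaling with a low constant. All constants are made up.

```python
import math

def value(effort):
    # Assumed 80-20 style value curve: steep early gains, then saturation.
    return 100.0 * (1.0 - math.exp(-effort))

def cost_exponential(effort):
    # Assumed "old school" cost: grows exponentially with effort.
    return math.exp(0.7 * effort)

def cost_linear(effort):
    # Assumed "big data" cost: linear in effort with a low constant.
    return 2.0 * effort

def best_effort(cost, efforts):
    # Effort level on the grid that maximizes net value = value - cost.
    return max(efforts, key=lambda e: value(e) - cost(e))

efforts = [i / 100 for i in range(2001)]  # grid over 0.00 .. 20.00
optimum_old = best_effort(cost_exponential, efforts)
optimum_new = best_effort(cost_linear, efforts)

print(f"exponential cost: best effort {optimum_old:.2f}")
print(f"linear cost:      best effort {optimum_new:.2f}")
```

Under these assumed curves the net value peaks at a modest effort level in both cases, and the peak moves to a higher effort level when cost is linear. Different constants will move the peaks, so treat the numbers as a toy.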

We knew that

We should have known that

We didn’t know that!

You’re kidding, people do that?

Anybody with eyes

Intern with a spreadsheet

In-house analytics

Industry-wide data consortium

NSA, non-proliferation

Net value optimum has a sharp peak well before maximum effort

But scaling laws are changing both slope and shape

More than just a little

They are changing a LOT!

Initially, linear cost scaling actually makes things worse

A tipping point is reached and things change radically …

Pre-requisites for Tipping

To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail

Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
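As a rough illustration of what "scale out on hardware that can and will fail" asks of an algorithm, here is a minimal sketch: the work is split into independent chunks, each chunk may fail at random, and a failed chunk is simply retried. The function names, the failure rate, and the sequential loop standing in for a cluster are all assumptions made for the example; this is not how Hadoop itself schedules tasks.

```python
import random
from collections import Counter

def flaky_count(chunk, rng, failure_rate=0.3):
    # Count words in one chunk; randomly raise to mimic a failing node.
    if rng.random() < failure_rate:
        raise RuntimeError("simulated node failure")
    return Counter(chunk)

def run_with_retries(chunks, rng, max_attempts=50):
    # Process each chunk independently and retry it until it succeeds.
    total = Counter()
    for chunk in chunks:
        for _ in range(max_attempts):
            try:
                total += flaky_count(chunk, rng)
                break
            except RuntimeError:
                continue
        else:
            raise RuntimeError("chunk failed on every attempt")
    return total

words = "big data is big and data is everywhere".split()
chunks = [words[i:i + 3] for i in range(0, len(words), 3)]
result = run_with_retries(chunks, random.Random(0))
print(result)
```

The retry works only because each chunk can be recomputed on its own and the partial counts merge by simple addition, which is the property the slide is asking algorithms to have.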

Yeah… but wait

The Standard Sort of Model

People talk about the law of large numbers as if it were …

Well, as if it were a law

It’s not …

It is a context and assumption dependent theorem

What if …

These assumptions are:

Changes have a
– stationary,
– independent,
– finite variance
distribution

What happens if these assumptions are wrong?

And which of them is really wrong?

For Example

End point has nice tractable distribution
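A small simulation can illustrate the tractable case. The sketch below assumes a simple walk of independent +1/-1 steps, which is an illustrative choice and not necessarily the process shown on the slide. Because the steps are stationary, independent, and of finite variance, the end points should spread out with a standard deviation close to the square root of the number of steps.

```python
import math
import random

def walk_endpoint(steps, rng):
    # Sum of independent +1/-1 steps: stationary, independent, finite variance.
    return sum(rng.choice((-1, 1)) for _ in range(steps))

rng = random.Random(42)
steps, walks = 400, 2000
endpoints = [walk_endpoint(steps, rng) for _ in range(walks)]

mean = sum(endpoints) / walks
std = math.sqrt(sum((x - mean) ** 2 for x in endpoints) / walks)

print(f"mean {mean:.2f}, std {std:.2f}, sqrt(steps) {math.sqrt(steps):.2f}")
```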

What if the Assumptions are Wrong?

Take the finite variance assumption as a simple example

Dropping it leads to Lévy stable distributions

Like the Cauchy distribution
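To see what losing finite variance does, the sketch below compares how much the sample mean varies across repeated trials for Gaussian and Cauchy draws. The helper names, sample sizes, and the use of the interquartile range as the spread measure are choices made for this example. For Gaussian data the spread of the mean shrinks as the sample grows; for Cauchy data the mean of n draws has the same distribution as a single draw, so more data does not help.

```python
import math
import random

def cauchy(rng):
    # Standard Cauchy draw via the inverse CDF; its variance is infinite.
    return math.tan(math.pi * (rng.random() - 0.5))

def gauss(rng):
    # Standard normal draw; finite variance, so the usual theorem applies.
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, n, trials, rng):
    # Interquartile range of the sample mean across repeated trials.
    means = sorted(sum(draw(rng) for _ in range(n)) / n for _ in range(trials))
    return means[(3 * trials) // 4] - means[trials // 4]

rng = random.Random(7)
trials = 400

gauss_small = iqr_of_sample_means(gauss, 10, trials, rng)
gauss_large = iqr_of_sample_means(gauss, 1000, trials, rng)
cauchy_small = iqr_of_sample_means(cauchy, 10, trials, rng)
cauchy_large = iqr_of_sample_means(cauchy, 1000, trials, rng)

print(f"Gaussian: n=10 -> {gauss_small:.3f}, n=1000 -> {gauss_large:.3f}")
print(f"Cauchy:   n=10 -> {cauchy_small:.3f}, n=1000 -> {cauchy_large:.3f}")
```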

Is it Really Different?

What About Real Life?

But is it Really Infinite Variance?

Or are there other kinds of phenomena that show this?

What about the independence assumption?

What if the supposedly independent components of the system communicate?

Like we do. Everyday. All the time.

Why the Difference?

Law of large numbers

Infinite variance

Interacting agents

Apologies and credit to Simon DeDeo, SFI

The space of all things that change

The space of interacting things

What Happens with Interactions

Social phenomena defeat the law of large numbers

Distributions are well modeled by “rich get richer” processes
– Pitman-Yor process, Indian buffet process

Limiting distributions are heavy tailed, power law

We see these distributions everywhere
– price of cotton in the 19th century
– word frequencies
– popularity of GitHub projects
– equity pricing and volumes
– sizes of cities
– popularity of web sites
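A "rich get richer" process is easy to simulate. The sketch below uses the standard seating rule for a Pitman-Yor process (the Chinese restaurant construction with a discount): a new arrival joins an existing group with weight equal to its size minus the discount, or starts a new group with weight equal to the concentration plus the discount times the number of groups. The parameter values, the seed, and the function name are arbitrary choices for illustration.

```python
import random

def pitman_yor_tables(customers, discount, concentration, rng):
    # Seat customers one at a time; returns the list of table sizes.
    sizes = []
    for _ in range(customers):
        # Existing table k is chosen in proportion to (size_k - discount);
        # a new table is opened in proportion to concentration + discount * K.
        weights = [s - discount for s in sizes]
        weights.append(concentration + discount * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return sizes

sizes = sorted(pitman_yor_tables(5000, 0.5, 1.0, random.Random(1)), reverse=True)
singletons = sum(1 for s in sizes if s == 1)
print(f"{len(sizes)} tables, largest {sizes[0]}, singletons {singletons}")
```

A run of this kind typically produces a few very large groups alongside many groups of size one, which is the heavy-tailed shape the slide describes.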

What are the Implications?

In a Nutshell

Scalability is much more important than we thought

Mashups are more important than we thought

Network effects are more important than we thought

Exploration is more important than we thought

Hadoop-style linear scaling must be mixed with ad hoc analysis

Thank You

whoami?

Ted Dunning
– @ted_dunning
– [email protected] (MapR distribution for Hadoop)
– [email protected] (Mahout, Hadoop, Lucene, Zookeeper, Drill)
– [email protected] (me)

More info:

http://www.mapr.com/company/events/hadoop-in-finance-2012