1 ©MapR Technologies - Confidential Scalability in Hadoop and Similar Systems

Chicago finance-big-data

DESCRIPTION

Talk about what scalability really means in terms of interacting processes and statistics of growth

Page 1: Chicago finance-big-data

Scalability in Hadoop and Similar Systems

Page 2: Chicago finance-big-data

Big is the next big thing

Big data and Hadoop are exploding

Companies are being funded

Books are being written

Applications are sprouting up everywhere

Page 3: Chicago finance-big-data

Slow Motion Explosion

Page 4: Chicago finance-big-data

Hadoop Explosion

Page 5: Chicago finance-big-data

Why Now?

But Moore’s law has applied for a long time

Why is Hadoop exploding now?

Why not 10 years ago?

Why not 20?

9/18/12

Page 7: Chicago finance-big-data

Size Matters, but …

If it were just availability of data then existing big companies would adopt big data technology first

They didn’t

Page 9: Chicago finance-big-data

Or Maybe Cost

If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte

They didn’t

Page 11: Chicago finance-big-data

Backwards adoption

Under almost any threshold argument startups would not adopt big data technology first

They did

Page 13: Chicago finance-big-data

Everywhere at Once?

Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small

Why?

Page 14: Chicago finance-big-data

The Conventional Answer

More data is being produced more quickly

Data sizes are bigger than even a very large computer can hold

Cost to create and store continues to decrease

BUSTED!

Page 15: Chicago finance-big-data

Analytics Scaling Laws

Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns

The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant

Cost/performance has changed radically
– IF you can use many commodity boxes
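
The scaling argument can be sketched numerically. The following is a minimal illustration, not a model from the talk: the saturating value curve, the two cost curves, and every constant are assumptions chosen only to show how the net-value optimum moves when cost goes from exponential to linear in effort.

```python
import math

# All functional forms and constants here are illustrative assumptions,
# not figures taken from the slides.

def value(effort, v_max=100.0, scale=2.0):
    """Diminishing-returns value: big early gains, then a plateau (80-20 shape)."""
    return v_max * (1.0 - math.exp(-effort / scale))

def exponential_cost(effort, rate=0.6):
    """'Old school' cost that grows exponentially with effort."""
    return math.exp(rate * effort) - 1.0

def linear_cost(effort, per_unit=0.5):
    """'Big data' cost that grows linearly with a low constant."""
    return per_unit * effort

def best_effort(cost, max_effort=50.0, steps=5000):
    """Grid-search the effort level that maximizes value minus cost."""
    grid = [max_effort * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda e: value(e) - cost(e))

old_effort = best_effort(exponential_cost)
new_effort = best_effort(linear_cost)
old_net = value(old_effort) - exponential_cost(old_effort)
new_net = value(new_effort) - linear_cost(new_effort)

print(f"exponential cost: best effort {old_effort:.2f}, net value {old_net:.1f}")
print(f"linear cost:      best effort {new_effort:.2f}, net value {new_net:.1f}")
```

Under these assumed curves the optimum with linear cost sits at a higher effort level and captures more net value than the optimum with exponential cost. The exact numbers depend entirely on the chosen constants.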

Page 16: Chicago finance-big-data

We knew that

We should have known that

We didn’t know that!

You’re kidding, people do that?

Page 17: Chicago finance-big-data

Anybody with eyes

Intern with a spreadsheet

In-house analytics

Industry-wide data consortium

NSA, non-proliferation

Page 18: Chicago finance-big-data

Net value optimum has a sharp peak well before maximum effort

Page 19: Chicago finance-big-data

But scaling laws are changing both slope and shape

Page 20: Chicago finance-big-data

More than just a little

Page 21: Chicago finance-big-data

They are changing a LOT!

Page 26: Chicago finance-big-data

Initially, linear cost scaling actually makes things worse

A tipping point is reached and things change radically …

Page 27: Chicago finance-big-data

Pre-requisites for Tipping

To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail

Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare

Page 28: Chicago finance-big-data

Yeah… but wait

Page 29: Chicago finance-big-data

The Standard Sort of Model

People talk about the law of large numbers as if it were …

Well, as if it were a law

It’s not …

It is a context- and assumption-dependent theorem

Page 30: Chicago finance-big-data

What if …

These assumptions are:

Changes have a
– stationary,
– independent,
– finite variance
distribution

What happens if these assumptions are wrong?

And which of them is really wrong?

Page 31: Chicago finance-big-data

For Example

Page 32: Chicago finance-big-data

End point has nice tractable distribution
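
The figures for this example did not survive extraction. As a rough stand-in, here is a small simulation that assumes the example is a simple random walk with stationary, independent, finite-variance steps (fair +1/-1 coin flips; the step count and number of walks are arbitrary choices). Under those assumptions the end point should be close to normal, with standard deviation near the square root of the number of steps.

```python
import random
import statistics

def walk_end_point(n_steps, rng):
    """End point of a walk made of independent, identically distributed +/-1 steps."""
    return sum(rng.choice((-1, 1)) for _ in range(n_steps))

rng = random.Random(42)   # fixed seed so the run is repeatable
n_steps = 400             # assumed walk length
ends = [walk_end_point(n_steps, rng) for _ in range(1000)]

mean_end = statistics.fmean(ends)
std_end = statistics.pstdev(ends)
# Share of walks ending within one theoretical standard deviation, sqrt(n_steps).
within_one_sigma = sum(abs(e) <= n_steps ** 0.5 for e in ends) / len(ends)

print(f"mean end point:          {mean_end:.2f}")
print(f"std of end point:        {std_end:.2f} (theory: {n_steps ** 0.5:.2f})")
print(f"share within one sigma:  {within_one_sigma:.2f}")
```

The share within one sigma should land near the familiar two thirds of a normal distribution, slightly higher here because the end points are discrete.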

Page 33: Chicago finance-big-data

What if the Assumptions are Wrong?

Take the finite-variance assumption as a simple example

Dropping it leads to Lévy stable distributions

Like the Cauchy distribution
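
A quick way to see what infinite variance does is to compare sample means. This sketch (seed, sample sizes, and trial counts are arbitrary assumptions) draws standard Cauchy values through the inverse CDF and measures the spread of the sample mean with the interquartile range, since the variance itself is not defined for the Cauchy case.

```python
import math
import random
import statistics

def cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def gaussian(rng):
    """Standard normal draw, the finite-variance comparison case."""
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, n, trials, rng):
    """Interquartile range of the mean of n draws, over repeated trials."""
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(7)  # fixed seed so the run is repeatable
results = {}
for n in (10, 1000):
    gauss_iqr = iqr_of_sample_means(gaussian, n, 400, rng)
    cauchy_iqr = iqr_of_sample_means(cauchy, n, 400, rng)
    results[n] = (gauss_iqr, cauchy_iqr)
    print(f"n={n:5d}  gaussian IQR of mean: {gauss_iqr:.3f}  cauchy IQR of mean: {cauchy_iqr:.3f}")
```

The Gaussian spread shrinks roughly like one over the square root of n, while the Cauchy spread stays about the same no matter how many samples are averaged, because the mean of n standard Cauchy draws is itself standard Cauchy.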

Page 34: Chicago finance-big-data

Is it Really Different?

Page 36: Chicago finance-big-data

What About Real Life?

Page 38: Chicago finance-big-data

But is it Really Infinite Variance?

Or are there other kinds of phenomena that show this?

What about the independence assumption?

What if the supposedly independent components of the system communicate?

Like we do. Everyday. All the time.

Page 39: Chicago finance-big-data

Why the Difference?

Law of large numbers

Infinite variance

Interacting agents

Apologies and credit to Simon DeDeo, SFI

The space of all things that change

The space of interacting things

Page 40: Chicago finance-big-data

What Happens with Interactions

Social phenomena defeat the law of large numbers

Distributions are well modeled by “rich get richer” processes
– Pitman-Yor process, Indian buffet process

Limiting distributions are heavy tailed, power law

We see these distributions everywhere
– price of cotton in the 19th century
– word frequencies
– popularity of GitHub projects
– equity pricing and volumes
– sizes of cities
– popularity of web sites
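
A "rich get richer" process is easy to simulate. The sketch below uses the standard sequential (Chinese restaurant style) construction of the Pitman-Yor process. The discount, concentration, item count, and seed are arbitrary assumptions for illustration and do not come from the talk.

```python
import random

def pitman_yor_sizes(n_items, discount, concentration, rng):
    """Assign items to clusters one at a time with a rich-get-richer rule.

    With n items already placed in K clusters, the next item joins cluster k
    with probability (size_k - discount) / (n + concentration) and starts a
    new cluster with probability
    (concentration + discount * K) / (n + concentration).
    """
    sizes = []
    for n in range(n_items):
        r = rng.random() * (n + concentration)
        new_weight = concentration + discount * len(sizes)
        if r < new_weight:
            sizes.append(1)          # open a new cluster
            continue
        r -= new_weight
        for k, size in enumerate(sizes):
            r -= size - discount
            if r < 0:
                sizes[k] += 1        # bigger clusters are more likely to grow
                break
        else:
            sizes[-1] += 1           # guard against floating-point round-off
    return sorted(sizes, reverse=True)

rng = random.Random(2012)  # fixed seed so the run is repeatable
sizes = pitman_yor_sizes(5000, discount=0.5, concentration=1.0, rng=rng)
singleton_share = sum(1 for s in sizes if s == 1) / len(sizes)

print(f"clusters: {len(sizes)}")
print(f"five largest: {sizes[:5]}")
print(f"share of clusters with a single item: {singleton_share:.2f}")
```

A few clusters end up holding a large share of the items while many clusters hold only one or two, which is the heavy-tailed pattern named in the list above.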

Page 41: Chicago finance-big-data

What are the Implications?

Page 43: Chicago finance-big-data

In a Nutshell

Scalability is much more important than we thought

Mashups are more important than we thought

Network effects are more important than we thought

Exploration is more important than we thought

Hadoop style linear scaling must be mixed with ad hoc analysis

Page 44: Chicago finance-big-data

Thank You

Page 45: Chicago finance-big-data

whoami?

Ted Dunning
– @ted_dunning
– [email protected] (MapR distribution for Hadoop)
– [email protected] (Mahout, Hadoop, Lucene, Zookeeper, Drill)
– [email protected] (me)

More info:

http://www.mapr.com/company/events/hadoop-in-finance-2012