Chicago Hadoop in Finance - Ted Dunning

A talk by Ted Dunning about what scalability really means in terms of interacting processes and the statistics of growth

1©MapR Technologies - Confidential

Scalability in Hadoop and Similar Systems

Big is the next big thing

Big data and Hadoop are exploding

Companies are being funded

Books are being written

Applications sprouting up everywhere

Slow Motion Explosion

Hadoop Explosion

Why Now?

But Moore’s law has applied for a long time

Why is Hadoop exploding now?

Why not 10 years ago?

Why not 20?

9/18/12

Size Matters, but …

If it were just availability of data then existing big companies would adopt big data technology first

They didn’t

Or Maybe Cost

If it were just a matter of net positive value, then finance companies should adopt first because they have higher opportunity value per byte

They didn’t

Backwards adoption

Under almost any threshold argument startups would not adopt big data technology first

They did

Everywhere at Once?

Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small

Why?

The Conventional Answer

More data is being produced more quickly

Data sizes are bigger than even a very large computer can hold

Cost to create and store continues to decrease

BUSTED!

Analytics Scaling Laws

Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns

The key to net value is how costs scale
– Old school: exponential scaling
– Big data: linear scaling, low constant

Cost/performance has changed radically
– IF you can use many commodity boxes
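
The net-value argument above can be sketched numerically. The curve shapes and constants here are assumptions for illustration, not taken from the talk: value saturates as `1 - exp(-effort)` to mimic the 80-20 rule, the old-school cost grows exponentially, and the big-data cost is linear with a low constant.

```python
import math

def value(effort):
    """Diminishing returns: most of the value arrives early (assumed shape)."""
    return 1.0 - math.exp(-effort)

def exponential_cost(effort, scale=0.05):
    """Old-school cost model: exponential in effort (illustrative constant)."""
    return scale * (math.exp(effort) - 1.0)

def linear_cost(effort, rate=0.05):
    """Big-data cost model: linear in effort with a low constant (illustrative)."""
    return rate * effort

def best_effort(cost, max_effort=10.0, steps=1000):
    """Grid-search the effort level that maximizes value minus cost."""
    grid = [max_effort * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda e: value(e) - cost(e))

old_optimum = best_effort(exponential_cost)
new_optimum = best_effort(linear_cost)

print("old-school optimum effort:", old_optimum,
      "net value:", value(old_optimum) - exponential_cost(old_optimum))
print("big-data optimum effort:", new_optimum,
      "net value:", value(new_optimum) - linear_cost(new_optimum))
```

Under these assumed curves the optimum sits at `ln(20) / 2` for the exponential cost and at `ln(20)` for the linear cost, so the worthwhile level of effort moves outward and the net value at the optimum rises.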

We knew that

We should have known that

We didn’t know that!

You’re kidding, people do that?

Anybody with eyes

Intern with a spreadsheet

In-house analytics

Industry-wide data consortium

NSA, non-proliferation

Net value optimum has a sharp peak well before maximum effort

But scaling laws are changing both slope and shape

More than just a little

They are changing a LOT!

Initially, linear cost scaling actually makes things worse

A tipping point is reached and things change radically …

Pre-requisites for Tipping

To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail

Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare

Yeah… but wait

The Standard Sort of Model

People talk about the law of large numbers as if it were …

Well, as if it were a law

It’s not …

It is a context- and assumption-dependent theorem

What if …

These assumptions are:

Changes have a stationary, independent, finite-variance distribution

What happens if these assumptions are wrong?

And which of them is really wrong?

For Example

End point has a nice, tractable distribution

What if the Assumptions are Wrong?

Take the finite variance assumption as a simple example

Dropping it leads to Lévy stable distributions

Like the Cauchy distribution
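
A minimal sketch of what losing finite variance does to averages, assuming standard normal steps for the well-behaved case and standard Cauchy steps for the heavy-tailed one. The seed, sample size, and checkpoints are arbitrary choices for illustration.

```python
import math
import random

def gaussian_step(rng):
    """Finite-variance step: standard normal."""
    return rng.gauss(0.0, 1.0)

def cauchy_step(rng):
    """Infinite-variance step: standard Cauchy via inverse-CDF sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))

def running_means(step, n, rng, checkpoints):
    """Return the sample mean of the first k steps for each k in checkpoints."""
    total = 0.0
    means = {}
    for i in range(1, n + 1):
        total += step(rng)
        if i in checkpoints:
            means[i] = total / i
    return means

rng = random.Random(42)  # fixed seed, chosen arbitrarily
checkpoints = {100, 10_000, 100_000}
gauss_means = running_means(gaussian_step, 100_000, rng, checkpoints)
cauchy_means = running_means(cauchy_step, 100_000, rng, checkpoints)

for k in sorted(checkpoints):
    print(k, gauss_means[k], cauchy_means[k])
```

The Gaussian running mean shrinks toward zero as the sample grows. The mean of n standard Cauchy draws is itself standard Cauchy, so the second column should not be expected to settle no matter how large n gets.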

Is it Really Different?

What About Real Life?

But is it Really Infinite Variance?

Or are there other kinds of phenomena that show this?

What about the independence assumption?

What if the supposedly independent components of the system communicate?

Like we do. Every day. All the time.

Why the Difference?

Law of large numbers

Infinite variance

Interacting agents

Apologies and credit to Simon DeDeo, SFI

The space of all things that change

The space of interacting things

What Happens with Interactions

Social phenomena defeat the law of large numbers

Distributions are well modeled by “rich get richer” processes
– Pitman-Yor process, Indian buffet process

Limiting distributions are heavy tailed, power law

We see these distributions everywhere
– price of cotton in the 19th century
– word frequencies
– popularity of GitHub projects
– equity pricing and volumes
– sizes of cities
– popularity of web sites
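
A "rich get richer" process can be sketched with the seating rule of the Pitman-Yor process (the Chinese restaurant construction). The function name and the `discount`, `concentration`, and `seed` values are illustrative assumptions, not taken from the talk.

```python
import random

def pitman_yor_tables(n, discount=0.5, concentration=1.0, seed=0):
    """Seat n customers one at a time; return table sizes, largest first.

    A customer joins existing table k with weight (size_k - discount) and
    opens a new table with weight (concentration + discount * num_tables),
    so popular tables tend to attract even more customers.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of customers at table k
    for seated in range(n):
        # The weights always sum to seated + concentration.
        r = rng.random() * (seated + concentration)
        new_table_weight = concentration + discount * len(sizes)
        if r < new_table_weight:
            sizes.append(1)
            continue
        r -= new_table_weight
        for k, size in enumerate(sizes):
            r -= size - discount
            if r < 0:
                sizes[k] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sorted(sizes, reverse=True)

sizes = pitman_yor_tables(5000)
print("tables:", len(sizes))
print("five largest:", sizes[:5])
print("median size:", sizes[len(sizes) // 2])
```

A few tables typically end up holding a large share of the customers while most tables stay tiny, which is the heavy-tailed pattern the examples above share.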

What are the Implications?

In a Nutshell

Scalability is much more important than we thought

Mashups are more important than we thought

Network effects are more important than we thought

Exploration is more important than we thought

Hadoop-style linear scaling must be mixed with ad hoc analysis

Thank You

whoami?

Ted Dunning
– @ted_dunning
– tdunning@maprtech.com (MapR distribution for Hadoop)
– tdunning@apache.org (Mahout, Hadoop, Lucene, ZooKeeper, Drill)
– ted.dunning@gmail.com (me)

More info:

http://www.mapr.com/company/events/hadoop-in-finance-2012
