50
High Performance Erlang: Pitfalls and Solutions MZSPEED Team Machine Zone, Inc. 1 Machine Zone, Inc

High Performance Erlang - Pitfalls and Solutions

Embed Size (px)

Citation preview

Page 1: High Performance Erlang - Pitfalls and Solutions

High Performance Erlang: Pitfalls and Solutions

MZSPEED Team

Machine Zone, Inc.

1 Machine Zone, Inc

Page 2: High Performance Erlang - Pitfalls and Solutions

Agenda  

•  Erlang at Machine Zone •  Issues and Roadblocks •  Lessons Learned •  Profiling Techniques •  Case Study: Counters

2 Machine Zone, Inc

Page 3: High Performance Erlang - Pitfalls and Solutions

Erlang  at  Machine  Zone  

•  High performance data processing system •  How to make it fast?

3 Machine Zone, Inc

Page 4: High Performance Erlang - Pitfalls and Solutions

4

Typical  Approach:  Pipelined  tasks  

Execute different logic in different processes

receiver processor sender

Machine Zone, Inc

Page 5: High Performance Erlang - Pitfalls and Solutions

How do we exchange data between processes?

How do we make it fast?

5 Machine Zone, Inc

Page 6: High Performance Erlang - Pitfalls and Solutions

ETS

6 Machine Zone, Inc

Page 7: High Performance Erlang - Pitfalls and Solutions

0

5 M

10 M

15 M

0

2

4

6

8

10

12

14

16

Upd

ate

oper

atio

ns (t

otal

)

Num

ber o

f pro

cess

es u

pdat

ing

Timeline

expectedprocesses updating

0

5 M

10 M

15 M

0

2

4

6

8

10

12

14

16

Upd

ate

oper

atio

ns (t

otal

)

Num

ber o

f pro

cess

es u

pdat

ing

Timeline

expectedprocesses updating

update operations

7 Machine Zone, Inc

ETS:  update_counter,  10  Keys:    Expected  vs  Observed  Performance  

•  Observed Performance is far worse than Expected

Page 8: High Performance Erlang - Pitfalls and Solutions

8 Machine Zone, Inc

ETS:  update_counter,  Multiple  Keys:  Observed  Performance  

0

5 M

10 M

15 M

0

2

4

6

8

10

12

14

16

Upd

ate

oper

atio

ns (t

otal

)

Num

ber o

f pro

cess

es u

pdat

ing

Timeline

1 key10 keys

100 keys1000 keys

•  Performance is degraded even with multiple keys

Page 9: High Performance Erlang - Pitfalls and Solutions

9 Machine Zone, Inc

ETS  has  Locks  

•  Fine-grained locks •  64 locks per table by default

Page 10: High Performance Erlang - Pitfalls and Solutions

Message Passing

10 Machine Zone, Inc

Page 11: High Performance Erlang - Pitfalls and Solutions

1 10 100 1000 N

1

10

100

M

11 Machine Zone, Inc

Message  Passing:    M  Processes  to  N  Processes  

10:100

Page 12: High Performance Erlang - Pitfalls and Solutions

12 Machine Zone, Inc

Message  Passing:  Scheduling  Locks  

•  Add to runq with locks

Page 13: High Performance Erlang - Pitfalls and Solutions

13 Machine Zone, Inc

Message  Passing:  Scheduler  Communication  Locks  

•  Task stealing with locks

Page 14: High Performance Erlang - Pitfalls and Solutions

14 Machine Zone, Inc

Message  Passing  

•  Limit on #messages delivered per second

•  < 10M msgs/sec on MacBook Pro

•  < 22M msgs/sec on a powerful 56-core server

•  Independent of #processes

Page 15: High Performance Erlang - Pitfalls and Solutions

What if we want to achieve more?

15 Machine Zone, Inc

Page 16: High Performance Erlang - Pitfalls and Solutions

16 Machine Zone, Inc

Solution  1:  Minimize  Message  Exchange  Do everything in the same process and shard

receiver processor

sender

receiver processor

sender

receiver processor

sender

Page 17: High Performance Erlang - Pitfalls and Solutions

17 Machine Zone, Inc

Solution  1  Speed  

•  100x the #ETS updates •  20x–30x the #message exchanges •  No decrease with increase in #processes

0

20 M

40 M

60 M

80 M

100 M

0

20

40

60

80

100

Itera

tions

per

sec

Proc

esse

s

iterationsprocesses

Page 18: High Performance Erlang - Pitfalls and Solutions

18 Machine Zone, Inc

Solution  2:  Message  Batching  Batch messages: 5x–30x more msgs/sec

1 10 100 1000 N

1

10

100

M

10:100

Page 19: High Performance Erlang - Pitfalls and Solutions

19 Machine Zone, Inc

Solution  3:  Domain-­‐Specific  Optimizations  

•  Deliver one message to

multiple recipients:

1.  Receive, process, store to

shared memory using NIF

2.  Read from the shared

memory, send using push

notification

sender

receiver processor

shared memory

Total throughput of 600M msgs/sec to many recipients

shared memory

Page 20: High Performance Erlang - Pitfalls and Solutions

20 Machine Zone, Inc

High Performance Erlang: Lessons Learned

Page 21: High Performance Erlang - Pitfalls and Solutions

21 Machine Zone, Inc

Lesson  1:  Use  Sharding  

•  Avoid having single process •  Use a pool of non-communicating shards •  erlang:phash2(Key, NumberOfShards)

Shard 2

Shard 1 receive  

receive  

send  

send  

No  communication  

Page 22: High Performance Erlang - Pitfalls and Solutions

0

2 M

4 M

6 M

8 M

10 M

12 M

0 1 M 2 M 3 M 4 M 5 M

Tim

e ta

ken

(us)

Number of insertions

ets setmaps

process dictionary

22 Machine Zone, Inc

Lesson  2:  Use  Process  Dictionary  •  Quicker insertions in process dictionary compared to

maps

Page 23: High Performance Erlang - Pitfalls and Solutions

0

5 k

10 k

15 k

20 k

25 k

1 10 100 1000 10000 100000

Tim

e fo

r 100

k up

date

s (m

s)

Number of elements

R18 mapsR18 process dictionary

23 Machine Zone, Inc

Lesson  2:  Use  Process  Dictionary  

•  Quicker update operations in process dictionary compared to maps (Erlang 18)

Page 24: High Performance Erlang - Pitfalls and Solutions

24 Machine Zone, Inc

Lesson  3:  Use  Limited  #Processes  

•  Don’t create too many processes •  pid2proc lock for process resolution

•  Scheduler has to manage the processes •  Fixed number of schedulers

•  Don’t spawn/stop processes frequently •  Process resolution again

Page 25: High Performance Erlang - Pitfalls and Solutions

Profiling Techniques

25 Machine Zone, Inc

Page 26: High Performance Erlang - Pitfalls and Solutions

26 Machine Zone, Inc

pstack:  Stack  Sampling  •  pstack (gdb): sampling call stack over multiple threads

•  $ pstack 1234 •  $ gdb -p 1234 •  (gdb) thread apply all bt 25

Page 27: High Performance Erlang - Pitfalls and Solutions

27 Machine Zone, Inc

EEP:  Erlang  Easy  Profiling  

•  https://github.com/virtan/eep •  Based on dbg module, low overhead trace ports •  Easy to use •  Kcachegrind visualization

Basic Usage •  Collect runtime data to local file

•  > eep:start_file_tracing("abc"), timer:sleep(10000), eep:stop_tracing().

•  Convert to callgrind format •  > eep:convert_tracing(“abc").

•  Visualize with kcachegrind •  $ kcachegrind callgrind.out.abc

Page 28: High Performance Erlang - Pitfalls and Solutions

28 Machine Zone, Inc

EEP:    Example  Visualization  

Page 29: High Performance Erlang - Pitfalls and Solutions

29 Machine Zone, Inc

oprofile:  System  Level  Profiling  

•  Initialization $ opcontrol –deinit $ echo 0 > /proc/sys/kernel/nmi_watchdog •  Profiling $ operf --vmlinux /usr/lib/debug/vmlinux --callgraph ./rebar eunit apps=perf_test •  Reporting and Visualization $ opreport -cgf | ./gprof2dot.py -f oprofile --show-samples | dot -Tpng -o output.png

Page 30: High Performance Erlang - Pitfalls and Solutions

30 Machine Zone, Inc

oprofile:  Example  

Page 31: High Performance Erlang - Pitfalls and Solutions

Case Study: Counters

31 Machine Zone, Inc

Page 32: High Performance Erlang - Pitfalls and Solutions

32 Machine Zone, Inc

Need  for  High  Performance  Counters  

•  Thousands of global counters

•  Updated by several hundred processes every second

Page 33: High Performance Erlang - Pitfalls and Solutions

33 Machine Zone, Inc

Libraries  Used  

•  Exometer

•  ETS

•  Folsom

Page 34: High Performance Erlang - Pitfalls and Solutions

Benchmark  Environment  

•  Platform •  2 CPU sockets •  CPU - Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz •  14 hardware threads per socket •  2 hyper-threaded threads for each hardware thread

•  Erlang Runtime  Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:56:56]

[async-threads:10] [kernel-poll:false]

34 Machine Zone, Inc

Page 35: High Performance Erlang - Pitfalls and Solutions

0

1 M

2 M

3 M

4 M

5 M

6 M

7 M

10 20 30 40 50 60

Num

cou

nter

upd

ates

per

pro

cess

Num processes

ets counterexometer counter

exometer fast counterfolsom counter

35 Machine Zone, Inc

Benchmark  Results  •  As number of processes increases, number of

counter updates per process decreases

Page 36: High Performance Erlang - Pitfalls and Solutions

36 Machine Zone, Inc

Limitations  

•  Exometer: •  counters – scales well but slow •  fast counters – doesn’t scale

•  Lock contention in ets:update_counter()

Page 37: High Performance Erlang - Pitfalls and Solutions

MZMETRICS

Our Solution:

37 Machine Zone, Inc

Page 38: High Performance Erlang - Pitfalls and Solutions

38 Machine Zone, Inc

Refresher  on  Cacheline  

Source:  http://www.slideshare.net/yzt/modern-­‐cpus-­‐and-­‐caches-­‐a-­‐starting-­‐point-­‐for-­‐programmers/52  

Page 39: High Performance Erlang - Pitfalls and Solutions

39 Machine Zone, Inc

Occurs when two CPUs access different data structures on the same cache line

False  Sharing  (SMP)  

CPUO CPU1

struct foo struct bar

Cache  line  

Page 40: High Performance Erlang - Pitfalls and Solutions

40

CPUO

Main memory

L2

L2

CPU1

foo bar foo bar foo bar

FETCH  foo   FETCH  bar  

FETCH  foo   FETCH  bar  

False  Sharing  (SMP)  

Machine Zone, Inc

Page 41: High Performance Erlang - Pitfalls and Solutions

Main memory

41

False  Sharing  (SMP)  

CPUO

L2

L2

CPU1

foo bar

WRITE  foo  

WRITE  foo  

INVL  bar  

INVL  bar  

foo bar

foo bar

ACK  

ACK  ACK  

ACK  

Machine Zone, Inc

Page 42: High Performance Erlang - Pitfalls and Solutions

Main memory

42

CPUO

L2

L2

CPU1

WRITE  bar  

WRITE  bar  INVL  foo  

INVL  foo  

foo bar

ACK  

ACK  ACK  

ACK  

Machine Zone, Inc

False  Sharing  (SMP)  

Page 43: High Performance Erlang - Pitfalls and Solutions

43 Machine Zone, Inc

Benefits  of  Per-­‐CPU  Data  

Source:  http://lwn.net/Articles/256433/  

Time taken for 500 million counter increment operations on Intel Core 2 QX 6700

Page 44: High Performance Erlang - Pitfalls and Solutions

44 Machine Zone, Inc

Benefits  of  Per-­‐CPU  Data:  Reasons  

•  Less overhead of cache coherency protocol

•  Significantly higher benefits with QPI (Quick Path Interconnect)

Page 45: High Performance Erlang - Pitfalls and Solutions

45 Machine Zone, Inc

Transactional  Memory  

•  Intel TSX (Transactional Synchronizations Extensions)

•  Helps with locks in small critical sections

•  Not yet available in x86 (disabled due to hardware bugs)

Page 46: High Performance Erlang - Pitfalls and Solutions

46 Machine Zone, Inc

MZMETRICS  

•  Design based on per-CPU counters •  Counter Operations: •  Create a new counter – requires lock •  Read/update counter – no lock required

•  Delete counter – requires lock

•  Implemented using NIF

Page 47: High Performance Erlang - Pitfalls and Solutions

47 Machine Zone, Inc

MZMETRICS  Evaluation    

0 2 M4 M6 M8 M

10 M12 M14 M16 M18 M20 M

10 20 30 40 50

Num

cou

nter

upd

ates

per

pro

cess

Num processes

exometermzmetrics

•  10X faster than Exometer counters

Page 48: High Performance Erlang - Pitfalls and Solutions

48 Machine Zone, Inc

MZMETRICS  Evaluation:  Results    

•  1 million counter updates with 48 processes •  Exometer takes ~ 1 seconds •  MZMETRICS takes ~0.1 seconds

Page 49: High Performance Erlang - Pitfalls and Solutions

49 Machine Zone, Inc

Takeaways  •  Pitfalls •  Message passing becomes expensive at scale •  Large number of processes needing pid

resolution from a global table causes contention

•  Existing libraries not amiable for fast counting

•  Proposed Solutions •  Use sharding & Limit messaging passing •  Limit number of active processes •  Use MZMETRICS for fast counters

Page 50: High Performance Erlang - Pitfalls and Solutions

50 Machine Zone, Inc

We are Hiring http://www.machinezone.com/careers