Accelerating the Core of the Cloud David C. Stewart (@davest)



Agenda
•  Core software optimization as a strategy
•  Case study: OpenStack
•  Notes from the LAMP stack
•  Invitation to join us


Core software optimization: a strategy


How to make systems run fast
•  Performance analysis and improvement rarely taught in school
   •  Sometimes attempted in graduate level coursework
•  Developers pressured to come up with features quickly
•  We *should* write efficient algorithms
   •  We’re usually debugging or adding late features
•  Besides, aren’t CPUs getting faster all the time?
   •  Thanks to Moore’s law they are
   •  But it takes time, and we can do better!

Photo credit: Genevieve Bell, Flickr

CPUs do get faster every generation

[Figure: Relative single-thread instructions per cycle, 2004 to 2014 (broad workload mixture)]

Broadwell Converged-Core
Higher performance than Haswell at iso-frequency
•  Larger out-of-order scheduler
•  Larger L2 TLB (1K → 1.5K entries), new 1G L2 TLB (16 entries)
•  2nd page miss handler for parallel page walks
•  Faster floating point multiplier (5 → 3 cycles), Radix-1,024 divider, faster vector Gather
•  Improved address prediction for branches and returns

Power efficiency
•  Performance features designed at ~2:1 Performance:Power ratio
•  Power gating and design optimization increase efficiency at every operating point

Security
•  Faster primitives for CRC, multiprecision arithmetic, and cryptography
•  Supervisor Mode Access Protection to stifle privilege escalation attacks

Monitoring
•  Architectural instruction tracing with Intel® Processor Trace
•  Additional quality-of-service features for data centers

Virtualization
•  5-10% faster round-trip latency to VMM helps all virtualized usage models
•  Additional features for fault tolerance in virtualized data centers
•  Faster APIC virtualization improves a common source of virtualization overhead

Broadwell’s generational gain in instructions per cycle is similar to Ivy Bridge.

•  Example: “5th Generation Core” processor (Broadwell)
•  Runs existing code faster
•  CPU design decisions based on tracing workloads
•  Compiler developers are also improving things
•  Is there a quicker way to get improvements?


The “core software” strategy
Don’t just speed up the program, speed up the compiler (or runtime) at the core
•  Find a workload which is representative of real usages
•  Analyze system behavior using the best tools possible (develop the tools if you have to)
•  Modify the compiler or runtime, rather than the application
   •  If the workload is representative of broad usages, you should speed up a lot of things, not just the application you are optimizing
•  Lather, rinse, repeat!
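The strategy starts with measurement. As a minimal sketch of the "find a workload, then analyze" steps, the snippet below profiles a workload with the standard-library cProfile and pstats modules. The function representative_workload is a hypothetical stand-in, not a workload from the talk; a real study would use something like COSBench and a hardware-level profiler.

```python
import cProfile
import io
import pstats


def representative_workload(n=20000):
    """Hypothetical stand-in workload: build strings, then sort them."""
    items = [str(i * 7919 % 1000) for i in range(n)]
    return sorted(items)


def profile_workload(func, top=5):
    """Run func under cProfile and return a text report of the top hotspots."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call trees come first.
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_workload(representative_workload))
```

A function-level report like this only says where the interpreter spends time; deciding what to change in the runtime still needs the architecture-level data discussed later in the deck.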


Examples at Intel for core software optimization

Existing practices – 20+ years
•  C/C++ Compiler – both gcc and the Intel C Compiler
•  Java (OpenJDK)

Emerging practices
•  Python
•  PHP / HHVM
•  Go and Node.js


Case Study: OpenStack


OpenStack: Open source cloud software
•  Open source software for creating private and public clouds
•  Infrastructure as a Service, basis for several usages
   •  Software Defined Infrastructure
   •  Software Defined Networking – with OpenDaylight
   •  Software Defined Storage
•  Intel is very involved in making OpenStack great:
   •  Platinum member of the OpenStack Foundation board of directors
   •  Employee on the technical committee (Dean Troyer)
   •  Top 10 contributor


OpenStack core software: Python
•  70% or more of OpenStack is written in Python
   •  Good candidate for core software focus
•  Python is being taught to new CS students, significant use in HPC, machine learning, systems programming
•  Swift – object storage in OpenStack
•  Benchmarks: COSBench (Cloud Object Storage Benchmark, developed by Intel) and ssbench

[Diagram: COSBench load generator driving a Swift cluster: proxy, auth, and storage nodes]


Python performance on Swift Storage Node
•  COSBench/Swift workload spends ~95% of cycles on the storage node (Xeon processor, codenamed “Haswell”)
   1.  ~62% of cycles is Python (which is at user level)
   2.  ~30% of cycles is at kernel level

[Chart: Swift performance on HSW-EP storage node, with the Python module highlighted]

Lots of opportunity for improvement!
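To put the ~62% figure in perspective, Amdahl's law gives a rough bound on how much a faster interpreter can help the whole node. This is a sketch under two assumptions: that the 62% is the share of storage-node cycles spent in Python, and that the interpreter speedups tried below are illustrative values only, not results from the talk.

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: whole-system speedup when only `fraction` gets faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


if __name__ == "__main__":
    python_share = 0.62  # from the slide: ~62% of cycles are in Python
    for s in (1.10, 1.25, 2.0):  # hypothetical interpreter speedups
        print("Python %.2fx faster -> node %.3fx faster"
              % (s, overall_speedup(python_share, s)))
```

The kernel-level share is untouched by interpreter work, which is why the bound stays well below the interpreter's own speedup.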


Python’s performance challenges
•  Significant gap to other languages
•  Interpreted language with rich ecosystem of add-ons
•  Common solution: create a Just in Time (JIT) compiler
   •  Structural changes to the interpreter break compatibility with libraries
   •  There is a JIT version (PyPy) but adoption is low due to library incompatibility

Workload                       Java7   Python 2.7.8   PyPy
K-Nucleotide (1 Thread)        1       7.5            6.7
K-Nucleotide (Multi-thread)    1       8.3            14.2
Binary Trees                   1       53.7           17.3
Reverse Complement             1       3.4            4.3
Josephus OO                    1       115.2          3.0
Josephus List Reduction        1       63.6           6.2
Josephus Recursion             1       19.1           7.6
Function Calls                 1       1392.9         11.4

Performance normalized on Java 7. Lower is better!
Source: http://benchmarksgame.alioth.debian.org/


But first, a reminder of CPU architecture

Fetch: read instruction bytes from memory
Decode: translate instruction bytes into data path control
Execute: perform the operation with a functional unit
Memory: read or write memory, if required
Commit: update architectural state

[Pipeline diagram: Fetch, Decode, Execute, Memory, Commit]

Processor “front end” – cycles stalled here should be less than 20%

How loaded is the pipeline when we’re running Swift?


Architecture awareness

[Figure: VTune data for the Swift Object Server]

Python / Swift is front-end bound (42%)!
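For readers unfamiliar with the metric, a front-end bound percentage can be approximated from two hardware counters. The sketch below uses the level-1 formula of the top-down analysis method as commonly documented for 4-wide Intel cores (undelivered micro-op slots divided by total issue slots). The counter values are made up to land near the 42% quoted above, and VTune's exact computation may differ.

```python
PIPELINE_WIDTH = 4  # issue slots per cycle; assumption for a 4-wide core


def frontend_bound(uops_not_delivered, core_cycles, width=PIPELINE_WIDTH):
    """Fraction of issue slots the front end failed to fill."""
    total_slots = width * core_cycles
    if total_slots <= 0:
        raise ValueError("core_cycles must be positive")
    return uops_not_delivered / total_slots


if __name__ == "__main__":
    # Made-up counter readings, chosen only to illustrate the arithmetic.
    cycles = 1_000_000
    not_delivered = 1_680_000
    share = frontend_bound(not_delivered, cycles)
    print("front-end bound: %.0f%%" % (share * 100))
```

Here uops_not_delivered would come from a counter such as IDQ_UOPS_NOT_DELIVERED.CORE and core_cycles from CPU_CLK_UNHALTED.THREAD, both read with a tool of your choice.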


Front-end bound causes
Typical causes of front-end bound execution:
•  Large code footprint
•  Misses in the Instruction Translation Lookaside Buffer (ITLB)
•  Misses in the instruction cache
•  Branch misprediction

All are good candidates for optimization!


Let the compiler handle optimization!
•  Profile-guided optimization (PGO) helps reduce front-end (FE) issues. With PGO, the FE metric for the Python 2.7 module was reduced by up to ~13% for Swift
•  PGO on ssbench/Swift:
   •  ~10% boost on Xeon (BDW) Proxy + Atom (AVT) Storage node
   •  ~2.25% boost on Xeon (HSW) Proxy and Xeon (HSW) Storage node
•  PGO gives ~10% boost for HPC oriented language workloads on both Xeon (HSW) and Atom (AVT)

1.  HPC Workloads:
2.  K-nucleotide
3.  N-body
4.  Spectral-norm
5.  Binary-trees
    (Workloads 1-5 taken from Language Benchmarks Game http://benchmarksgame.alioth.debian.org/u64/python.php)
6.  Josephus Problem https://github.com/dnene/josephus
    •  List reduction
    •  Object oriented
    •  Element-recursion

PGO giving considerable benefit!


Python PGO Adoption
•  Idea: add (and maintain) a PGO-generated profile to the upstream Python project sources
   •  “Train” the compiler using a complex workload (initially pybench)
   •  Generate a profile using the training
   •  Submit the profile upstream to be used in compiling Python by default
   •  Commit to maintaining the profile long term
•  Current status:
   •  Data collected showing significant improvement
   •  No known regressions
   •  Pull requests being created for both Python 2.7 and 3.5
•  Potential downside: profile becomes “stale” with source changes
•  Next step: a patch which adds our training profile to the git repo
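The train / generate / recompile cycle described above maps onto three compiler invocations. The sketch below only builds the command lines and runs nothing; it assumes gcc and its -fprofile-generate and -fprofile-use options, and the file names are hypothetical. CPython's real build drives PGO through its own Makefile, so treat this as an illustration of the flow rather than the project's procedure.

```python
import shlex


def pgo_build_plan(source, binary, training_cmd, cc="gcc"):
    """Return the three PGO steps as argv lists (nothing is executed)."""
    return [
        # 1. Instrumented build: the binary records execution counts.
        [cc, "-O2", "-fprofile-generate", source, "-o", binary],
        # 2. Training run: running the workload writes the profile data.
        list(training_cmd),
        # 3. Optimized rebuild: the compiler reads the recorded profile.
        [cc, "-O2", "-fprofile-use", source, "-o", binary],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    plan = pgo_build_plan("interp.c", "./interp", ["./interp", "train.py"])
    for argv in plan:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Each argv list can be handed to subprocess.run as-is. The "stale profile" downside shows up in step 3: if the sources change after step 2, the recorded counts no longer match the code being compiled.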


Python 2 Complications
•  Core services of OpenStack are implemented in Python 2.7
•  Porting to Python 3.4 is non-mechanical, a slow process
•  “no new features in Python 2.7” – Guido van Rossum
•  But … there is still considerable legacy code in Python 2 (like OpenStack)
•  GOOD NEWS: Guido will allow performance patches to 2.7 so long as they don’t complicate maintenance
•  First performance patch for 2.7 (computed GOTO) accepted in June 2015!
   •  Average improvement of 5%, ranging from -2% to +25% on GUPB (Grand Unified Python Benchmark)


Accelerating the core
Python is at the core
•  We’re speeding up Python
•  Which will speed up Swift
•  Which (along with other pieces written in Python) will speed up OpenStack


The P in LAMP


PHP acceleration
•  PHP has strong adoption as a language:
   •  Most popular language for websites
   •  80% of the 1.2B websites globally
   •  5M PHP developers worldwide
•  Similar performance challenges to Python
   •  No JIT, strong compatibility concerns for major changes
•  Analysis showed function hotspots in allocating/freeing
   •  Zend addressed these and the result was significant improvement
•  PHP7 upcoming release

Zend Infographic: http://www.zend.com/en/resources/php7_infographic

HHVM optimization
•  HipHop VM, developed by Facebook to accelerate their PHP code
•  Several assembly-level optimizations have resulted in generous improvements in real customer workloads
•  Facebook performance Lockdown in June
   •  Tried auto-FDO: 5% improvement
   •  Linker change to load hot functions together in the cache (2% improvement)
   •  Memory operations are 40% faster than glibc in some cases, drive improvements

WordPress improvement observed on Haswell

Compile with AVX2 enabled     5%
Memset() assembly tuning      1.8%
Memcpy() assembly tuning      3% (generic)


HHVM optimization
•  Analysis of front-end bound turned up poor I-cache use by linker generated shared library related code (PLTs)

    <__gmon_start__@plt+0>:   jmpq   *0x2c1bfe2(%rip)   # 0x36e0ee8 <[email protected]>
    <__gmon_start__@plt+6>:   pushq  $0x1
    <__gmon_start__@plt+11>:  jmpq   0xac4ee0
    <_ZdlPv@plt+0>:           jmpq   *0x2c1bfda(%rip)   # 0x36e0ef0 <[email protected]>
    <_ZdlPv@plt+6>:           pushq  $0x2
    <_ZdlPv@plt+11>:          jmpq   0xac4ee0

•  Separating the hot and cold portions of the code and packing together the hottest function calls resulted in a ~2% speedup of WordPress

    jmpq   *0x2de17da(%rip)   # 0x38b1620 <[email protected]>
    xchg   %ax,%ax
    jmpq   *0x2dde08a(%rip)   # 0x38aded8 <[email protected]>
    xchg   %ax,%ax
    jmpq   *0x2ddeaca(%rip)   # 0x38ae920 <[email protected]>
    xchg   %ax,%ax
    jmpq   *0x2de267a(%rip)   # 0x38b24d8 <[email protected]>
    xchg   %ax,%ax

•  A pull request for binutils has been submitted

This could yield an improvement for any executable that uses shared libraries heavily


Go optimization

•  Core language for many new cloud projects
   •  CloudFoundry, Kubernetes, Mesos, CoreOS/etcd, Docker server
•  Core function in CloudFoundry makes heavy use of the bytes.Compare() function
   •  Written in assembly language
•  Optimized through refactoring, vectorization and loop unrolling techniques (SSE2 instructions – coming: SSE4)
•  Results: runtime performance improved by 3% to 36%, depending on the size of input data
•  Upstream for Go 1.6 version
•  Many opportunities for further optimization
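The idea behind the vectorized compare can be shown without assembly. Go's bytes.Compare returns -1, 0, or +1 for a lexicographic comparison of two byte slices; the Python sketch below mirrors that contract and skips equal data one 16-byte block at a time (16 bytes being the width of an SSE2 register) before locating the first differing byte. It is an analogy only, not the Go runtime's code, and the function name compare_bytes is hypothetical.

```python
def compare_bytes(a, b, chunk=16):
    """Three-way lexicographic compare of two bytes objects: -1, 0, or 1."""
    limit = min(len(a), len(b))
    i = 0
    # Skip over equal blocks, `chunk` bytes at a time.
    while i + chunk <= limit and a[i:i + chunk] == b[i:i + chunk]:
        i += chunk
    # Finish byte by byte: locate the first difference, if any.
    while i < limit:
        if a[i] != b[i]:
            return -1 if a[i] < b[i] else 1
        i += 1
    # Common prefix is equal; the shorter input sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


if __name__ == "__main__":
    print(compare_bytes(b"cloud", b"cloudfoundry"))
    print(compare_bytes(b"swift", b"swift"))
```

The block-wise loop is where input size matters: short inputs never reach it, long equal prefixes spend most of their time there.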


How are we doing?

•  Performance of master changes daily
•  You can’t improve what you don’t measure
•  Needed: a community resource to build and measure performance
•  “0-Day” lab: daily build and benchmark
•  Results emailed to the community mailing lists
•  Demonstrates support for the community
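A daily build-and-benchmark loop needs a rule for deciding what counts as a regression. The sketch below compares today's timings with a baseline and flags anything slower by more than a tolerance. The benchmark names, timings, and the 5% threshold are hypothetical and are not taken from the 0-Day lab.

```python
def find_regressions(baseline, today, tolerance=0.05):
    """Return {benchmark: relative slowdown} for slowdowns above tolerance."""
    report = {}
    for name, old_time in baseline.items():
        new_time = today.get(name)
        if new_time is None or old_time <= 0:
            continue  # benchmark missing today, or unusable baseline
        change = (new_time - old_time) / old_time
        if change > tolerance:
            report[name] = change
    return report


if __name__ == "__main__":
    # Hypothetical timings in seconds; not measurements from the talk.
    baseline = {"bench_a": 2.00, "bench_b": 1.50}
    today = {"bench_a": 2.30, "bench_b": 1.49}
    for name, change in sorted(find_regressions(baseline, today).items()):
        print("%s slowed down by %.1f%%" % (name, change * 100))
```

In practice the tolerance has to sit above run-to-run noise, otherwise the mailing list fills with false alarms.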


Invitation to Join Us


What’s next?
•  Our work starts with representative workloads
•  Python in particular is used in areas beyond OpenStack
•  We need your ideas for workloads to optimize!
•  Send your ideas to:
   •  python-dev mailing list
   •  groups.google.com/forum/#!forum/golang-dev
   •  github.com/facebook/hhvm
   •  [email protected]


Thank you!


Thanks!
