Accelerating the Core of the Cloud
David C. Stewart (@davest)
Agenda
• Core software optimization as a strategy
• Case study: OpenStack
• Notes from the LAMP stack
• Invitation to join us
Core software optimization: a strategy
How to make systems run fast
• Performance analysis and improvement is rarely taught in school
  • Sometimes attempted in graduate-level coursework
• Developers are pressured to come up with features quickly
  • We *should* write efficient algorithms
  • We're usually debugging or adding late features
• Besides, aren't CPUs getting faster all the time?
  • Thanks to Moore's law they are
  • But it takes time, and we can do better!
Photo credit: Genevieve Bell, Flickr
CPUs do get faster every generation
[Chart: relative single-thread instructions per cycle, 2004-2014, broad workload mixture]
Broadwell Converged Core: higher performance than Haswell at iso-frequency
• Larger out-of-order scheduler
• Larger L2 TLB (1K → 1.5K entries), new 1G L2 TLB (16 entries)
• 2nd page miss handler for parallel page walks
• Faster floating-point multiplier (5 → 3 cycles), radix-1024 divider, faster vector Gather
• Improved address prediction for branches and returns

Power efficiency
• Performance features designed at ~2:1 performance:power ratio
• Power gating and design optimization increase efficiency at every operating point

Security
• Faster primitives for CRC, multiprecision arithmetic, and cryptography
• Supervisor Mode Access Protection to stifle privilege-escalation attacks

Monitoring
• Architectural instruction tracing with Intel® Processor Trace
• Additional quality-of-service features for data centers

Virtualization
• 5-10% faster round-trip latency to the VMM helps all virtualized usage models
• Additional features for fault tolerance in virtualized data centers
• Faster APIC virtualization improves a common source of virtualization overhead

Broadwell's generational gain in instructions per cycle is similar to Ivy Bridge's.
• Example: "5th Generation Core" processor (Broadwell)
• Runs existing code faster
• CPU design decisions are based on tracing workloads
• Compiler developers are also improving things
• Is there a quicker way to get improvements?
The "core software" strategy
Don't just speed up the program – speed up the compiler (or runtime) at the core.
• Find a workload which is representative of real usages
• Analyze system behavior using the best tools possible (develop the tools if you have to)
• Modify the compiler or runtime, rather than the application
  • If the workload is representative of broad usages, you should speed up a lot of things, not just the application you are optimizing
• Lather, rinse, repeat!
Examples at Intel for core software optimization
Existing practices (20+ years)
• C/C++ compilers – both gcc and the Intel C Compiler
• Java (OpenJDK)
Emerging practices
• Python
• PHP / HHVM
• Go and Node.js
Case Study: OpenStack
OpenStack: open source cloud software
• Open source software for creating private and public clouds
• Infrastructure as a Service, the basis for several usages:
  • Software Defined Infrastructure
  • Software Defined Networking – with OpenDaylight
  • Software Defined Storage
• Intel is very involved in making OpenStack great:
  • Platinum member of the OpenStack Foundation board of directors
  • Employee on the technical committee (Dean Troyer)
  • Top-10 contributor
OpenStack core software: Python
• 70% or more of OpenStack is written in Python – a good candidate for core software focus
• Python is being taught to new CS students, with significant use in HPC, machine learning, and systems programming
• Swift – object storage in OpenStack
• Benchmarks: COSBench (Common Object Storage Benchmark, developed by Intel) and ssbench
[Diagram: COSBench driving a Swift cluster – proxy, auth, and storage nodes]
Python performance on the Swift storage node
• The COSBench/Swift workload spends ~95% of cycles on the storage node (Xeon processor, codenamed "Haswell")
  • ~62% of cycles are in Python (at user level)
  • ~30% of cycles are at kernel level
[Chart: Swift performance on the HSW-EP storage node, cycles by Python module]
Lots of opportunity for improvement!
Python's performance challenges
• Significant gap to other languages
• Interpreted language with a rich ecosystem of add-ons
• Common solution: create a Just-in-Time (JIT) compiler
  • Structural changes to the interpreter break compatibility with libraries
  • There is a JIT version (PyPy), but adoption is low due to library incompatibility
Workload                      Java 7   Python 2.7.8   PyPy
K-Nucleotide (1 thread)       1        7.5            6.7
K-Nucleotide (multi-thread)   1        8.3            14.2
Binary Trees                  1        53.7           17.3
Reverse Complement            1        3.4            4.3
Josephus OO                   1        115.2          3.0
Josephus List Reduction       1        63.6           6.2
Josephus Recursion            1        19.1           7.6
Function Calls                1        1392.9         11.4

Performance normalized to Java 7 – lower is better!
Source: http://benchmarksgame.alioth.debian.org/
But first, a reminder of CPU architecture
Fetch – read instruction bytes from memory
Decode – translate instruction bytes into data path control
Execute – perform the operation with a functional unit
Memory – read or write memory, if required
Commit – update architectural state

Fetch → Decode → Execute → Memory → Commit

Processor "front end" – cycles stalled here should be less than 20%

Architecture awareness
How loaded is the pipeline when we're running Swift?
[VTune data for the Swift object server]
Python / Swift is front-end bound (42%)!
Front-end bound causes
Typical causes of front-end-bound execution:
• Large code footprint
• Misses in the Instruction Translation Lookaside Buffer (ITLB)
• Misses in the instruction cache
• Branch misprediction
All are good candidates for optimization!
Let the compiler handle optimization!
• PGO helps reduce front-end issues: with PGO, the front-end-bound metric for the Python 2.7 module was reduced by up to ~13% for Swift
• PGO on ssbench/Swift:
  • ~10% boost on Xeon (BDW) proxy + Atom (AVT) storage node
  • ~2.25% boost on Xeon (HSW) proxy and Xeon (HSW) storage node
• PGO gives a ~10% boost for HPC-oriented language workloads on both Xeon (HSW) and Atom (AVT)

Workloads:
• HPC workloads: K-nucleotide, N-body, Spectral-norm, Binary-trees (taken from the Language Benchmarks Game, http://benchmarksgame.alioth.debian.org/u64/python.php)
• Josephus Problem (https://github.com/dnene/josephus): list reduction, object oriented, element recursion

PGO is giving considerable benefit!
Python PGO adoption
• Idea: add (and maintain) a PGO-generated profile in the upstream Python project sources
  • "Train" the compiler using a complex workload (initially pybench)
  • Generate a profile from the training run
  • Submit the profile upstream to be used when compiling Python by default
  • Commit to maintaining the profile long term
• Current status:
  • Data collected showing significant improvement
  • No known regressions
  • Pull requests being created for both Python 2.7 and 3.5
• Potential downside: the profile becomes "stale" as the source changes
• Next step: a patch which adds our training profile to the git repo
Python 2 complications
• Core services of OpenStack are implemented in Python 2.7
• Porting to Python 3.4 is non-mechanical – a slow process
• "No new features in Python 2.7" – Guido van Rossum
• But there is still considerable legacy code in Python 2 (like OpenStack)
• GOOD NEWS: Guido will allow performance patches to 2.7 as long as they don't complicate maintenance
• First performance patch for 2.7 (computed GOTO) accepted in June 2015!
  • Average improvement of 5%, ranging from -2% to +25% on GUPB (Grand Unified Python Benchmark)
Accelerating the core
Python is at the core:
• We're speeding up Python
• Which will speed up Swift
• Which (along with other pieces written in Python) will speed up OpenStack
The P in LAMP
PHP acceleration
• PHP has strong adoption as a language:
  • Most popular language for websites
  • 80% of the 1.2B websites globally
  • 5M PHP developers worldwide
• Similar performance challenges to Python
  • No JIT; strong compatibility concerns for major changes
• Analysis showed function hotspots in allocating/freeing memory
• Zend addressed these, and the result was significant improvement in the upcoming PHP 7 release

Zend infographic: http://www.zend.com/en/resources/php7_infographic
HHVM optimization
• HipHop VM, developed by Facebook to accelerate their PHP code
• Several assembly-level optimizations have resulted in generous improvements in real customer workloads
• Facebook performance lockdown in June:
  • Tried AutoFDO: 5% improvement
  • Linker change to load hot functions together in the cache: 2% improvement
  • Memory operations are 40% faster than glibc in some cases, driving improvements

WordPress improvement observed on Haswell:
Compile with AVX2 enabled    5%
memset() assembly tuning     1.8%
memcpy() assembly tuning     3% (generic)
HHVM optimization
• Analysis of front-end bound turned up poor I-cache use by linker-generated shared-library code (PLTs):

<__gmon_start__@plt+0>:  jmpq  *0x2c1bfe2(%rip)  # 0x36e0ee8 <[email protected]>
<__gmon_start__@plt+6>:  pushq $0x1
<__gmon_start__@plt+11>: jmpq  0xac4ee0
<_ZdlPv@plt+0>:          jmpq  *0x2c1bfda(%rip)  # 0x36e0ef0 <[email protected]>
<_ZdlPv@plt+6>:          pushq $0x2
<_ZdlPv@plt+11>:         jmpq  0xac4ee0

• Separating the hot and cold portions of the code and packing together the hottest function calls resulted in a ~2% speedup of WordPress:

jmpq *0x2de17da(%rip)  # 0x38b1620 <[email protected]>
xchg %ax,%ax
jmpq *0x2dde08a(%rip)  # 0x38aded8 <[email protected]>
xchg %ax,%ax
jmpq *0x2ddeaca(%rip)  # 0x38ae920 <[email protected]>
xchg %ax,%ax
jmpq *0x2de267a(%rip)  # 0x38b24d8 <[email protected]>
xchg %ax,%ax

• A pull request for binutils has been submitted

This could yield an improvement for any executable that uses shared libraries heavily.
Go optimization
• Core language for many new cloud projects:
  • CloudFoundry, Kubernetes, Mesos, CoreOS/etcd, Docker server
• Core function in CloudFoundry makes heavy use of the bytes.Compare() function
  • Written in assembly language
  • Optimized through refactoring, vectorization, and loop-unrolling techniques (SSE2 instructions – SSE4 coming)
• Results: runtime performance improved by 3% to 36% depending on the size of the input data
• Upstreamed for the Go 1.6 release
• Many opportunities for further optimization
How are we doing?
• Performance of master changes daily
• You can't improve what you don't measure
• Needed: a community resource to build and measure performance
• "0-Day" lab: daily build and benchmark
  • Results emailed to the community mailing lists
  • Demonstrates support for the community
Invitation to Join Us
What's next?
• Our work starts with representative workloads
• Python in particular is used in areas beyond OpenStack
• We need your ideas for workloads to optimize!
• Send your ideas to:
  • python-dev mailing list
  • groups.google.com/forum/#!forum/golang-dev
  • github.com/facebook/hhvm
  • [email protected]
Thank you!