
Build High Performance Apps on Multicore Systems Using Sun Studio Compilers and Tools


Sun Tech Days / Sun Studio - # 1

Build High Performance Apps on Multicore Systems Using Sun Studio Compilers and Tools

Don Kretsch
Senior Director, Sun Developer Tools
Sun Microsystems

Sun Studio Compilers and Tools

Sun Tech Days / Sun Studio - # 2

Microprocessor Trends: Where's my 10GHz CPU?

Between 1993 and 1999, the average CPU clock speed increased tenfold; since then, it hasn't even doubled.
The historical approach to performance (increasing clock speed, pipelining, and cache) is being negated by heat, power consumption, and slow memory.

The Clock Race is Over!

Sun Tech Days / Sun Studio - # 3

Multi-Core Revolution: Putting transistors to work in a new way

Sun UltraSPARC T2: 8 cores * 1.4GHz (64 threads in a chip)
Intel Clovertown: 4 cores * 2.66GHz (4 threads in a chip)
AMD Barcelona: 4 cores * 2.0GHz (4 threads in a chip)

Every new system is powered by a multi-core chip!

Sun Tech Days / Sun Studio - # 4

Developer's Needs Have Changed

Performance
• Everyone loves fast
• Take advantage of latest HW features and performance attributes

Parallelism
• Multi-core is here!
• Sun's Niagara2 leads with 64 threads / 8 cores per chip

(Open) Platforms
• Linux, Solaris
• SPARC and x86/x64
• Equal treatment for all platforms

Productivity
• IDE is important to speed development
• No dominant vendor/project on Linux

Sun Tech Days / Sun Studio - # 5

Significant Challenges Remain

Performance
• Architectures are changing (too?) fast
• Old tricks are no longer sufficient

Parallelism
• No single parallelism model
• Incredibly hard to parallelize serial apps
• Data races and deadlocks are common

Platforms
• G++ incompatibilities
• Constantly evolving ABI on Linux ...
• No uniformity in Linux platforms

Productivity
• Lack of advanced toolchain
• New generation uses IDEs, but ...
• Poor satisfaction with C, C++ IDEs on Linux ...

Sun Tech Days / Sun Studio - # 6

Sun Studio 12

Performance. Parallelism. Productivity. Platforms.

• Simplify Multi-core Development
• Maximize Application Performance
• Single source for Linux and Solaris, SPARC and x86
• Modern, productive IDE
• Sun Developer Services

Sun Tech Days / Sun Studio - # 7

Performance: Build Fast Applications

Sun Tech Days / Sun Studio - # 8

Compilers Deliver World Record Performance

Best in Class SPECint2006: 69% faster than IBM BladeCenter LS21, 28% faster than HP ProLiant BL20p G4
Best in Class SPECOMP M2001: 126% faster than IBM/Power5
Best in Class SPECOMP L2001: 11% faster than HP DL585 G2
Best in Class SPECint_rate2006
x86 champ on SPECfp_rate2000
Fastest SPECfp_2000 system on planet (7/2006), beating even IBM Power5+ systems
Sun snatches two World Records in the brand new SPEC CPU2006 benchmark: both SPECint2006 and SPECfp2006
Sun Niagara2 (8 cores / 64 threads): best chip score on SPECint_rate2006 and SPECfp_rate2006

Systems shown: Sun Blade X6250, Sun Blade X6220, Sun Fire X4600, Sun Fire X4200, X4100, X2200_M2, X2100_M2

World record count in past 12 months: 5 on Sun Blade 6000 systems; 10 on Sun Fire X4600; 1 on Sun Fire X4500; 2 each on Sun Fire X2100_M2, X2200, X4100, X4200

Sun Tech Days / Sun Studio - # 9

Maximize Application Performance

Sun compilers continue World Record Performance tradition
> Set over 25 world records in the past 12 months, and more to come
> World Records in EACH category: SPECint 2006, SPECfp 2006, SPECintrate 2006, SPECfprate 2006 and SPEC OMP2001
> World Records on each architecture, scaling from 1 core / 1 socket to 128 cores / 64 sockets: UltraSPARC T2 (Niagara2), UltraSPARC-IV+, SPARC64 VI systems, Intel/Woodcrest, AMD/Opteron
> Sun SPARC Enterprise M9000 system tops 1-TeraFLOP barrier
Significant lead over GCC
> 18% - 52% on SPARC (SPEC2006)
> 11% - 18% on x86/AMD (SPEC2006)
> 70%+ on STREAM
Improved optimized debugging abilities
New compilers make a significant difference over older releases as well as competitors

Sun Tech Days / Sun Studio - # 10

Runtime Performance Optimizations

Sun invests in compiler performance
Optimized performance for each target system: UltraSPARC, x86, and x64, for maximal system utilization

Common Optimizations
> Highly optimized code generation
> Automatic parallelization and vectorization
> High-level loop transformations
> Interprocedural optimizations
> Options to exploit advanced architecture pipelines, cache, chips
> Profile-Guided Optimizations
> Aggressive inlining and cloning
> Advanced OpenMP support
> More efficient machine resource utilization (throughput)
> Optimization of (-xbuiltin) calls
> Inline template (assembly) code
> Alias-based type disambiguation
> Prefetch support for newer systems
> Linker scoped variables

x86/x64 Optimizations
> P4, SSE2 instr in assembler
> Handle P4, SSE2 in inlines
> SSE2 instruction scheduling
> Strength reduction
> Branch prediction
> Induction variable elim
> Invariant hoisting
> Loop interchange
> Loop unswitching
> Alignment of symbol blocks
> Loop unrolling
> Alignment
> Constant propagation
> Vectorization

UltraSPARC Optimizations
> Binary optimizations to improve cache locality
> Niagara, US-IV+, US-IIIi optimizations
> Modulo Scheduling
> Block Scoped optimizations
> Linkopt
> Class Hierarchy Analysis and Optimization
> KPIC optimizations
> New CoolTools for UltraSPARC development: ATS, SPOT, ...

Optimized Math Libs
> libm, libmvec, libmopt, libmil
> libsunmath
> Maximally optimized advanced math libraries (BLAS, FFT, LAPACK)
> mediaLib, SSE(Math) intrinsics

Sun Tech Days / Sun Studio - # 11

Platforms: Unifying Solaris and Linux development

Sun Tech Days / Sun Studio - # 12

Full Support for Solaris and Linux

Background
> Customers have heterogeneous Solaris/Linux environments
> Sun software portfolio supports Linux
> Sun Studio 9 introduced partial support for Linux: IDE, debugger, profiler
> Sun Studio 12 added compilers and libraries to complete the offering
Key Features
> Complete feature set now available on Linux: C, C++, and Fortran compilers, optimized libraries, tools, etc.
> Stable C++ ABI, now available for Linux
> Improved GCC compatibility
> Ease Multi-Platform Development
> Fully enterprise-class support available

Sun Tech Days / Sun Studio - # 13

Compilers on Linux
Same features, same source, same components, same performance

C, C++, Fortran Language Systems
Standard C++ libraries, libgc, lint, GPC, ...
Optimized Math libraries, including SunPerfLib
OpenMP 2.5 APIs, TLS, MPI libraries, ...
Popular G++ and GCC extensions, including
> asm_inlines, __attribute__
> g++ ABI for interoperability
> Linux Kernel compiled with Sun Compilers
Express Program: more than 3000 downloads, more than 800 active users
Sun Studio Forum: 600+ messages

Sun Tech Days / Sun Studio - # 14

Be Smart. Be Compatible.

Compatibility between releases
> Allows developers to upgrade their environment and continue innovating (versus reworking code)
> Leader in C++ ABI compatibility: link with objects produced by earlier versions
Enhanced GCC compatibility
> Eases adoption of Sun Studio for GCC-based developers
> Improved source and binary compatibility
Solaris Binary Compatibility Guarantee
> With Sun Studio software
> Source and binary compatibility

Sun Tech Days / Sun Studio - # 15

Parallelism: Developing for a Multi-core future

Sun Tech Days / Sun Studio - # 16

Compiler Support for Parallel Apps

[Diagram: layered stack, top to bottom]
Application
Programming models: AutoPar, MPI, MT, OpenMP (the diagram places them along an Easiest to Hardest axis)
Sun Studio Developer Tools
Solaris: Event Ports, POSIX Threads, Solaris Threads, Atomic Operations, libumem
Hardware: UltraSPARC T1/T2; SPARC64 VI, UltraSPARC IV+; Intel/AMD x86/x64

Sun Tech Days / Sun Studio - # 17

AutoPar: Compiler Works for You

Sun Tech Days / Sun Studio - # 18

Autopar: SPECfp 2006 improvements

[Bar chart: per-benchmark results for Base Flags versus + Autopar; value axis from 0 to 27.5]
Benchmarks: bwaves, gamess, milc, zeusmp, gromac, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, gemsFDT, tonto, lbm, wrf, sphinx3

Woodcrest box: 3.0GHz dual-core, PARALLEL=2
Overall Gain: 16%

Sun Tech Days / Sun Studio - # 19

Automatic Vectorization

Support for Fortran, C and C++ applications
-xvector=simd exploits special SSE2+ instructions
Works on data in adjacent memory locations
Gains are smaller than with -xautopar
SPECfp 2006 gains are 3% overall and in the 1-7% range individually
Best suited for loop-level SIMD parallelism

Scalar loop:
    for (i=0; i<1024; i++)
        c[i] = a[i] * b[i];

Vectorized form (four elements per iteration):
    for (i=0; i<1024; i+=4)
        c[i:i+3] = a[i:i+3] * b[i:i+3];
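
The transformation on this slide can be sketched in plain C. The function names below are illustrative and not part of Sun Studio; the blocked version only mimics in scalar code what a vectorizing compiler might emit as a single SIMD multiply over four adjacent floats.

```c
#include <stddef.h>

#define N 1024   /* array length used on the slide */

/* Scalar form: one multiply per iteration. Iterations are independent,
   which is what makes the loop a candidate for vectorization. */
void multiply_scalar(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] * b[i];
}

/* Strip-mined form: the outer loop advances by 4 and the inner loop covers
   elements i..i+3, mirroring the c[i:i+3] notation. Four floats fit in one
   128-bit SSE register. */
void multiply_blocked(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < N; i += 4)
        for (size_t j = i; j < i + 4; j++)
            c[j] = a[j] * b[j];
}
```

Both functions produce the same output; the second just makes the grouping of adjacent elements explicit.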

Sun Tech Days / Sun Studio - # 20

Tools support for Parallel Apps: Thread Analyzer

Detects data races and deadlocks in a multithreaded application
> Points to non-deterministic or incorrect execution
> Bugs are notoriously difficult to detect by examination
> Points out actual and potential deadlock situations
Process:
> Instrument the code with -xinstrument=datarace
> Detect runtime condition with collect -r all [or race, detection]
> Use the Graphical Analyzer, tha, to identify conflicts and critical regions
Works with OpenMP, Pthreads, Solaris Threads
API provided for user-defined synchronization primitives
Works on Solaris (SPARC, x86/x64) and Linux
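
As a rough illustration of the kind of defect a race detector is meant to flag, here is a minimal OpenMP sketch. The function names are hypothetical, and the example has not been run through Thread Analyzer; it only shows an unsynchronized shared update next to one possible fix.

```c
#include <stddef.h>

/* Racy version: when built with OpenMP enabled, every thread does an
   unsynchronized read-modify-write on the shared variable `hits`,
   so increments can be lost. */
long count_even_racy(const int *v, size_t n)
{
    long hits = 0;
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        if (v[i] % 2 == 0)
            hits++;                 /* data race on hits */
    return hits;
}

/* One possible fix: make the shared update atomic. */
long count_even(const int *v, size_t n)
{
    long hits = 0;
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        if (v[i] % 2 == 0) {
            #pragma omp atomic
            hits++;
        }
    }
    return hits;
}
```

A reduction clause would usually scale better than an atomic update; the atomic form is used here because it keeps the shared variable visible in the code.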

Sun Tech Days / Sun Studio - # 21

Tools support for Parallel Apps (2)

Multi-thread aware Debugger
> Browse, select, view active threads
> Monitor thread entry point, PC, events, LWPs
> POSIX threads and OpenMP code debugging
lock_lint static source code lock analyzer
> Analyzes the use of mutex and multiple readers/single writer locks
> Reports on inconsistent usage of locks that may lead to data races and deadlocks
Performance Analyzer support for all MT models

Sun Tech Days / Sun Studio - # 22

What is OpenMP

De facto industry standard API for writing shared-memory parallel applications in C, C++ and Fortran
Consists of
> Compiler directives (pragmas)
> Runtime routines (libmtsk)
> Environment variables
Advantages:
> Incremental parallelization of source code
> Small(er) amount of programming effort
> Good Performance and Scalability
> Portable across variety of vendor compilers
Sun Studio has consistently led in supporting the latest version (currently v2.5, work underway for v3.0)

Sun Tech Days / Sun Studio - # 23

OpenMP: Parallelization Directives
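
The slide content for this heading did not survive extraction. As a minimal sketch of what a parallelization directive looks like in C, the example below uses a `parallel for` directive with a `reduction` clause and one runtime routine. The function names are hypothetical. The standard `OMP_NUM_THREADS` environment variable would control the thread count at run time; if the compiler is not in OpenMP mode the pragma is ignored and the code runs serially with the same result.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>    /* runtime routines, only present in OpenMP builds */
#endif

/* Directive: split the loop across threads; each thread keeps a private
   partial sum that is combined at the end by the reduction clause. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Runtime routine: report how many threads a parallel region may use. */
int worker_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

This reflects the incremental style the previous slide describes: the serial loop is left intact and a single directive is added above it.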

Sun Tech Days / Sun Studio - # 24

An OpenMP Example

Find the primes up to 3,000,000 (216816)
Run on Sun Fire 6800, Solaris 9, 24 processors 1.2GHz US-III+, with 9.8GB main memory

Model    # threads   Time (secs)   % change
Serial   N/A         6.636         Base
OpenMP   1           7.210         8.65% drop
OpenMP   2           3.771         1.76x faster
OpenMP   4           1.988         3.34x faster
OpenMP   8           1.090         6.09x faster
OpenMP   16          0.638         10.40x faster
OpenMP   20          0.550         12.06x faster
OpenMP   24          0.931         Saturation drop
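
The source code behind this benchmark is not shown in the slides. A minimal sketch of one way to count primes with OpenMP follows; the trial-division test, the function names, and the `schedule(dynamic, 1000)` chunk size are all assumptions made for illustration, not the program that produced the timings above.

```c
#include <stdbool.h>

/* Trial division by odd numbers up to sqrt(n). */
static bool is_prime(long n)
{
    if (n < 2) return false;
    if (n < 4) return true;             /* 2 and 3 */
    if (n % 2 == 0) return false;
    for (long d = 3; d * d <= n; d += 2)
        if (n % d == 0) return false;
    return true;
}

/* Count primes in [2, limit]. Larger candidates take longer to test, so a
   dynamic schedule is used to balance work; the reduction clause gives each
   thread a private counter and sums them at the end. */
long count_primes(long limit)
{
    long count = 0;
    #pragma omp parallel for schedule(dynamic, 1000) reduction(+:count)
    for (long n = 2; n <= limit; n++)
        if (is_prime(n))
            count++;
    return count;
}
```

Calling `count_primes(3000000)` should reproduce the count of 216816 quoted on the slide. The 1-thread OpenMP row in the table is slower than the serial run, which is consistent with the runtime adding some overhead even when no parallelism is gained.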

Sun Tech Days / Sun Studio - # 25

Race Conditions: Tough Parallel Issues

    for (i = 0; i < n; i++)
        a[i] = a[i+1] + b[i];

Thread 1
    a[0] = a[1] + b[0];
    a[1] = a[2] + b[1];
    a[2] = a[3] + b[2];
    a[3] = a[4] + b[3];
    a[4] = a[5] + b[4];

Thread 2
    a[5] = a[6] + b[5];
    a[6] = a[7] + b[6];
    a[7] = a[8] + b[7];
    a[8] = a[9] + b[8];
    a[9] = a[10] + b[9];

Thread 1 executes iterations 0-4; Thread 2 executes iterations 5-9.
a[5] could be written by Thread 2 before it is read by Thread 1.

This is a Data Race condition.

Sequential execution: results are deterministic
Parallel execution: results are non-deterministic
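
One way to remove this particular race is to read from a snapshot of the array, so that no thread reads an element another thread may already have overwritten. The sketch below is an illustration under that assumption; the function names are hypothetical and the slides do not prescribe this fix.

```c
#include <stdlib.h>
#include <string.h>

/* Sequential reference. a must hold n + 1 elements, b must hold n. */
void shift_add_serial(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i + 1] + b[i];
}

/* Race-free parallel variant: every iteration reads the old values from a
   private copy, so the order in which threads write a[] no longer matters.
   Returns 0 on success, -1 if the copy cannot be allocated. */
int shift_add_parallel(double *a, const double *b, size_t n)
{
    double *old = malloc((n + 1) * sizeof *old);
    if (old == NULL)
        return -1;
    memcpy(old, a, (n + 1) * sizeof *old);

    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        a[i] = old[i + 1] + b[i];

    free(old);
    return 0;
}
```

The cost is one extra array and a copy, which trades memory for determinism. The sequential loop gives the same answer because iteration i reads a[i+1] before any later iteration has written it.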

Sun Tech Days / Sun Studio - # 26

Design Practice to Avoid Races

> Adopt a higher design abstraction (OpenMP today, but this area will change in the future)
> Use pass-by-value instead of pass-by-pointer to communicate between threads
> Design the data structure to limit global variable usage and restrict access to shared memory
> Analyze a race problem to decide if it is a harmful program bug or a benign race
> Understand and fix the real cause of a race condition instead of fixing its symptom

Sun Tech Days / Sun Studio - # 27

Productivity: Build applications faster

Sun Tech Days / Sun Studio - # 28

Integrated Graphical Environment

> Based on NetBeans open source IDE
> Debugger and Performance Analyzer GUIs
> Code editor with syntax highlighting and code folding
> Compile error hyperlinks to source code lines
> Wizard for creating makefiles
> GUI layout editor / designer with X-Designer
> Highly configurable

Sun Tech Days / Sun Studio - # 29

Debugger and Performance Analyzer

World's best debugger: dbx
> Debug optimized, threaded, or OpenMP parallelized code
> Graphical, point-and-click interface
> Rich, programmable event-triggered actions
> Supports C, C++, Fortran and Java
Best Observability Tool: Performance Analyzer
> Easy to use GUI, works with unmodified binaries, low overhead
> Offers performance data at statement, instruction, routine level
> Compiler Commentary on optimizations
> Supports OpenMP, MPI, and Pthreads parallelization
> DataSpace Profiling and hardware counter data
> Supports C, C++, Fortran and Java

Sun Tech Days / Sun Studio - # 30

Faster Builds with Distributed make

Distributes build across a number of processes or a number of servers
> Configuration file defines groups, #jobs in each group, each machine in the group
> Same syntax as make (different from gmake)
> Communicates with compilers to maintain up-to-date dependencies (.KEEP_STATE)
#jobs dispatched scales with #CPUs and #nodes
> 3.6x improvement on Sol 9 (12:22 hours to 3:19 hours) for 4 CPUs
Automatic adjustment of #parallel jobs
Sun GRID Engine support and integration

Sun Tech Days / Sun Studio - # 31

Specialized Tools for Difficult Problems

RunTime Checking for memory leak issues:
> Out of bounds access checks, memory leaks, memory usage
Fix and Continue for quick recompile and reload (without restarting debugging session)
Libgc: C/C++ garbage collector for memory allocation and heap management
Secure Lint for checking typical programming errors that impact security (e.g., buffer overflow)

Sun Tech Days / Sun Studio - # 32

Sun Developer Community and Training

Sun Tech Days / Sun Studio - # 33

Join the Sun Developer Community

SDN membership gives you exclusive benefits:
> Free developer tools
> Discounts for training, support, books, and hardware
> Access to technical content from Sun Tech Days and JavaOne Online
> Participation in forums

http://developers.sun.com

Sun Tech Days / Sun Studio - # 34

Developer Services

Need help?
> Developer email support for Solaris Developer Express, Sun Studio, Java, and Java developer tools available
Also:
> Sun Developer Service Plans for Small to Medium Size Businesses
> Java Multi-Platform support for Enterprise developers and deployments

http://developers.sun.com/services

Sun Tech Days / Sun Studio - # 35

Sun Learning Services

Training on Software, Servers, Storage, and Services
> Solaris 10 Training, Java EE 5 Training
Top 5 Industry-Recognized Certifications
> Solaris System Admin, Network Admin, Security, Java Programmer, Developer
Certified developers are paid 15%+ more in salary
Trained employees reduce system downtime by as much as 49%
Sun Studio Web-based course is available NOW!
http://www.sun.com/training/catalog/courses/WP-100-S10.xml

http://www.sun.com/training

Sun Tech Days / Sun Studio - # 36

Performance, Parallelism, Productivity, ... and more

> Popular GCC extensions
> Threaded Debugger (dbx), dmake
> Memory Leak Detection/Analysis (RTC)
> Thread Profiling, Thread Analysis
> NetBeans-based IDE
> Binary Compatibility over 10 releases
> Tested with 400+ Open Source apps
> Community, Support, Training
> SPARC, x86, x64 (AMD64, EM64T)
> Solaris and Linux: same source, components, features

developers.sun.com/sunstudio

Sun Tech Days / Sun Studio - # 37

New Tools for Multi-core Development

To Do List

1. Visit the Sun Studio Portal @ http://developers.sun.com/sunstudio
√ Downloads, email forums, support, training, previews of new features, technical articles, etc.

2. Try Sun Studio 12: see how much it improves performance / throughput on the new UltraSPARC, Opteron, and Intel systems, even on Linux boxes(!)
√ Send us your experience, maybe we'll feature you at: http://developers.sun.com/sunstudio/community/heroes.jsp

Sun Tech Days / Sun Studio - # 38

Thank You!

Don Kretsch
Senior Director, Sun Developer Tools
Sun Microsystems

Sun Tech Days / Sun Studio - # 39

Sun Studio Performance Tuning CookBook

Sun Tech Days / Sun Studio - # 40

Experiences from Tunathons ...

Program run to understand application performance, instead of focusing on standard benchmarks
Between 40 and 80 ISV or performance critical applications are considered for tuning and analysis
Goals are to speed up apps, identify compiler enhancements, and provide feedback for future system designs
Opportunities range from:
> Simple: find the best option, upgrade to new compiler
> Easy: simple source change, found by simple analysis
> Moderate: use of several analyzers, rewrites in assembly
> Difficult: complex analysis + tuning

Sun Tech Days / Sun Studio - # 41

Methodology / Tools Used

Ensure Best Builds:
> Latest Compiler
> Optimization flags
> Profile feedback
> Insert #pragmas
Identify Hot Spots:
> gprof (function timings)
> tcov (line counts)
> analyzer (many stats)
Check Libraries Used:
> optimized math libs
> libsunperf
> medialib
> Write special routines?
Get Execution Stats:
> cputrack (perf counters)
> lockstat (lock contention)
> trapstat (traps)
Study and rewrite source as appropriate
Study and rewrite assembly as appropriate

Sun Tech Days / Sun Studio - # 42

Changes that impact App Performance

1) Trading some behavior to get speed
2) Exploiting knowledge of the deployment environment
3) Exploiting knowledge of program characteristics
4) Source code changes
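
Item 1 is not elaborated on this slide. One plausible reading, given the FP arithmetic simplifications mentioned for -fast and -fsimple on the next slide, is that an optimizer may be allowed to reorder floating-point operations. The sketch below, with hypothetical function names, shows why that changes behavior: floating-point addition is not associative.

```c
/* The volatile temporaries force each intermediate sum to be rounded to
   double, so the two functions really do evaluate in different orders. */

/* (a + b) + c */
double sum_left_first(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c) */
double sum_right_first(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first order yields 1.0 and the second yields 0.0, because 1.0 is below the rounding granularity of a double near 1e20. A compiler that is permitted to reassociate may pick either order, which is the behavior being traded for speed.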

Sun Tech Days / Sun Studio - # 43

Compiler Options for Performance

-xO1 thru -xO5 (default is no opt, -O implies -xO3)
-fast: easy to use, best performance on most code, but it assumes compile platform = run platform and makes FP arithmetic simplifications
Understand program behavior and assert to optimizer:
> -xrestrict: if only restricted pointers are passed to functions
> -xalias_level: if pointers behave in certain ways
> -fsimple: if FP arithmetic can be simplified
Target machine-related:
> -xprefetch, -xprefetch_level
> -xtarget=, -xarch=, -xcache=, -xchip=
> -xvector: converts DO loops into vector instr/calls
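
The "restricted pointers" that -xrestrict refers to can also be declared per parameter in source with the C99 `restrict` qualifier. The sketch below uses a hypothetical function, `scale_add`, to show the promise being made: the caller guarantees that `x` and `y` never overlap, so the optimizer is free to load, reorder, or vectorize without rechecking memory after each store.

```c
#include <stddef.h>

/* y[i] += k * x[i] for n elements.
   restrict is a promise from the caller, not something the compiler checks:
   passing overlapping arrays would make the behavior undefined. */
void scale_add(size_t n, double k,
               const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += k * x[i];
}
```

Annotating individual parameters is narrower than a file-wide flag, so it may be the safer choice when only some functions satisfy the no-overlap condition.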

Sun Tech Days / Sun Studio - # 44

Compiler Options for Performance

Advanced Compiler options
> -xprofile: profile-feedback enabled optimizations
> -xcrossfile, -xipo: performs crossfile/interprocedural optimizations
> -xautopar: enable automatic parallelization
> -xdepend: performs dependence analysis
Use optimized math libraries
> Sun Performance library for algebraic functions
> Vectorized math routines (libmvec)
> Inline (libmil) and optimized math (libmopt)
> Value-added math library (libsunmath)

Sun Tech Days / Sun Studio - # 45

Source Code Changes

Improve usage of data cache, TLB, register windows
> Use VIS instruction (templates) directly (via -xvis)
> Optimize data alignment (also: #pragma align)
> Prevent Register Window Overflow
Creating inline assembly templates for performance critical routines
Loop Optimizations that compilers may miss:
> Restructuring for pipelining and prefetching
> Loop splitting/fission
> Loop Peeling
> Loop interchange
> Loop unrolling and tiling
> Pragma directed
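
Of the loop optimizations listed, loop interchange is the simplest to show by hand. The sketch below sums a row-major C array two ways; the array dimensions and function names are arbitrary choices for illustration.

```c
#include <stddef.h>

#define ROWS 64
#define COLS 48

/* Before: the inner loop walks down a column, so consecutive accesses are
   COLS doubles apart in memory and make poor use of each cache line. */
double sum_column_first(double m[ROWS][COLS])
{
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

/* After interchange: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory. */
double sum_row_first(double m[ROWS][COLS])
{
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}
```

Both functions visit the same elements; only the traversal order differs. For general floating-point data the two sums can differ in the last bits because the additions happen in a different order, which is one reason a compiler may decline to make this change on its own.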

Sun Tech Days / Sun Studio - # 46

Gains from Tuning Categories

Tuning Category                          Typical Range of Gain
Source Change                            25-100%
Compiler Flags                           5-20%
Use of libraries                         25-200%
Assembly coding / tweaking               5-20%
Manual prefetching                       5-30%
TLB thrashing/cache                      20-100%
Using vis/inlines/micro-vectorization    100-200%