1
. BioPerf is a suite of representative applications that we have assembled from the computational biology community, where the codes are carefully selected to span a breadth of algorithms and performance characteristics. We have analyzed the complexity of these codes, at the instruction and memory level, using “live” and aggregate data on contemporary high-performance architectures (Apple G5 with the IBM PowerPC 970), and on the IBM cycle- accurate simulator Mambo, previously used to design supercomputers such as IBM p- Series and BlueGene, and currently being used to model future systems. Hence, our work is novel in that it is one of the first efforts to incorporate life science application performance for optimizing high-end computer system architectures. Through dual-platform performance analysis, we offer system design parameters for machine configurations that may improve the performance of these codes. We target this suite for impact to both biologists and computer scientists for the evaluation of systems running bioinformatics applications. Incorporating Life Science Applications into the Architectural Optimizations of Next-Generation Petaflops Systems Introduction Performance Analysis BioPerf David A. Bader Georgia Institute of Technology Vipin Sachdeva University of New Mexico The following codes are included in BioPerf Package Executable BLAST blastn, blastp FASTA fasta34_t, ssearch34_t CLUSTALW clustalw, clustalw_smp HMMER hmmsearch, hmmpfam T-COFFEE tcoffee GLIMMER glimmer, glimmer-package PHYLIP dnapenny, promlk GRAPPA grappa CE ce PREDATOR predator 0 200 400 600 800 1000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.00 0.01 0.02 0.03 0.04 0.05 0.06 Instr. com pleted/cycle L1d M issRate L1d M issRate 0 200 400 600 800 1000 1200 1400 1600 1800 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 0.00 0.01 0.02 0.03 0.04 0.05 InstructionsCom pleted/Cycle L1D Cache M isses InstructionsCom pleted/Cycle 0 200 400 600 800 1000 1200 1400 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 Instr. Com pleted/Cycle L1D Cache M issRate Blast, hmmpfam, and tcoffee performance graphs (from left to right) Instructions per cycle increases in the same cycle that the L1 data miss rate decreases. We can thus correlate the performance of the application as it varies, with the system metrics impacting it. Methodology Performance Analysis through live data Cumulative Metrics Separated Regions of Performance Performance Analysis through live data 6.0e+8 8.0e+8 1.0e+9 1.2e+9 1.4e+9 1.6e+9 1.8e+9 2.0e+9 2.2e+9 2.4e+9 0 5e+4 1e+5 2e+5 2e+5 3e+5 3e+5 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 IN ST_TYPE_AR ITH L1D_CACHE_M ISS_USER 2.20e+9 2.22e+9 2.24e+9 2.26e+9 2.28e+9 2.30e+9 2.32e+9 2.34e+9 15000 20000 25000 30000 35000 40000 45000 50000 0 10000 20000 30000 40000 IN ST_TYPE_AR ITH L1D_CACHE_M ISS_USER 2.2e+9 2.4e+9 2.6e+9 2.8e+9 3.0e+9 3.2e+9 3.4e+9 3.6e+9 3.8e+9 0 1e+5 2e+5 3e+5 4e+5 5e+5 0 10000 20000 30000 40000 IN ST_TYPE_AR ITH L1D_CACHE_M ISS_USER ClustalW region I (top), II (bottom left) and III (bottom right) showing differences in algorithmic complexity and memory access pattern. Clustalw’s performance is roughly categorized into three regions. Every sequence is compared against every other sequence by Smith Waterman, a quadratic time complexity dynamic programming algorithm, The neighbor joining method in which comparison score of sequences is used to make a guide tree with the sequences at the leaves of the tree The sequences are combined into a multiple sequence Instruction Profiling 0 500 1000 1500 2000 2500 3000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 0.000 0.005 0.010 0.015 0.020 0.025 0.030 Instr. Com pleted/Cycle Branch M ispredicts/Instr. 0 500 1000 1500 2000 2500 3000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 0.00 0.05 0.10 0.15 0.20 0.25 Instr.Completed/Cycle L1d M iss Rate Clustalw’s livegraphs with L1d miss rate and branch mispredicts (top left and right). Performance of the last phase of clustalw is more closely related to branch mispredicts than L1 data miss rate. www.bioperf.or g

Introduction

Embed Size (px)

DESCRIPTION

Incorporating Life Science Applications into the Architectural Optimizations of Next-Generation Petaflops Systems. The following codes are included in BioPerf Package Executable BLAST blastn, blastp FASTA fasta34_t, ssearch34_t - PowerPoint PPT Presentation

Citation preview

Page 1: Introduction

. BioPerf is a suite of representative applications that we have assembled from the computational biology community, where the codes are carefully selected to span a breadth of algorithms and performance characteristics. We have analyzed the complexity of these codes, at the instruction and memory level, using “live” and aggregate data on contemporary high-performance architectures (Apple G5 with the IBM PowerPC 970), and on the IBM cycle-accurate simulator Mambo, previously used to design supercomputers such as IBM p-Series and BlueGene, and currently being used to model future systems. Hence, our work is novel in that it is one of the first efforts to incorporate life science application performance for optimizing high-end computer system architectures. Through dual-platform performance analysis, we offer system design parameters for machine configurations that may improve the performance of these codes. We target this suite for impact to both biologists and computer scientists for the evaluation of systems running bioinformatics applications.

Incorporating Life Science Applications into the Architectural Optimizations of Next-Generation Petaflops Systems

Introduction Performance Analysis

BioPerf

David A. BaderGeorgia Institute of Technology

Vipin Sachdeva University of New Mexico

The following codes are included in BioPerf

Package ExecutableBLAST blastn, blastpFASTA fasta34_t, ssearch34_tCLUSTALW clustalw, clustalw_smpHMMER hmmsearch, hmmpfamT-COFFEE tcoffeeGLIMMER glimmer, glimmer-packagePHYLIP dnapenny, promlkGRAPPA grappaCE cePREDATOR predator

0 200 400 600 800 1000

0.0

0.2

0.4

0.6

0.8

1.0

1.2

0.00

0.01

0.02

0.03

0.04

0.05

0.06

Instr. completed/cycle L1d Miss Rate

L1d

Mis

s R

ate

0 200 400 600 800 1000 1200 1400 1600 1800

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

0.00

0.01

0.02

0.03

0.04

0.05

Instructions Completed/Cycle L1D Cache Misses

Inst

ruct

ions

Com

plet

ed/C

ycle

0 200 400 600 800 1000 1200 1400

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

0.00

0.01

0.02

0.03

0.04

0.05

0.06

0.07

Instr. Completed/Cycle L1D Cache Miss Rate

Blast, hmmpfam, and tcoffee performance graphs (from left to right)

Instructions per cycle increases in the same cycle that the L1 data miss rate decreases. We can thus correlate the performance of the application as it varies, with the system metrics impacting it.

Methodology Performance Analysis through live data

Cumulative Metrics

Separated Regions of Performance

Performance Analysis through live data

6.0e+8 8.0e+8 1.0e+9 1.2e+9 1.4e+9 1.6e+9 1.8e+9 2.0e+9 2.2e+9 2.4e+90

5e+4

1e+5

2e+5

2e+5

3e+5

3e+5

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

INST_TYPE_ARITH L1D_CACHE_MISS_USER

2.20e+9 2.22e+9 2.24e+9 2.26e+9 2.28e+9 2.30e+9 2.32e+9 2.34e+915000

20000

25000

30000

35000

40000

45000

50000

0

10000

20000

30000

40000

INST_TYPE_ARITH L1D_CACHE_MISS_USER

2.2e+9 2.4e+9 2.6e+9 2.8e+9 3.0e+9 3.2e+9 3.4e+9 3.6e+9 3.8e+9

0

1e+5

2e+5

3e+5

4e+5

5e+5

0

10000

20000

30000

40000

INST_TYPE_ARITH L1D_CACHE_MISS_USER

ClustalW region I (top), II (bottom left) and III (bottom right) showing differences in algorithmic complexity and memory access pattern.

Clustalw’s performance is roughly categorized into three regions.• Every sequence is compared against every other sequence by Smith Waterman, a quadratic time complexity dynamic programming algorithm,• The neighbor joining method in which comparison score of sequences is used to make a guide tree with the sequences at the leaves of the tree• The sequences are combined into a multiple sequence alignment according to the guide tree.

Instruction Profiling

0 500 1000 1500 2000 2500 3000

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

0.000

0.005

0.010

0.015

0.020

0.025

0.030

Instr. Completed/Cycle Branch Mispredicts/Instr.

0 500 1000 1500 2000 2500 3000

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

0.00

0.05

0.10

0.15

0.20

0.25

Instr. Completed/Cycle L1d Miss Rate

Clustalw’s livegraphs with L1d miss rate and branch mispredicts (top left and right).

Performance of the last phase of clustalw is more closely related to branch mispredicts than L1 data miss rate.

www.bioperf.org