Rethinking Parallel Execution


Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta)
University of Wisconsin-Madison

Outline

• From sequential to multicore
• Reminiscing: Instruction Level Parallelism (ILP)
• Canonical parallel processing and execution
• Rethinking canonical parallel execution
• Dynamic Serialization
• Consequences of Dynamic Serialization
• Wrap up

April 27, 2010, Mason Wells

Microprocessor Generations

• Generation 1: Serial
• Generation 2: Pipelined
• Generation 3: Instruction-level Parallel (ILP)
• Generation 4: Multiple processing cores


Microprocessor Generations


Gen 1: Sequential (1970s)
Gen 2: Pipelined (1980s)
Gen 3: ILP (1990s)
Gen 4: Multicore (2000s)

From One Generation to Next

• Significant debate and research
  – New solutions proposed
  – Old solutions adapt in interesting ways to become viable or even better than new solutions
• Solutions that involve changes “under the hood” end up winning over others


• From Sequential to Pipelined
  – RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC) vs. CISC (Intel x86)
  – CISC architectures learned and employed RISC innovations
• From Pipelined to Instruction-Level Parallel
  – Statically scheduled VLIW/EPIC
  – Dynamically scheduled superscalar


• From ILP to Multicore
  – Parallelism based upon canonical parallel execution model
  – Overcome constraints to canonical parallelization
    • Thread-level speculation (TLS)
    • Transactional memory (TM)

Reminiscing about ILP
• Late 1980s to mid 1990s
• Search for “post RISC” architecture
  – More accurately, instruction processing model
• Desire to do more than one instruction per cycle: exploit ILP
• Majority school of thought: VLIW/EPIC
• Minority: out-of-order (OOO) superscalar

VLIW/EPIC School
• Parallel execution requires a parallel ISA
• Parallel execution determined statically (by compiler)
• Parallel execution expressed in static program
• Take program/algorithm parallelism and mold it to given execution schedule for exploiting parallelism

VLIW/EPIC School
• Creating effective parallel representations (statically) introduces several problems
  – Predication
  – Statically scheduling loads
  – Exception handling
  – Recovery code
• Lots of research addressing these problems
• Intel and HP pushed it as their future (Itanium)

OOO Superscalar
• Create dynamic parallel execution from sequential static representation
  – Dynamic dependence information accurate
  – Execution schedule flexible
• None of the problems associated with trying to create a parallel representation statically
• Natural growth path with no demands on software

Lessons from ILP Generation
• Significant consequences of trying to statically detect and express parallelism
• Techniques that make “under the hood” changes are the winners
  – Even though they may have some drawbacks/overheads

The Multicore Generation
• How to achieve parallel execution on multiple processors?
• Solution critical to the long-term health of the computer and information technology industry
• And thus the economy and society as we know it

The Multicore Generation
• How to achieve parallel execution on multiple processors?
• Over four decades of conventional wisdom in parallel processing
  – Mostly in the scientific application/HPC arena
  – Use this as basis

Parallel Execution Requires a Parallel Representation

Canonical Parallel Execution Model
A: Analyze program to identify independence in program
  – Independent portions executed in parallel
B: Create static representation of independence
  – Synchronization to satisfy independence assumption
C: Dynamic parallel execution unwinds as per static representation
  – Potential consequences due to static assumptions

Canonical Parallel Execution Model
• Like VLIW/EPIC, canonical model creates a variety of problems that have led to a vast body of research
  – Identifying independence
  – Creating static representation
  – Dynamic unwinding

Identifying Independence

• Static program analysis
  – Over four decades of work
• Hard to identify statically
  – Inherently dynamic properties
  – Must be conservative statically

• Need to identify dependence in order to identify independence


Creating Static Representation

• Parallel representation for guaranteed independent work

• Insert synchronization for potential dependences
  – Conservative synchronization moves parallel execution towards sequential execution


Dynamic Unwinding

• Non-determinism
  – Changes to program state may not be repeatable
• Race conditions
• Several startup companies to deal with this problem
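A minimal sketch of the non-determinism point, written with standard C++ threads rather than anything shown in the talk. Two threads append their id to a shared log. The lock removes the data race, yet the order of the entries still depends on scheduling, so two runs on the same input can leave different program state. The name `run_once` is hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Two threads append their id to a shared log. The mutex prevents a data
// race, but the interleaving of ids in the log depends on thread scheduling.
std::vector<int> run_once(int appends_per_thread) {
    assert(appends_per_thread >= 0);
    std::vector<int> log;
    std::mutex m;
    auto worker = [&](int id) {
        for (int k = 0; k < appends_per_thread; ++k) {
            std::lock_guard<std::mutex> guard(m);
            log.push_back(id);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return log;  // the size is fixed; the ordering of 0s and 1s is not
}
```

Calling `run_once` repeatedly always yields a log of the same length, but the sequence of ids may differ between runs, which is what makes such executions hard to reproduce and debug.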


Conventional Wisdom

Parallel Execution Requires a Parallel Representation

Consequences:
• Must create parallel representation
• For correct execution, must statically identify:
  – Independence for parallel representation
  – Dependence for synchronization
• Source of enormous difficulty and complexity
  – Generally functions of input to program
  – Inherently dynamic properties


Current Approaches

• Stick with canonical model and try to overcome limitations

• Thread Level Speculation (TLS) and Transactional Memory (TM)

• Techniques to allow programmer to program sequentially but automatically generate parallel representation

• Techniques to handle non-determinism and race conditions.


TLS and TM

• Overcome major constraint to creating static parallel representation

• Likely in several upcoming microprocessors
  – Our work in mid 1990s will be key enabler
• Already in Sun MAJC, NEC Merlot, Sun Rock


Static Program Representation

Issues              Sequential   Parallel
Bugs                Yes          Yes (more)
Data races          No           Yes
Locks/Synch         No           Yes
Deadlock            No           Yes
Nondeterminism      No           Yes
Parallel Execution  ?            Yes


• Can we get parallel execution without a parallel representation? Yes

• Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes

Serialization Sets: What?

• Sequential program representation and dynamic parallel execution
  – No static representation of independence
  – No locks and no explicit synchronization
• “Under the hood” run time system dynamically determines and orders dependent computations
  – Independence and thus parallelism falls out as a side effect
• Comparable or better performance than conventional parallel models


How? Big Picture
• Write program in good object-oriented style
  – Method operates on data of associated object (ver. 1)
• Identify parts of program for potential parallel execution
  – Make suitable annotations as needed
• Dynamically determine data object touched by selected code
  – Identify dependence
• Program thread assigns selected code to bins


How? Big Picture
• Serialize computations to same object
  – Enforce dependence
  – Assign them to same bin; delegate thread executes computations in same bin sequentially
• Do not look for/represent independence
  – Falls out as an effect of enforcing dependence
  – Computations in different bins execute in parallel
• Updates to given state in same order as in sequential program
  – Determinism
  – No races
  – If sequential execution is correct, parallel execution is correct (same input)
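The bin idea can be sketched with standard C++ threads. This is an illustration under stated assumptions, not the runtime from the talk: the class names `Bin` and `Runtime`, the hashing of the object address to choose a bin, and the `end_nest` barrier implementation are all hypothetical. Each bin is a FIFO drained by one delegate thread, so operations on the same object run in program order while different bins proceed in parallel.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One bin: a FIFO queue drained sequentially by its own delegate thread.
class Bin {
public:
    Bin() : worker_([this] { run(); }) {}
    ~Bin() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        worker_.join();
    }
    void enqueue(std::function<void()> op) {
        { std::lock_guard<std::mutex> g(m_); q_.push_back(std::move(op)); }
        cv_.notify_all();
    }
    // Block until every operation enqueued so far has finished.
    void drain() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return q_.empty() && !busy_; });
    }
private:
    void run() {
        std::unique_lock<std::mutex> g(m_);
        for (;;) {
            cv_.wait(g, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;  // shutting down and nothing left to do
            auto op = std::move(q_.front());
            q_.pop_front();
            busy_ = true;
            g.unlock();
            op();                    // executed outside the lock, in FIFO order
            g.lock();
            busy_ = false;
            cv_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> q_;
    bool busy_ = false;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

class Runtime {
public:
    explicit Runtime(std::size_t num_delegates) {
        assert(num_delegates > 0);
        for (std::size_t i = 0; i < num_delegates; ++i)
            bins_.push_back(std::make_unique<Bin>());
    }
    // Same object -> same bin -> program order. Independence is never
    // computed; it falls out because other objects may land in other bins.
    void delegate(const void* object, std::function<void()> op) {
        std::size_t b = std::hash<const void*>{}(object) % bins_.size();
        bins_[b]->enqueue(std::move(op));
    }
    void end_nest() {  // barrier: wait for every bin to empty
        for (auto& b : bins_) b->drain();
    }
private:
    std::vector<std::unique_ptr<Bin>> bins_;
};
```

Because the program thread enqueues operations in sequential program order and each bin preserves that order, the updates to any one object are repeatable from run to run, regardless of how the delegate threads are scheduled.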


Big Picture

[Figure: the program thread handing work to Delegate Thread 0, Delegate Thread 1, and Delegate Thread 2]

Serialization Sets: How?
• Sequential program with annotations
  – Identify potentially independent methods
  – Associate serializers with objects to express dependence
• Serializer groups dependent method invocations into a serialization set
  – Runtime executes in order to honor dependences
• Independent method invocations in different sets
  – Runtime opportunistically parallelizes execution


Example: Debit/Credit Transactions


    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        account_t* account = trans->account;

        if (trans->type == DEPOSIT)
            account->deposit (trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw (trans->amount);
    }

Several static unknowns!
• # of transactions?
• Points to?
• Loop-carried dependence?

Multithreading Strategy


    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        account_t* account = trans->account;

        if (trans->type == DEPOSIT)
            account->deposit (trans->amount);
        else if (trans->type == WITHDRAW)
            account->withdraw (trans->amount);
    }

1) Read all transactions into an array
2) Divide chunks of array among multiple threads

Oblivious to what accounts each thread may access!
→ Methods must lock account to ensure mutual exclusion
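The two-step strategy above can be sketched in standard C++ as follows. This is an illustrative sketch, not code from the talk: the types `Account` and `Trans`, the function `process_chunked`, and the use of one `std::mutex` per account are assumptions. Since a chunk may contain transactions for any account, every method takes the account's lock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

enum class TransType { Deposit, Withdraw };

struct Account {
    long balance = 0;
    std::mutex lock;  // needed because any thread may touch any account
    void deposit(long amount) {
        std::lock_guard<std::mutex> guard(lock);
        balance += amount;
    }
    void withdraw(long amount) {
        std::lock_guard<std::mutex> guard(lock);
        balance -= amount;
    }
};

struct Trans {
    Account* account;
    TransType type;
    long amount;
};

// Step 1 is assumed done: all transactions are already in `all`.
// Step 2: give each thread one contiguous chunk of the array.
void process_chunked(const std::vector<Trans>& all, std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t chunk = (all.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(all.size(), t * chunk);
        const std::size_t end = std::min(all.size(), begin + chunk);
        threads.emplace_back([&all, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                const Trans& tr = all[i];
                if (tr.type == TransType::Deposit)
                    tr.account->deposit(tr.amount);
                else
                    tr.account->withdraw(tr.amount);
            }
        });
    }
    for (auto& th : threads) th.join();
}
```

The locks keep each balance update atomic, but two transactions on the same account that fall into different chunks may be applied in either order. For plain additions and subtractions the final balance is unaffected; for operations whose outcome depends on order, such as a withdrawal that checks for sufficient funds, the result could differ from the sequential program.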

Example with Serialization Sets

    private <account_t> private_account_t;    // Declare wrapped account type

    begin_nest ();                            // Initiate nesting level
    trans_t* trans;
    while ((trans = get_trans ()) != NULL) {
        private_account_t* account = trans->account;

        // Delegate indicates potentially-independent operations
        if (trans->type == DEPOSIT)
            account->delegate(deposit, trans->amount);
        else if (trans->type == WITHDRAW)
            account->delegate(withdraw, trans->amount);
    }
    end_nest ();                              // End nesting level, implicit barrier

At execution, delegate:
1) Creates method invocation structure
2) Gets serializer pointer from base class
3) Enqueues invocation in serialization set
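The three steps listed for delegate can be sketched as follows. This is a hedged illustration, not the Prometheus implementation: `Serializer`, `PrivateWrapper`, `run_all`, and `get` are hypothetical names, the serializer is held as a member pointer rather than obtained from a base class, and `run_all` stands in for what a delegate thread would do.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Stand-in for a serialization set: a FIFO of pending invocations.
struct Serializer {
    std::deque<std::function<void()>> pending;
    void run_all() {  // what a delegate thread would do for this set
        while (!pending.empty()) {
            auto op = std::move(pending.front());
            pending.pop_front();
            op();
        }
    }
};

// Hypothetical wrapper in the spirit of the slide's wrapped account type.
// The wrapper must outlive any invocation it has enqueued.
template <typename T>
class PrivateWrapper {
public:
    explicit PrivateWrapper(Serializer& s) : serializer_(&s) {}

    template <typename... Params, typename... Args>
    void delegate(void (T::*method)(Params...), Args&&... args) {
        assert(method != nullptr);
        // 1) Create the method invocation structure (arguments are copied).
        std::function<void()> invocation =
            [this, method, args...]() { (object_.*method)(args...); };
        // 2) Get the serializer associated with this object.
        Serializer* s = serializer_;
        // 3) Enqueue the invocation in that serialization set.
        s->pending.push_back(std::move(invocation));
    }
    const T& get() const { return object_; }

private:
    T object_{};
    Serializer* serializer_;
};

// Example wrapped type.
struct Account {
    long balance = 0;
    void deposit(long amount) { balance += amount; }
    void withdraw(long amount) { balance -= amount; }
};
```

With this sketch, `delegate` returns immediately and the wrapped object is only modified when the serialization set is drained, in the order the invocations were enqueued.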

[Figure: animation of the loop above. In the program context, the program thread issues a sequence of delegate calls that place each transaction into a per-account serialization set (SS #100, SS #200, SS #300): deposit acct=100 $2000; withdraw acct=300 $350; withdraw acct=200 $1000; withdraw acct=100 $50; withdraw acct=100 $20; deposit acct=100 $300. In the delegate context, delegate threads (Delegate 0, Delegate 1) execute the operations of each set in order.]

Race-free, determinate execution without synchronization!

Prometheus: C++ Library for SS

• Template library
  – Compile-time instantiation of SS data structures
  – Metaprogramming for static type checking
• Runtime orchestrates parallel execution
• Portable
  – x86, x86_64, SPARC V9
  – Linux, Solaris


Prometheus Runtime

• Version 1.0
  – Dynamically extracts parallelism
  – Statically scheduled
  – No nested parallelism
• Version 2.0
  – Dynamically extracts parallelism
  – Dynamically scheduled
    • Work-stealing scheduler
  – Supports nested parallelism
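For readers unfamiliar with work stealing, here is a generic, mutex-based sketch of the idea. It is not the Prometheus 2.0 scheduler, and it makes no attempt to preserve serialization-set ordering; `WorkQueue` and `run_with_stealing` are hypothetical names. Each worker takes tasks from the back of its own queue and, when that is empty, steals from the front of another worker's queue. The sketch assumes tasks never create new tasks, which keeps termination simple.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct WorkQueue {
    std::mutex m;
    std::deque<std::function<void()>> tasks;

    bool pop_back(std::function<void()>& out) {     // owner's end
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        out = std::move(tasks.back());
        tasks.pop_back();
        return true;
    }
    bool steal_front(std::function<void()>& out) {  // thief's end
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        out = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
};

// Runs one worker thread per queue and returns the number of tasks executed.
std::size_t run_with_stealing(std::vector<WorkQueue>& queues) {
    assert(!queues.empty());
    std::atomic<std::size_t> executed{0};
    std::vector<std::thread> workers;
    for (std::size_t self = 0; self < queues.size(); ++self) {
        workers.emplace_back([&queues, &executed, self] {
            std::function<void()> task;
            for (;;) {
                bool found = queues[self].pop_back(task);
                // Own queue empty: try to steal from each other worker.
                for (std::size_t v = 0; v < queues.size() && !found; ++v)
                    if (v != self) found = queues[v].steal_front(task);
                if (!found) return;  // no work anywhere; tasks never spawn more
                task();
                ++executed;
            }
        });
    }
    for (auto& w : workers) w.join();
    return executed.load();
}
```

Even if all tasks start in one queue, idle workers pull work from it, which is the load-balancing property that a dynamically scheduled runtime relies on.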


Network Packet Classification

    packet_t* packet;
    classify_t* classifier;
    vector<int> ruleCount(num_rules);
    vector<packet_queue_t> packet_queues;
    int packetCount = 0;

    for (size_t i = 0; i < packet_queues.size(); i++) {
        while ((packet = packet_queues[i].get_pkt()) != NULL) {
            int ruleID = classifier->softClassify (packet);
            ruleCount[ruleID]++;
            packetCount++;
        }
    }

Example with Serialization Sets

    private <classify_t> private_classify_t;
    vector<private_classify_t> classifiers;
    int packetCount = 0;
    vector<int> ruleCount(numRules, 0);
    int size = packet_queues.size();

    begin_nest ();
    for (int i = 0; i < size; i++) {
        classifiers[i].delegate (&classify_t::softClassify, packet_queues[i]);
    }
    end_nest ();

    // After the barrier, combine each classifier's private counts.
    // (ruleCount += ... is slide shorthand for a per-rule accumulation.)
    for (int i = 0; i < size; i++) {
        ruleCount += classifiers[i].getRuleCount();
        packetCount += classifiers[i].getPacketCount();
    }
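The pattern in this example, private per-classifier counts followed by a reduction after the barrier, can be sketched with standard C++ threads. This is an illustration under assumptions rather than the slide's code: packets are modeled as unsigned integers, the "classification" is a simple modulo, and `Classifier`, `Totals`, and `classify_queues` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in: a packet is an unsigned value, the rule is value % num_rules.
struct Classifier {
    explicit Classifier(std::size_t num_rules) : rule_count(num_rules, 0) {}
    void classify_all(const std::vector<unsigned>& packets) {
        for (unsigned p : packets) {
            ++rule_count[p % rule_count.size()];  // private state: no lock needed
            ++packet_count;
        }
    }
    std::vector<int> rule_count;
    int packet_count = 0;
};

struct Totals {
    std::vector<int> rule_count;
    int packet_count = 0;
};

Totals classify_queues(const std::vector<std::vector<unsigned>>& queues,
                       std::size_t num_rules) {
    assert(num_rules > 0);
    // One classifier per queue, so no two threads share mutable state.
    std::vector<Classifier> classifiers(queues.size(), Classifier(num_rules));
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < queues.size(); ++i)
        threads.emplace_back([&classifiers, &queues, i] {
            classifiers[i].classify_all(queues[i]);
        });
    for (auto& t : threads) t.join();  // plays the role of the barrier

    // Sequential reduction of the private counts.
    Totals total;
    total.rule_count.assign(num_rules, 0);
    for (const Classifier& c : classifiers) {
        for (std::size_t r = 0; r < num_rules; ++r)
            total.rule_count[r] += c.rule_count[r];
        total.packet_count += c.packet_count;
    }
    return total;
}
```

Keeping the counters inside each classifier is what removes the need for locks: the shared totals are only touched by one thread, after all the parallel work has finished.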

Packet Classification (No Locks!)

Network Intrusion Detection
• Very common networking application
• Most common program used: Snort
  – Open source version (like Linux)
  – But also commercial versions (Sourcefire)
• Basic structure of computation also found in many other deep packet inspection applications
  – E.g., packet de-duplication (Riverbed)


Other Applications
• Benchmarks
  – Lonestar, NU-MineBench, PARSEC, Phoenix
• Conventional Parallelization
  – pthreads, OpenMP
• Prometheus versions
  – Port program to sequential C++ program
  – Idiomatic C++: OO, inheritance, STL
  – Parallelize with serialization sets


Statically Scheduled Results


4 Socket AMD Barcelona (4-way multicore) = 16 total cores



Summary
• Sequential program with annotations
  – No explicit synchronization, no locks
• Programmers focus on keeping computation private to object state
  – Consistent with OO programming practices
• Dependence-based model
  – Determinate race-free parallel execution
• Does as well as or better than incumbents, but without their negatives
• Can do things that are very hard for incumbents

