Parallel Real-Time Scheduling from Theory to Implementation


Jing Li
Department of Computer Science, Ying Wu College of Computing
New Jersey Institute of Technology
jingli@njit.edu

Outline

- Why do we demand parallelism?
- My research: exploiting parallelism in real-time systems
- Two examples of my research

Future of Computing Performance

[Figure: 40 Years of Microprocessor Trend Data; clock frequency (MHz, log scale) vs. year, 1970-2020.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

Power Density

- Clock frequency hits the power wall.

[Image: the fastest sloth working at the DMV (from Zootopia).]

Multi-Core Systems

[Figure: 40 Years of Microprocessor Trend Data; number of logical cores and clock frequency (MHz, log scale) vs. year, 1970-2020.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

Multi-Core = Parallelism?

- Concurrent execution of jobs
- Parallel execution of a job

[Figure: jobs (Job 1, Job 2, Job 3) on the cores of a single multi-core machine, contrasting concurrent execution of separate jobs with parallel execution of one job.]

Lane Keeping Assist System in Cars

Since the system interacts with the physical world, its computation must be completed under a time constraint.

Applications Benefit from Intra-Job Parallelism

- Motion planning program in a self-driving car

From Kim et al. [ICCPS13]

Cyber-Physical Systems (CPS)

CPS are built from, and depend upon, the seamless integration of computational algorithms and physical components.

E.g., robotics, drones, autonomous vehicles, etc.

Applications Benefit from Intra-Job Parallelism

- Web search: needs to respond within 100 ms for users to find it responsive.*

[Figure: web search pipeline; a query flows through document index search, second-phase ranking, and snippet generation before a response is returned.]

*Jeff Dean et al. (Google), "The tail at scale." Communications of the ACM 56.2 (2013).

Interactive Cloud Services (ICS)

E.g., web search, online gaming, stock trading, etc.

Real-Time Systems

The performance of these systems depends not only upon their functional aspects, but also upon their temporal aspects.

Real-time performance:
1) Provide hard guarantees of meeting jobs' deadlines (e.g., CPS)
2) Optimize latency-related objectives for jobs (e.g., ICS)

[Figure: jobs (Job 1, Job 2, Job 3) on the cores of a single multi-core machine.]

New Generation of Real-Time Systems

Characteristics:
- New classes of applications with complex functionalities
- Increasing computational demand of each application
- Consolidating multiple applications onto a shared platform
- Rapid increase in the number of cores per chip

Demand: leverage parallelism within the applications to improve real-time performance and system efficiency.

[Figure: jobs (Job 1, Job 2, Job 3) on the cores of a single multi-core machine.]

My research: parallel real-time scheduling from theory to implementation

Outline

- Why do we demand parallelism?
- My research: exploiting parallelism in real-time systems
- Example 1: parallel real-time scheduling for meeting deadlines
- Example 2: parallel real-time scheduling for a target latency

Real-Time Hybrid Simulation (RTHS)

Since the numerical simulation interacts with the physical specimen, its computation must be completed by its deadline.

[Figure: numerical simulation and physical specimen, separated by the cyber-physical boundary.^]

^Robert L. and Terry L. Bowen Large Scale Structures Laboratory at Purdue University

How to Allocate Cores to Multiple Parallel Jobs?

- A real-world system consists of many parallel jobs
- Each job demands different real-time performance
  - E.g., jobs need to meet their deadlines in RTHS
- The goal of parallel real-time scheduling: smartly allocate cores to parallel jobs to meet their deadlines

[Figure: parallel jobs on the cores of a single multi-core machine.]

How to Analyze a Parallel Job?

- An example: computation on an array A[i] with m elements (here, m = 3)

    int i = 0;
    parallel_for (; i < m; i++) {   // iterations may execute in parallel
        compute( A[i] );            // independent work on each array element
    }
    bar();                          // runs only after all iterations finish

Directed Acyclic Graph (DAG) Model

Naturally captures programs generated by parallel languages such as Cilk Plus, Intel TBB, and OpenMP.

Node: sequential computation (a unit node is a single instruction)
Edge: dependence between nodes
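As a hedged sketch (not from the slides), the parallel_for example above could be written in OpenMP, one of the parallel languages the DAG model captures; compute() and bar() are placeholders standing in for the slide's functions:

    // dag_example.c -- compile with: gcc -fopenmp dag_example.c
    #include <stdio.h>

    #define M 3

    // Placeholder for the per-element work from the slide.
    static void compute(int x) { printf("compute(%d)\n", x); }

    // Placeholder for the code that runs after the parallel loop joins.
    static void bar(void) { printf("bar()\n"); }

    int main(void) {
        int A[M] = {10, 20, 30};

        // Each iteration is an independent node of the DAG; the implicit
        // barrier at the end of the loop is the join edge into bar().
        #pragma omp parallel for
        for (int i = 0; i < M; i++) {
            compute(A[i]);
        }

        bar();   // the sink node: depends on all loop iterations
        return 0;
    }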

[Figure: the example DAG executed step by step; at each step, nodes are marked as available, completed, or unavailable.]

Work and Span

Work T1: execution time on one core
Span T∞: execution time on an infinite number of cores (the critical-path length)

For the example DAG: T1 = 18, T∞ = 9.
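As an illustrative sketch (not from the talk), the work and span of a DAG can be computed with one pass over the nodes in topological order; the 4-node fork-join DAG and its node costs below are hypothetical, not the slide's 18-unit example:

    #include <stdio.h>

    #define N 4   /* hypothetical fork-join DAG: 0 -> {1, 2} -> 3 */

    int main(void) {
        /* cost[v] = execution time of node v; edge[u][v] = 1 if u precedes v */
        int cost[N] = {1, 3, 2, 1};
        int edge[N][N] = { {0,1,1,0}, {0,0,0,1}, {0,0,0,1}, {0,0,0,0} };

        int work = 0, span = 0;
        int finish[N];   /* length of the longest path ending at node v */

        /* Nodes are indexed in topological order, so one forward pass suffices. */
        for (int v = 0; v < N; v++) {
            int longest_pred = 0;
            for (int u = 0; u < v; u++)
                if (edge[u][v] && finish[u] > longest_pred)
                    longest_pred = finish[u];
            finish[v] = longest_pred + cost[v];
            work += cost[v];                         /* T1: total work */
            if (finish[v] > span) span = finish[v];  /* T_inf: critical path */
        }

        printf("work T1 = %d, span Tinf = %d\n", work, span);  /* prints 7 and 5 */
        return 0;
    }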

Federated Scheduling (FS)

For parallel tasks, FS has the best bound in terms of schedulability.

FS assigns n_i dedicated cores to each parallel task, where n_i is the minimum number of cores needed for the task to meet its deadline:

    n_i = ⌈ (C_i − L_i) / (D_i − L_i) ⌉

with worst-case work C_i, worst-case span L_i, and deadline D_i (= period) for task i.

[Figure: tasks mapped to dedicated subsets of cores.]
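A minimal sketch of this core-assignment rule in C (the task parameters in main are illustrative, not from the talk):

    /* fs_cores.c -- compile with: gcc fs_cores.c -lm */
    #include <math.h>
    #include <stdio.h>

    // Minimum number of dedicated cores for a parallel task under federated
    // scheduling, given worst-case work C, worst-case span L, and deadline D.
    static int fs_cores(double C, double L, double D) {
        if (D <= L) return -1;                 // infeasible: span exceeds deadline
        return (int)ceil((C - L) / (D - L));
    }

    int main(void) {
        // Illustrative task: work 18, span 9, deadline 12 (arbitrary numbers).
        printf("n_i = %d\n", fs_cores(18.0, 9.0, 12.0));   // ceil(9 / 3) = 3
        return 0;
    }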

FS Platform

- Middleware platform providing the FS service in Linux
- Works with the GNU OpenMP runtime system
- Runs OpenMP programs with minimal modification

[Figure: FS platform architecture; the federated scheduling (FS) layer sits on top of Linux and allocates cores to per-task OpenMP runtimes.]
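The talk does not show implementation code; as a hedged sketch of how dedicated cores might be assigned on Linux (an assumption about the mechanism, not the FS Platform's actual code), a middleware could restrict a task's process to its n_i cores with sched_setaffinity before the OpenMP runtime spawns its worker threads:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    // Hypothetical helper: pin the calling process to cores
    // [first_core, first_core + n_cores); threads created afterwards
    // (e.g., OpenMP workers) inherit this affinity mask.
    static int pin_to_cores(int first_core, int n_cores) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = first_core; c < first_core + n_cores; c++)
            CPU_SET(c, &set);
        return sched_setaffinity(0, sizeof(set), &set);  // pid 0 = calling process
    }

    int main(void) {
        if (pin_to_cores(0, 3) != 0) {    // e.g., a task assigned n_i = 3 cores
            perror("sched_setaffinity");
            return 1;
        }
        /* ... launch the task's OpenMP parallel regions here ... */
        return 0;
    }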

Parallelism Improves RTHS Accuracy

An RTHS simulates a nine-story building with a first-story damper.

- Previously, sequential processing power limited the rate to 575 Hz
- Parallel execution now allows a rate of 3000 Hz
- Reduction in error for acceleration and displacement
- Parallelism increases accuracy via faster actuation and sensing

[Figure: normalized error (%) over time (sec) for sequential execution at 575 Hz vs. parallel execution at 3000 Hz.]

Outline

- Why do we demand parallelism?
- My research: exploiting parallelism in real-time systems
- Example 1: parallel real-time scheduling for meeting deadlines
- Example 2: parallel real-time scheduling for a target latency

System for Interactive Cloud Services

Online system: job arrival times are not known in advance.
Objective: maximize the number of jobs that meet a target latency T.

[Figure: web search pipeline with aggregators; a query flows through document index search, second-phase ranking, and snippet generation.]

Workload Distribution Has a Long Tail

[Figure: Bing search workload; distribution of jobs' sequential execution time (work, in ms), with the target latency marked.]

- Large jobs must run in parallel to meet the target latency
- Always run large jobs with full parallelism?

Parallelize Large Jobs According to Load

Tail-Control Strategy: when load is low, run all jobs in parallel; when load is high, run large jobs sequentially (see the sketch after the figure below).

Latency = Processing Time + Waiting Time
- At low load: processing time dominates latency
- At high load: waiting time dominates latency

[Figure: two three-core schedule timelines against the target latency; one schedule misses 0 requests, the other misses 1 request.]
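As a hedged, illustrative sketch (not the actual Tail-Control implementation inside the TBB runtime, and with hypothetical threshold names and values), the per-job parallelism decision could look like this:

    #include <stdio.h>

    // Hypothetical tail-control decision: pick a job's degree of parallelism
    // from the current load and the job's (estimated) sequential work.
    static int choose_parallelism(double job_work_ms, int queued_jobs,
                                  int num_cores) {
        const double large_job_threshold_ms = 50.0;  // illustrative "large job" cutoff
        const int    high_load_threshold    = 8;     // illustrative "high load" cutoff

        int load_is_high = queued_jobs >= high_load_threshold;
        int job_is_large = job_work_ms >= large_job_threshold_ms;

        if (load_is_high && job_is_large)
            return 1;          // high load: serialize large jobs to protect the rest
        return num_cores;      // otherwise: run with full parallelism
    }

    int main(void) {
        printf("low load, large job  -> %d cores\n", choose_parallelism(120.0, 2, 16));
        printf("high load, large job -> %d core(s)\n", choose_parallelism(120.0, 12, 16));
        return 0;
    }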

The Inner Workings of Tail-Control

We implement the tail-control algorithm in the runtime system of Intel Threading Building Blocks (TBB) and evaluate it on a Bing search workload.

[Figure: step-by-step illustration of the tail-control runtime, compared against default work stealing and the target latency.]

Conclusion

Exploit the untapped efficiency in parallel computing platforms and drastically improve the real-time performance of applications.

- Systems guaranteed to meet deadlines for CPS
  - Develop provably good schedulers for parallel applications
  - Incorporate real-time scheduling into the parallel runtime system
- Systems optimized to meet a target latency for ICS
  - Design and implement strategies to optimize real-time performance
