Chapter 2: Parallel Programs (Capítulo 2: Programas Paralelos)
Gustavo P. Alkmim ([email protected])
MO601 - Computer Architecture II. Prof.: Mario Cortes
Universidade Estadual de Campinas
November 14, 2012
1 / 15
Outline
- Introduction
- Examples of Concurrency
- The Parallelization Process
- Applications
- Conclusion
2 / 15
Introduction
- Until 2004: exponential growth
- Limitations
  - Energy dissipation
  - Less impactful advances
- Solution
  - Many-cores
  - Multi-cores
- Parallel programming matters
  - Architects: build efficient hardware
  - Programmers: improve software performance
3 / 15
Applications: Simulating Ocean Currents
- Goal: simulate the movement of ocean currents
- Continuous physical problem (in space and time)
- Discrete computational solution
- 2003 (NASA): 1.4 years simulated in 1 day on 256 cores
Parallel Programs
92 DRAFT: Parallel Computer Architecture 9/10/97
the ocean basin, each represented by a two-dimensional grid of points (see Figure 2-1). For simplicity, the ocean is modeled as a rectangular basin and the grid points are assumed to be equally spaced. Each variable is therefore represented by a separate two-dimensional array for each cross-section through the ocean. For the time dimension, we discretize time into a series of finite time-steps. The equations of motion are solved at all the grid points in one time-step, the state of the variables is updated as a result, and the equations of motion are solved again for the next time-step, and so on repeatedly.

Every time-step itself consists of several computational phases. Many of these are used to set up values for the different variables at all the grid points using the results from the previous time-step. Then there are phases in which the system of equations governing the ocean circulation are actually solved. All the phases, including the solver, involve sweeping through all points of the relevant arrays and manipulating their values. The solver phases are somewhat more complex, as we shall see when we discuss this case study in more detail in the next chapter.

The more grid points we use in each dimension to represent our fixed-size ocean, the finer the spatial resolution of our discretization and the more accurate our simulation. For an ocean such as the Atlantic, with its roughly 2000 km × 2000 km span, using a grid of 100 × 100 points implies a distance of 20 km between points in each dimension. This is not a very fine resolution, so we would like to use many more grid points. Similarly, shorter physical intervals between time-steps lead to greater simulation accuracy. For example, to simulate five years of ocean movement updating the state every eight hours we would need about 5500 time-steps.

The computational demands for high accuracy are large, and the need for multiprocessing is clear. Fortunately, the application also naturally affords a lot of concurrency: many of the set-up phases in a time-step are independent of one another and therefore can be done in parallel, and the processing of different grid points in each phase or grid computation can itself be done in parallel. For example, we might assign different parts of each ocean cross-section to different processors, and have the processors perform their parts of each phase of computation (a data-parallel formulation).
Figure 2-1: Horizontal cross-sections through an ocean basin, and their spatial discretization into regular grids. (a) Cross-sections; (b) spatial discretization of a cross-section.

Figure: Discretization of the ocean space
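The data-parallel formulation described in the excerpt can be sketched in a few lines. This is a minimal illustration, not the book's actual Ocean code: each worker owns a contiguous band of rows of one cross-section and applies a 5-point averaging sweep over the previous grid; the grid size, worker count, and 0.2 weighting are made-up values.

```python
# One Jacobi-style grid sweep, rows partitioned among workers (data-parallel).
from concurrent.futures import ThreadPoolExecutor

def sweep(grid):
    n = len(grid)
    new = [row[:] for row in grid]          # boundary values stay unchanged

    def update_rows(r0, r1):
        # Each task owns a disjoint band of interior rows, so the
        # concurrent writes to `new` never conflict.
        for i in range(max(r0, 1), min(r1, n - 1)):
            for j in range(1, n - 1):
                new[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                                   + grid[i][j - 1] + grid[i][j + 1])

    nworkers = 4                            # threads stand in for processes here
    band = (n + nworkers - 1) // nworkers
    with ThreadPoolExecutor(nworkers) as pool:
        list(pool.map(lambda b: update_rows(b, b + band), range(0, n, band)))
    return new

grid = [[float(i == j) for j in range(8)] for i in range(8)]
result = sweep(grid)
```

Reading only from `grid` and writing only to `new` is what makes every grid point in a phase independent, exactly the concurrency the excerpt points out.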
4 / 15
Applications: Simulating the Evolution of Galaxies
- Goal: simulate the evolution of galaxies, accounting for interactions with other bodies
- Millions of stars involved
- Solution using the Barnes-Hut method (star approximation)
- Star distribution is irregular and dynamic
2.2.2 Simulating the Evolution of Galaxies
Our second case study is also from scientific computing. It seeks to understand the evolution of stars in a system of galaxies over time. For example, we may want to study what happens when galaxies collide, or how a random collection of stars folds into a defined galactic shape. This problem involves simulating the motion of a number of bodies (here stars) moving under forces exerted on each by all the others, an n-body problem. The computation is discretized in space by treating each star as a separate body, or by sampling to use one body to represent many stars. Here again, we discretize the computation in time and simulate the motion of the galaxies for many time-steps. In each time-step, we compute the gravitational forces exerted on each star by all the others and update the position, velocity and other attributes of that star.
Computing the forces among stars is the most expensive part of a time-step. A simple method to compute forces is to calculate pairwise interactions among all stars. This has O(n²) computational complexity for n stars, and is therefore prohibitive for the millions of stars that we would like to simulate. However, by taking advantage of insights into the force laws, smarter hierarchical algorithms are able to reduce the complexity to O(n log n). This makes it feasible to simulate problems with millions of stars in reasonable time, but only by using powerful multiprocessors.
The basic insight that the hierarchical algorithms use is that since the strength of the gravitational interaction falls off with distance as G·m₁·m₂/r², the influences of stars that are further away are weaker and therefore do not need to be computed as accurately as those of stars that are close by. Thus, if a group of stars is far enough away from a given star, then their effect on the star does not have to be computed individually; as far as that star is concerned, they can be approximated as a single star at their center of mass without much loss in accuracy (Figure 2-2). The further away the stars from a given star, the larger the group that can be thus approximated. In fact, the strength of many physical interactions falls off with distance, so hierarchical methods are becoming increasingly popular in many areas of computing.

The particular hierarchical force-calculation algorithm used in our case study is the Barnes-Hut algorithm. The case study is called Barnes-Hut in the literature, and we shall use this name for it as well. We shall see how the algorithm works in Section 3.6.2. Since galaxies are denser in some regions and sparser in others, the distribution of stars in space is highly irregular.

Figure 2-2: The insight used by hierarchical methods for n-body problems. A group of bodies that is far enough away from a given body may be approximated by the center of mass of the group. The further apart the bodies, the larger the group that may be thus approximated.

Figure: Star approximation
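The Barnes-Hut insight is easy to check numerically. A small one-dimensional sketch with made-up masses and positions (G taken as 1): the pull of a distant group on a target star is nearly identical to the pull of one body of the group's total mass placed at its center of mass.

```python
# Compare exact pairwise gravity against the center-of-mass approximation.
def force(m1, m2, x1, x2):
    # 1-D gravitational attraction with G = 1; sign gives the direction.
    d = x2 - x1
    return m1 * m2 / (d * d) * (1 if d > 0 else -1)

target_m, target_x = 1.0, 0.0
group = [(2.0, 100.0), (3.0, 101.0), (1.0, 99.5)]   # (mass, position), far away

# Exact: sum the pairwise interactions with every body in the group.
exact = sum(force(target_m, m, target_x, x) for m, x in group)

# Approximate: one body of total mass at the group's center of mass.
total_m = sum(m for m, _ in group)
com = sum(m * x for m, x in group) / total_m
approx = force(target_m, total_m, target_x, com)

rel_err = abs(exact - approx) / exact
```

For this distant cluster the relative error is on the order of 0.01%, which is why the further away a group lies, the larger the group that can be collapsed this way.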
5 / 15
The Parallelization Process
- Concepts
  - Task: smallest part of the program that can run on a processor
  - Process: entity capable of executing a task (an abstraction)
  - Processor: physical element that performs the processing
Given these concepts, the job of creating a parallel program from a sequential one consists of four steps, illustrated in Figure 2-3:

1. Decomposition of the computation into tasks,
2. Assignment of tasks to processes,
3. Orchestration of the necessary data access, communication and synchronization among processes, and
4. Mapping or binding of processes to processors.

Together, decomposition and assignment are called partitioning, since they divide the work done by the program among the cooperating processes. Let us examine the steps and their individual goals a little further.
Decomposition

Decomposition means breaking up the computation into a collection of tasks. For example, tracing a single ray in Raytrace may be a task, or performing a particular computation on an individual grid point in Ocean. In general, tasks may become available dynamically as the program executes, and the number of tasks available at a time may vary over the execution of the program. The maximum number of tasks available at a time provides an upper bound on the number of processes (and hence processors) that can be used effectively at that time. Hence, the major goal in decomposition is to expose enough concurrency to keep the processes busy at all times, yet not so much that the overhead of managing the tasks becomes substantial compared to the useful work done.
Figure 2-3: Steps in parallelization, and the relationships among tasks, processes and processors. The decomposition and assignment phases are together called partitioning. The orchestration phase coordinates data access, communication and synchronization among processes, and the mapping phase maps them to physical processors.

Figure: The parallelization process
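The four steps above can be walked through on a toy problem. This is a sketch only, summing an array, with illustrative chunk and worker counts, and threads standing in for processes:

```python
# Toy walk-through of decomposition, assignment, orchestration and mapping.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))

# 1. Decomposition: break the sum into fixed-size chunk tasks.
CHUNK = 100
tasks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# 2. Assignment: the pool hands tasks out to a fixed set of workers.
# 3. Orchestration: the executor moves each chunk to its worker and
#    synchronizes on completion; the final reduction is the only
#    communication the program needs.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, tasks))

# 4. Mapping: binding workers to physical processors is delegated
#    to the operating system's scheduler.
total = sum(partials)
```

Note how decomposition (10 tasks) exposes more concurrency than the 4 workers can use at once, which keeps them busy, matching the decomposition goal stated in the text.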
6 / 15
The Parallelization Process
- Goal: create concurrency
  - Keep processors busy
  - Reduce overhead
The horizontal extent therefore gives us a limit on the achievable speedup with an unlimited number of processors, which is thus simply the average concurrency available in the application over time. A rewording of Amdahl's law may therefore be:

Speedup ≤ (area under concurrency profile) / (horizontal extent of concurrency profile)

Thus, if f_k is the number of X-axis points in the concurrency profile that have concurrency k, then we can write Amdahl's Law as:

Speedup(p) ≤ (Σ_{k=1..∞} f_k · k) / (Σ_{k=1..∞} f_k · ⌈k/p⌉)   (EQ 2.1)

It is easy to see that if the total work Σ f_k · k is normalized to 1 and a fraction s of this is serial, then the speedup with an infinite number of processors is limited by 1/s, and that with p processors is limited by 1/(s + (1−s)/p). In fact, Amdahl's law can be applied to any overhead of parallelism, not just limited concurrency, that is not alleviated by using more processors. For now, it quantifies the importance of exposing enough concurrency as a first step in creating a parallel program.

Figure 2-5: Concurrency profile for a distributed-time, discrete-event logic simulator. The circuit being simulated is a simple MIPS R6000 microprocessor. The y-axis shows the number of logic elements available for evaluation in a given clock cycle.

Figure: Concurrency profile
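The serial-fraction bounds from Amdahl's law are easy to check numerically. A minimal sketch with a made-up serial fraction s:

```python
# Amdahl's law: speedup with p processors when a fraction s of the work
# is serial, plus the limit as p grows without bound.
def amdahl(s, p):
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1                      # example: 10% of the work is serial
limit = 1.0 / s              # bound with infinitely many processors
speedup_16 = amdahl(s, 16)   # bound with 16 processors
```

Even a modest 10% serial fraction caps the 16-processor speedup at 6.4 and the asymptotic speedup at 10, which is why exposing enough concurrency comes first.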
7 / 15
The Parallelization Process

Example: Database Query Processing (COMP 422, Spring 2008, V. Sarkar). The execution of the query can be divided into subtasks in various ways. Each task can be thought of as generating an intermediate table of entries that satisfy a particular clause.

Figure: Task-dependency graph
8 / 15
The Parallelization Process
- Decomposition techniques
  - Recursive decomposition
  - Data decomposition
  - Exploratory decomposition
  - Speculative decomposition
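Recursive decomposition, the first technique listed, can be illustrated with mergesort: every call generates two independent subtasks, each of which could be handed to a different process. A minimal sketch (executed sequentially here for clarity):

```python
# Mergesort as recursive decomposition: the two halves are independent
# tasks and only the merge step needs their results.
def mergesort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = mergesort(xs[:mid])     # independent subtask 1
    right = mergesort(xs[mid:])    # independent subtask 2
    # Merge: the only step that depends on both subtasks.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

out = mergesort([5, 2, 9, 1, 5, 6])
```

Data decomposition, by contrast, would split the input array itself among processes up front, as in the Ocean grid example.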
9 / 15
The Parallelization Process
- Goal: distribute tasks among processes
  - Balance the load
  - Reduce communication and management overhead
- Low dependence on
  - Architecture
  - Programming model
10 / 15
The Parallelization Process
- Goal: communication and synchronization among processes
- High dependence on
  - Architecture
  - Programming model
  - Programming language
    - MPI
    - OpenMP
    - Hybrid
11 / 15
The Parallelization Process
- Goal: assign processes to processors
  - Minimize execution time
  - Maximize resource utilization
  - Minimize computational costs (overhead)
- Done by the OS or by the process itself
12 / 15
The Parallelization Process
Classification of scheduling methods

- Scheduling
  - Local
  - Global
    - Static
      - Optimal
      - Sub-optimal
        - Approximate
        - Heuristic
    - Dynamic
      - Distributed
        - Cooperative
          - Optimal
          - Sub-optimal
            - Approximate
            - Heuristic
        - Non-cooperative
      - Non-distributed

Figure: Casavant and Kuhl taxonomy
13 / 15
Applications: Solving 0–1 ILP
- Goal: find the values of the output variables that minimize/maximize the objective function
- Binary output variables
- Partitioning
  - Decomposition of the solution space
  - Creation of a search tree
- Communication
  - Independent solutions
  - Results sent to a master node, which picks the best one
14 / 15
Conclusion
- Applications exhibit high concurrency
- Important to study for
  - Architects
  - Programmers
- The parallelization process
- Avoid overhead
15 / 15