Chapter 2: Parallel Programs (Capítulo 2: Programas Paralelos)
Gustavo P. Alkmim ([email protected])
MO601 - Computer Architecture II. Prof.: Mario Cortes
Universidade Estadual de Campinas
November 14, 2012
1 / 15
Outline
- Introduction
- Examples of Concurrency
- The Parallelization Process
- Applications
- Conclusion
2 / 15
Introduction
- Until 2004: exponential growth
- Limitations
  - Energy dissipation
  - Less impactful advances
- Solution
  - Many-cores
  - Multi-cores
- Parallel programming matters
  - Architects: build efficient hardware
  - Programmers: improve software performance
3 / 15
Applications: Simulating Ocean Currents
- Goal: simulate the movement of ocean currents
- Continuous physical problem (in space and time)
- Discrete computational solution
- 2003 (NASA): 1.4 years simulated in 1 day on 256 cores
Parallel Programs
92 DRAFT: Parallel Computer Architecture 9/10/97
the ocean basin, each represented by a two-dimensional grid of points (see Figure 2-1). For simplicity, the ocean is modeled as a rectangular basin and the grid points are assumed to be equally spaced. Each variable is therefore represented by a separate two-dimensional array for each cross-section through the ocean. For the time dimension, we discretize time into a series of finite time-steps. The equations of motion are solved at all the grid points in one time-step, the state of the variables is updated as a result, and the equations of motion are solved again for the next time-step, and so on repeatedly.

Every time-step itself consists of several computational phases. Many of these are used to set up values for the different variables at all the grid points using the results from the previous time-step. Then there are phases in which the system of equations governing the ocean circulation are actually solved. All the phases, including the solver, involve sweeping through all points of the relevant arrays and manipulating their values. The solver phases are somewhat more complex, as we shall see when we discuss this case study in more detail in the next chapter.

The more grid points we use in each dimension to represent our fixed-size ocean, the finer the spatial resolution of our discretization and the more accurate our simulation. For an ocean such as the Atlantic, with its roughly 2000 km × 2000 km span, using a grid of 100 × 100 points implies a distance of 20 km between points in each dimension. This is not a very fine resolution, so we would like to use many more grid points. Similarly, shorter physical intervals between time-steps lead to greater simulation accuracy. For example, to simulate five years of ocean movement updating the state every eight hours we would need about 5500 time-steps.

The computational demands for high accuracy are large, and the need for multiprocessing is clear. Fortunately, the application also naturally affords a lot of concurrency: many of the set-up phases in a time-step are independent of one another and therefore can be done in parallel, and the processing of different grid points in each phase or grid computation can itself be done in parallel. For example, we might assign different parts of each ocean cross-section to different processors, and have the processors perform their parts of each phase of computation (a data-parallel formulation).
Figure 2-1: Horizontal cross-sections through an ocean basin, and their spatial discretization into regular grids. (a) Cross-sections; (b) spatial discretization of a cross-section.

Figure: Discretization of the ocean space
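The data-parallel formulation described in the excerpt can be sketched in a few lines. This is a minimal illustration, not the book's actual Ocean code: each worker owns a contiguous band of rows of one cross-section and applies a 5-point averaging sweep over the previous grid; the grid size, worker count, and 0.2 weighting are made-up values.

```python
# One Jacobi-style grid sweep, rows partitioned among workers (data-parallel).
from concurrent.futures import ThreadPoolExecutor

def sweep(grid):
    n = len(grid)
    new = [row[:] for row in grid]          # boundary values stay unchanged

    def update_rows(r0, r1):
        # Each task owns a disjoint band of interior rows, so the
        # concurrent writes to `new` never conflict.
        for i in range(max(r0, 1), min(r1, n - 1)):
            for j in range(1, n - 1):
                new[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                                   + grid[i][j - 1] + grid[i][j + 1])

    nworkers = 4                            # threads stand in for processes here
    band = (n + nworkers - 1) // nworkers
    with ThreadPoolExecutor(nworkers) as pool:
        list(pool.map(lambda b: update_rows(b, b + band), range(0, n, band)))
    return new

grid = [[float(i == j) for j in range(8)] for i in range(8)]
result = sweep(grid)
```

Reading only from `grid` and writing only to `new` is what makes every grid point in a phase independent, exactly the concurrency the excerpt points out.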
4 / 15
Applications: Simulating the Evolution of Galaxies
- Goal: simulate the evolution of galaxies, accounting for interactions with other bodies
- Millions of stars involved
- Solution using the Barnes-Hut method (star approximation)
- Star distribution is irregular and dynamic
2.2.2 Simulating the Evolution of Galaxies
Our second case study is also from scientific computing. It seeks to understand the evolution of stars in a system of galaxies over time. For example, we may want to study what happens when galaxies collide, or how a random collection of stars folds into a defined galactic shape. This problem involves simulating the motion of a number of bodies (here stars) moving under forces exerted on each by all the others, an n-body problem. The computation is discretized in space by treating each star as a separate body, or by sampling to use one body to represent many stars. Here again, we discretize the computation in time and simulate the motion of the galaxies for many time-steps. In each time-step, we compute the gravitational forces exerted on each star by all the others and update the position, velocity and other attributes of that star.
Computing the forces among stars is the most expensive part of a time-step. A simple method to compute forces is to calculate pairwise interactions among all stars. This has O(n²) computational complexity for n stars, and is therefore prohibitive for the millions of stars that we would like to simulate. However, by taking advantage of insights into the force laws, smarter hierarchical algorithms are able to reduce the complexity to O(n log n). This makes it feasible to simulate problems with millions of stars in reasonable time, but only by using powerful multiprocessors.
The basic insight that the hierarchical algorithms use is that since the strength of the gravitational interaction falls off with distance as G·m₁·m₂/r², the influences of stars that are further away are weaker and therefore do not need to be computed as accurately as those of stars that are close by. Thus, if a group of stars is far enough away from a given star, then their effect on the star does not have to be computed individually; as far as that star is concerned, they can be approximated as a single star at their center of mass without much loss in accuracy (Figure 2-2). The further away the stars from a given star, the larger the group that can be thus approximated. In fact, the strength of many physical interactions falls off with distance, so hierarchical methods are becoming increasingly popular in many areas of computing.

The particular hierarchical force-calculation algorithm used in our case study is the Barnes-Hut algorithm. The case study is called Barnes-Hut in the literature, and we shall use this name for it as well. We shall see how the algorithm works in Section 3.6.2. Since galaxies are denser in some regions and sparser in others, the distribution of stars in space is highly irregular.

Figure 2-2: The insight used by hierarchical methods for n-body problems. A group of bodies that is far enough away from a given body may be approximated by the center of mass of the group. The further apart the bodies, the larger the group that may be thus approximated.

Figure: Star approximation
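The Barnes-Hut insight is easy to check numerically. A small one-dimensional sketch with made-up masses and positions (G taken as 1): the pull of a distant group on a target star is nearly identical to the pull of one body of the group's total mass placed at its center of mass.

```python
# Compare exact pairwise gravity against the center-of-mass approximation.
def force(m1, m2, x1, x2):
    # 1-D gravitational attraction with G = 1; sign gives the direction.
    d = x2 - x1
    return m1 * m2 / (d * d) * (1 if d > 0 else -1)

target_m, target_x = 1.0, 0.0
group = [(2.0, 100.0), (3.0, 101.0), (1.0, 99.5)]   # (mass, position), far away

# Exact: sum the pairwise interactions with every body in the group.
exact = sum(force(target_m, m, target_x, x) for m, x in group)

# Approximate: one body of total mass at the group's center of mass.
total_m = sum(m for m, _ in group)
com = sum(m * x for m, x in group) / total_m
approx = force(target_m, total_m, target_x, com)

rel_err = abs(exact - approx) / exact
```

For this distant cluster the relative error is on the order of 0.01%, which is why the further away a group lies, the larger the group that can be collapsed this way.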
5 / 15
The Parallelization Process
- Concepts
  - Task: smallest part of the program that can run on a processor
  - Process: entity capable of executing a task (an abstraction)
  - Processor: physical element that performs the processing
Given these concepts, the job of creating a parallel program from a sequential one consists of four steps, illustrated in Figure 2-3:

1. Decomposition of the computation into tasks,
2. Assignment of tasks to processes,
3. Orchestration of the necessary data access, communication and synchronization among processes, and
4. Mapping or binding of processes to processors.

Together, decomposition and assignment are called partitioning, since they divide the work done by the program among the cooperating processes. Let us examine the steps and their individual goals a little further.
Decomposition

Decomposition means breaking up the computation into a collection of tasks. For example, tracing a single ray in Raytrace may be a task, or performing a particular computation on an individual grid point in Ocean. In general, tasks may become available dynamically as the program executes, and the number of tasks available at a time may vary over the execution of the program. The maximum number of tasks available at a time provides an upper bound on the number of processes (and hence processors) that can be used effectively at that time. Hence, the major goal in decomposition is to expose enough concurrency to keep the processes busy at all times, yet not so much that the overhead of managing the tasks becomes substantial compared to the useful work done.
Figure 2-3: Steps in parallelization, and the relationships among tasks, processes and processors. The decomposition and assignment phases are together called partitioning. The orchestration phase coordinates data access, communication and synchronization among processes, and the mapping phase maps them to physical processors.

Figure: The parallelization process
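The four steps above can be walked through on a toy problem. This is a sketch only, summing an array, with illustrative chunk and worker counts, and threads standing in for processes:

```python
# Toy walk-through of decomposition, assignment, orchestration and mapping.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))

# 1. Decomposition: break the sum into fixed-size chunk tasks.
CHUNK = 100
tasks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# 2. Assignment: the pool hands tasks out to a fixed set of workers.
# 3. Orchestration: the executor moves each chunk to its worker and
#    synchronizes on completion; the final reduction is the only
#    communication the program needs.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, tasks))

# 4. Mapping: binding workers to physical processors is delegated
#    to the operating system's scheduler.
total = sum(partials)
```

Note how decomposition (10 tasks) exposes more concurrency than the 4 workers can use at once, which keeps them busy, matching the decomposition goal stated in the text.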
6 / 15
The Parallelization Process
- Goal: create concurrency
  - Keep processors busy
  - Reduce overhead
The horizontal extent therefore gives us a limit on the achievable speedup with an unlimited number of processors, which is thus simply the average concurrency available in the application over time. A rewording of Amdahl's law may therefore be:

Speedup ≤ (area under concurrency profile) / (horizontal extent of concurrency profile)

Thus, if f_k is the number of X-axis points in the concurrency profile that have concurrency k, then we can write Amdahl's Law as:

Speedup(p) ≤ (Σ_{k=1..∞} f_k · k) / (Σ_{k=1..∞} f_k · ⌈k/p⌉)   (EQ 2.1)

It is easy to see that if the total work Σ f_k · k is normalized to 1 and a fraction s of this is serial, then the speedup with an infinite number of processors is limited by 1/s, and that with p processors is limited by 1/(s + (1−s)/p). In fact, Amdahl's law can be applied to any overhead of parallelism, not just limited concurrency, that is not alleviated by using more processors. For now, it quantifies the importance of exposing enough concurrency as a first step in creating a parallel program.

Figure 2-5: Concurrency profile for a distributed-time, discrete-event logic simulator. The circuit being simulated is a simple MIPS R6000 microprocessor. The y-axis shows the number of logic elements available for evaluation in a given clock cycle.

Figure: Concurrency profile
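The serial-fraction bounds from Amdahl's law are easy to check numerically. A minimal sketch with a made-up serial fraction s:

```python
# Amdahl's law: speedup with p processors when a fraction s of the work
# is serial, plus the limit as p grows without bound.
def amdahl(s, p):
    return 1.0 / (s + (1.0 - s) / p)

s = 0.1                      # example: 10% of the work is serial
limit = 1.0 / s              # bound with infinitely many processors
speedup_16 = amdahl(s, 16)   # bound with 16 processors
```

Even a modest 10% serial fraction caps the 16-processor speedup at 6.4 and the asymptotic speedup at 10, which is why exposing enough concurrency comes first.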
7 / 15
The Parallelization Process

Example: Database Query Processing (COMP 422, Spring 2008, V. Sarkar). The execution of the query can be divided into subtasks in various ways. Each task can be thought of as generating an intermediate table of entries that satisfy a particular clause.

Figure: Task-dependency graph
8 / 15
The Parallelization Process
- Decomposition techniques
  - Recursive decomposition
  - Data decomposition
  - Exploratory decomposition
  - Speculative decomposition
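Recursive decomposition, the first technique listed, can be illustrated with mergesort: every call generates two independent subtasks, each of which could be handed to a different process. A minimal sketch (executed sequentially here for clarity):

```python
# Mergesort as recursive decomposition: the two halves are independent
# tasks and only the merge step needs their results.
def mergesort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left = mergesort(xs[:mid])     # independent subtask 1
    right = mergesort(xs[mid:])    # independent subtask 2
    # Merge: the only step that depends on both subtasks.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

out = mergesort([5, 2, 9, 1, 5, 6])
```

Data decomposition, by contrast, would split the input array itself among processes up front, as in the Ocean grid example.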
9 / 15
The Parallelization Process
- Goal: distribute tasks among processes
  - Balance the load
  - Reduce communication and management overhead
- Low dependence on
  - Architecture
  - Programming model
10 / 15
The Parallelization Process
- Goal: communication and synchronization among processes
- High dependence on
  - Architecture
  - Programming model
  - Programming language
    - MPI
    - OpenMP
    - Hybrid
11 / 15
The Parallelization Process
- Goal: assign processes to processors
  - Minimize execution time
  - Maximize resource utilization
  - Minimize computational costs (overhead)
- Done by the OS or by the process itself
12 / 15
The Parallelization Process
Classification of scheduling methods

- Scheduling
  - Local
  - Global
    - Static
      - Optimal
      - Sub-optimal
        - Approximate
        - Heuristic
    - Dynamic
      - Distributed
        - Cooperative
          - Optimal
          - Sub-optimal
            - Approximate
            - Heuristic
        - Non-cooperative
      - Non-distributed

Figure: Casavant and Kuhl taxonomy
13 / 15
Applications: Solving 0–1 ILP
- Goal: find the values of the output variables that minimize/maximize the objective function
- Binary output variables
- Partitioning
  - Decomposition of the solution space
  - Creation of a search tree
- Communication
  - Independent solutions
  - Results sent to a master node, which picks the best one
14 / 15
Conclusion
- Applications exhibit high concurrency
- Important to study for
  - Architects
  - Programmers
- The parallelization process
- Avoid overhead
15 / 15