View
13
Download
0
Category
Preview:
Citation preview
Threads: either under- or over-utilised
• Underutilised: limited by creation speed of work
– Cannot exploit all the CPUs even though there is more work
• Overutilised: losing performance due to context switches
– There is overhead when switching between OS threads
– Each thread needs to warm up cache again
– Increases memory pressure
• Worst case: continual slow-down
– The cost of creating threads is partially borne by kernel
– User code may slow down more than kernel code under load
– Number of workers slowly goes up; completion rate goes down
HPCE / dt10/ 2015 / 7.1
Solving under-utilisation
template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
if(begin+1 == end){
f(begin);
}else{
TI mid=(begin+end)/2;
std::thread left( // Spawn the left thread in parallel
[&](){ parallel_for(begin, mid, f); }
);
// Perform the right segment on our thread
parallel_for(mid, end, f);
// wait for the left to finish
left.join();
}
}
HPCE / dt10/ 2015 / 7.2
Creation of work using trees
• Tree starts on one thread [0..4)
template<class TI, class TF>
void parallel_for(
const TI &begin,const TI &end,
const TF &f)
{
if(begin+1 == end){
f(begin);
}else{
TI mid=(begin+end)/2;
std::thread left(
[&](){ parallel_for(begin, mid, f); }
);
parallel_for(mid, end, f);
left.join();
}
}
HPCE / dt10/ 2015 / 7.3
Creation of work using trees
• Tree starts on one thread
• Create thread to branch
template<class TI, class TF>
void parallel_for(
const TI &begin,const TI &end,
const TF &f)
{
if(begin+1 == end){
f(begin);
}else{
TI mid=(begin+end)/2;
std::thread left(
[&](){ parallel_for(begin, mid, f); }
);
parallel_for(mid, end, f);
left.join();
}
}
[0..4)
[2..4) [0..2)
spawn
HPCE / dt10/ 2015 / 7.4
Creation of work using trees
• Tree starts on one thread
• Create thread to branch
template<class TI, class TF>
void parallel_for(
const TI &begin,const TI &end,
const TF &f)
{
if(begin+1 == end){
f(begin);
}else{
TI mid=(begin+end)/2;
std::thread left(
[&](){ parallel_for(begin, mid, f); }
);
parallel_for(mid, end, f);
left.join();
}
}
[0..4)
[2..4)
[3..4)
[0..2)
[2..3) [1..2) [0..1)
spawn
spawn spawn
HPCE / dt10/ 2015 / 7.5
Creation of work using trees
• Tree starts on one thread
• Create thread to branch
• Execute function at leaves
template<class TI, class TF>
void parallel_for(
const TI &begin,const TI &end,
const TF &f)
{
if(begin+1 == end){
f(begin);
}else{
TI mid=(begin+end)/2;
std::thread left(
[&](){ parallel_for(begin, mid, f); }
);
parallel_for(mid, end, f);
left.join();
}
}
[0..4)
[2..4)
[3..4)
[0..2)
[2..3) [1..2) [0..1)
f(3) f(2) f(1) f(0)
spawn
spawn spawn
HPCE / dt10/ 2015 / 7.6
Creation of work using trees
• Tree starts on one thread
• Create thread to branch
• Execute function at leaves
• Join back up to the root
template<class TI, class TF>
void parallel_for(
const TI &begin,const TI &end,
const TF &f)
{
if(begin+1 == end){
f(begin);
}else{
TI mid=(begin+end)/2;
std::thread left(
[&](){ parallel_for(begin, mid, f); }
);
parallel_for(mid, end, f);
left.join();
}
}
[0..4)
[2..4)
[3..4)
[0..2)
[2..3) [1..2) [0..1)
f(3) f(2) f(1) f(0)
[3..4) [2..3) [1..2) [0..1)
[2..4) [0..2)
[0..4)
spawn
join
join join
spawn spawn
HPCE / dt10/ 2015 / 7.7
Properties of fork / join trees
• Recursively creating trees of work is very efficient
– We are not limited to one thread creating all tasks
– Exponential rather than linear growth of threads with time
• Problem solved?
• Growth of threads is exponential with time
• Can put significant pressure on the OS thread scheduler
– Context switching 1000s of threads is very inefficient
• Each thread requires significant resources
– Need kernel handles, stack, thread-info block, ...
– Can’t allocate more than a few thousand threads per process
HPCE / dt10/ 2015 / 7.8
Re-examining the goals
• What we want is parallel_for:
“Iterations may execute in parallel”
• std::thread gives us something different:
“The new thread will execute in parallel”
• Our thread based strategy is too eager to go parallel
• We want to go just parallel enough, then stay serial
HPCE / dt10/ 2015 / 7.9
Tasks versus threads
• A task is a chunk of work that can be executed
– A task may execute in parallel with other tasks
– A task will eventually be executed, but no guarantee on when
• Tasks are scheduled and executed by a run-time (TBB)
– Maintain a list of tasks which are ready to run
– Have one thread per CPU for running tasks
– If a thread is idle, assign a task from the ready queue to it
– No limit on number of tasks which are ready to run
– (OS is still responsible for mapping threads to CPUs)
• TBB has a number of high-level ways to use tasks
– But there is a single low-level underlying task primitive HPCE / dt10/ 2015 / 7.10
Overview of task groups
• A task group collects together a number of child tasks
– The task creating the group is called the parent
– One or more child tasks are created and run() by the parent
– Child tasks may execute in parallel
– Parent task must wait() for all child tasks before returning
HPCE / dt10/ 2015 / 7.11
parallel_for using tbb::task_group
#include "tbb/task_group.h"
template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
if(begin+1 == end){
f(begin);
}else{
auto left=[&](){ parallel_for(begin, (begin+end)/2, f); }
auto right=[&](){ parallel_for((begin+end)/2, end, f); }
// Spawn the two tasks in a group
tbb::task_group group;
group.run(left);
group.run(right);
group.wait(); // Wait for both to finish
}
}
HPCE / dt10/ 2015 / 7.12
Overview of task groups
• A task group collects together a number of child tasks
– The task creating the group is called the parent
– One or more child tasks are created and run() by the parent
– Child tasks may execute in parallel
– Parent task must wait() for all child tasks before returning
• Some important differences between tasks and threads
– Threads must execute in parallel
– A thread may continue after its creator exits
– Threads must be joined individually
HPCE / dt10/ 2015 / 7.13
More patterns: tbb::parallel_invoke
template<typename Func0, typename Func1>
void parallel_invoke(const Func0& f0, const Func1& f1);
template<typename Func0, typename Func1, typename Func2>
void parallel_invoke(const Func0& f0, const Func1& f1, const Func2& f2);
• Takes two or more functions and may run in parallel
– Overloaded for different numbers of arguments
– No overload for 1 argument for obvious reasons
• Interface is very clean, but also quite simple
– Decision about number of tasks is completely static
– You can’t add more tasks once some starts
– No choice about when to synchronise with tasks
HPCE / dt10/ 2015 / 7.14
parallel_invoke using task_group
• parallel_invoke can be implement using task_group
– task_group supports a super-set of the functionality
template<typename Fc0, typename Fc1, typename Fc2>
void parallel_invoke(const Fc0& f0, const Fc1& f1, const Fc2& f2)
{
tbb::task_group group;
group.run(f0);
group.run(f1);
group.run(f2);
group.wait();
}
HPCE / dt10/ 2015 / 7.15
Can’t do task_group using parallel_invoke
• task_group is intrinsically dynamic
– Decide how much work to add at run-time
– Can add work even while tasks are running in the group
void my_function(int n, float *x)
{
tbb::task_group group;
for(unsigned i=0;i<n;i++){
if(x[i]==0)
group.run([=](){ f(i); });
else
group.run([=](){ g(x[i]); });
}
group.wait();
}
HPCE / dt10/ 2015 / 7.16
The underlying primitive: tbb::task
• TBB has a basic primitive called tbb::task
• This is the raw unit of scheduling understood by the lib.
– Other high-level wrappers create tasks internally
– The TBB run-time takes tasks and schedules them to a CPU
• Tasks are very flexible, with a lot of power
– Can express complicate dependency graphs
– Build non-local synchronisation and barriers
• With power comes responsibility
– They allow you to make mistakes
– Possible (though not likely) to mess up the TBB run-time
• Better to create wrappers on top that hide tasks
– parallel_for, parallel_invoke, task_group, parallel_reduce, ...
HPCE / dt10/ 2015 / 7.17
class MyTask
: public tbb::task
{
int start, end;
MyTask(int _start, int _end)
{ start=_start; end=_end; }
tbb::task * execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
spawn(t1);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
}
};
void CreateTasks(int start, int end)
{
MyTask &root=*new(allocate_root()) MyTask(start,end);
tbb::task::spawn_root_and_wait();
}
void MyTask(int start, int end)
{
if(cond())
return 0;
tbb::task_group group;
group.run([=](){MyTask(start,(start+end)/2); });
group.run([=](){MyTask((start+end)/2,end); });
DoSomethingFirst();
group.wait();
DoSomethingElse();
return 0;
}
HPCE / dt10/ 2015 / 7.18
Life-cycle of a task
• Life-cycle of task due to interaction between task and run-
time
– Individual task calls spawn, wait_for_all (sync), return
– TBB run-time will keep track of a task’s children (dependencies)
Scheduling through reference counts
• Each task has a reference count and a successor task
• The reference count identifies whether a task is blocked
– If the reference count is zero then the task could be run
– But only if it has been given to the task scheduler
– Legal to create a task and not give it to the scheduler
– Note the difference: “reference count” vs “C++ reference”
• Successor task identifies the task blocked by this task
– Generally the successor task is the creator, or parent
– When a task completes it decrements the count of its successor
HPCE / dt10/ 2015 / 7.20
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Created
void CreateTasks(int start, int end)
{
MyTask &root=*new(allocate_root()) MyTask(start,end);
tbb::task::spawn_root_and_wait();
}
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Running
void CreateTasks(int start, int end)
{
MyTask &root=*new(allocate_root()) MyTask(start,end);
tbb::task::spawn_root_and_wait();
}
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
3
start ...
end ...
successor -
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
3
start ...
end ...
successor -
Running
tbb::task
refcount
MyTask
0
start ...
end ...
Created
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
successor
tbb::task
refcount
MyTask
3
start ...
end ...
successor -
Running
tbb::task
refcount
MyTask
0
start ...
end ...
Created
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Created
successor successor
tbb::task
refcount
MyTask
3
start ...
end ...
successor -
Running
tbb::task
refcount
MyTask
0
start ...
end ...
Ready
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Ready
successor successor
tbb::task
refcount
MyTask
3
start ...
end ...
successor -
Running
tbb::task
refcount
MyTask
0
start ...
end ...
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Ready
successor successor
tbb::task
refcount
MyTask
2
start ...
end ...
successor -
Blocked
tbb::task
refcount
MyTask
0
start ...
end ...
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Ready
successor successor
tbb::task
refcount
MyTask
2
start ...
end ...
successor -
Blocked
tbb::task
refcount
MyTask
0
start ...
end ...
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Ready
successor successor
tbb::task
refcount
MyTask
2
start ...
end ...
successor -
Blocked
tbb::task
refcount
MyTask
0
start ...
end ...
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Running
successor successor
tbb::task
refcount
MyTask
1
start ...
end ...
successor -
Blocked
tbb::task
refcount
MyTask
0
start ...
end ...
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Finished
successor successor
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Running
tbb::task
refcount
MyTask
0
start ...
end ...
Finished
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
successor
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(3);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Finished
void CreateTasks(int start, int end)
{
MyTask &root=*new(allocate_root()) MyTask(start,end);
tbb::task::spawn_root_and_wait();
}
Managing reference counts
• What happens if we get the reference count wrong?
• Finishing task calls decrement_ref_count on
successor
– Automatically returns task to scheduler if count becomes zero
HPCE / dt10/ 2015 / 7.36
tbb::task
refcount
MyTask
1
start ...
end ...
successor -
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(1);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Running
tbb::task
refcount
MyTask
0
start ...
end ...
Finished
tbb::task * MyTask::execute()
{
if(cond())
return 0;
set_ref_count(1);
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Ready
successor successor
tbb::task
refcount
MyTask
0
start ...
end ...
successor -
Running
tbb::task * MyTask::execute()
{
if(cond())
return 0;
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
set_ref_count(3);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Running
successor
tbb::task
refcount
MyTask
-1
start ...
end ...
successor -
Running
tbb::task
refcount
MyTask
0
start ...
end ...
Finished
tbb::task * MyTask::execute()
{
if(cond())
return 0;
MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
spawn(t1);
spawn(t2);
set_ref_count(3);
DoSomethingFirst();
wait_for_all();
DoSomethingElse();
return 0;
}
tbb::task
refcount
MyTask
0
start ...
end ...
Ready
successor successor
Some help is available
• TBB library comes in two forms: debug and release
– release library does no error checking – all about speed
– debug library will check reference counts at many points
• Choose library version at compilation and link stages
– Debug: #define TBB_USE_DEBUG=1 when compiling
• On microsoft compilers it will automatically link the correct library
• On other compilers use “-ltbb” vs “-ltbb_debug”
– Usually maintain different release and debug settings
• Debug: /DTBB_USE_DEBUG=1 /MDd
• Release: /DNDEBUG=1 /O2
– Can setup in Visual Studio or in a makefile
• Generally: try to avoid raw tasks if possible HPCE / dt10/ 2015 / 7.41
Many design patterns are built on tasks
• Iteration in various forms
– parallel_for, parallel_for_each
• Reduction and accumulation
– parallel_reduce
• Data-dependent looping and queue processing
– parallel_do
• Support for heterogeneous tasks
– parallel_invoke, task_group
• Heterogeneous tasks and token-based data-flow
– parallel_pipeline
• Goal: turn design patterns in concrete functions
HPCE / dt10/ 2015 / 7.42
Recommended