
TokenNets: An Approach to Programming Highly Parallel

Measurement Science and Signal Processing

Lee Barford Kevin Mitchell Tom Vandeplas

January 7, 2010

1 Introduction

All industries that provide value-added through the use of software are facing a “multicore crisis.” Clock rates of central processing units (CPUs) stopped rising at their former exponential rate in about 2004. Instead, processor manufacturers are increasing the number of parallel execution units or cores provided per CPU. The number of cores is now increasing exponentially. In order for product performance to grow with improvements in processors, it will be necessary to program in such a way that this ever-growing number of parallel processors is used effectively.

However, for the sort of algorithms typically found in applications where the computational complexity lies in signal processing and measurement science algorithms, currently available programming tools will fall short in exploiting the numbers of cores soon to be available. Gustafson’s Law gives a straightforward (but approximate) way of exploring the limits of performance improvements available from parallelization. The Law can be written

T_P = T_ser + T_par / P,    (1)

where T_P is the time required for a given piece of code to run on P processors, T_ser is the time spent in serial (un-parallelized) code, and T_par is the time spent in the parallelized portion. In order to maximize performance as P increases, the serial fraction (also called the Karp-Flatt metric) [10]

e = T_ser / T_1 = T_ser / (T_ser + T_par)    (2)

needs to be minimized.
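As a hypothetical numerical illustration of (1) and (2) (the numbers are assumed for the example, not measured): if T_ser = 1 ms and T_par = 9 ms, then e = 0.1 and T_1 = 10 ms, and on eight cores T_8 = 1 + 9/8 ≈ 2.1 ms, a speedup of only about 4.7x. As P grows without bound, T_P approaches T_ser = 1 ms, so the speedup can never exceed 10x no matter how many cores are added.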

Currently available tools such as Intel Threading Building Blocks [12, 2], Microsoft Task Parallel Library’s For and ForEach loops and Microsoft PLINQ [14], Cilk [4] and its commercial successor Cilk++, OpenMP [7], and the Matlab Parallel Computing Toolbox [3] all provide capabilities for parallelizing loops within a subroutine, function, or method. However, the manner in which these approaches embed parallel programming within serial languages requires that the sequential semantics of statements be preserved. In particular, consider two parallel loops such as

parallel_for(int i=0; i!=iend; i++) {

...

}


[Figure: the serial code "a = Agilent.DSP.foo(input); b = Agilent.DSP.bar(a); c = Agilent.MeasSci.baz(b);" calling routines foo, bar, and baz from a parallelized library, with a processors-versus-time chart in which busy (parallel) sections alternate with serial setup sections over a total serial time T_serial.]

Figure 1: Unnecessary serialization when only intra-routine parallelization is used.

parallel_for(int j=0; j!=jend; j++) {

...

}

A barrier, a point where threads of execution wait for all to reach the same point in the code, is required between the two loops. The barrier is needed to guarantee the semantics, required by the serial language, that the entire first loop executes before any of the second loop executes. However, this constraint may or may not actually be required for the correctness of the program. An example of this would be if the second loop does not read any memory written by the first loop. For similar reasons, using current tools parallelism is extremely difficult or impossible to realize across function or method boundaries. Figure 1 illustrates the resulting behavior obtainable with these currently available tools. Parallel sections alternate with serial sections and the serial fraction is large. This problem of not finding enough parallelism has been experimentally verified to be a real one in the areas of measurement science and signal processing algorithms. Figure 2 shows a performance analysis of a routine that identifies the starting time of a data frame to sub-sample accuracy. The routine was parallelized using Cilk++. It was timed using one through eight cores of an eight-core (two dies of four cores each) Xeon-class server. Twenty repetitions were done for each measurement. The parameters of (1) were fit to the measurements using a linear least squares procedure. Although the extrapolated performance shown is probably not extremely accurate, it is nonetheless likely that little performance improvement can be expected beyond 16 cores.

In order to get anything like P times speedup for P processors as P increases to eight and beyond, it will be necessary to find and exploit parallelism in signal processing and measurement science algorithms that crosses function and loop boundaries. Since parallel speedup within individual algorithms, routines, or functions will not be sufficient to make full use of the numbers of cores that will soon be available, it is desirable to achieve the situation illustrated in Figure 3. In this figure, an operation on data can be run as soon as the


[Plot "Bebop PmblSync Cilk++ 2009-02-18": projected time (s) versus number of cores, with curves for ideal speedup, projected time, and asymptotic time (P → ∞).]

Figure 2: Speedup of a frame synchronization routine showing projected performance increase with number of cores.


[Figure: the same serial code "a = Agilent.DSP.foo(input); b = Agilent.DSP.bar(a); c = Agilent.MeasSci.baz(b);" with a processors-versus-time chart in which foo, bar, and baz overlap, so that the busy time is much larger relative to setup time than in Figure 1.]

Figure 3: Parallelization of loops and routines with serialization only at points with true dependencies.

operation’s inputs are available. A barrier is required only where it is required by true dependencies, that is, where all future work depends on the results being waited for.

This report describes a language-independent paradigm, called TokenNets, intended to provide the sort of minimally constrained parallelism illustrated in Figure 3. The report also describes a first implementation of TokenNets, called TokenPilot, and the first significant application written in TokenPilot, which does signal processing often required for jitter analysis of serial signals. While writing this application code, we identified several features for making programming easier that could be added to TokenNets. These features are discussed. The report concludes with a summary of the advantages we see in the TokenNets approach over other approaches that are available or discussed in the literature.

2 TokenNets

TokenNets is a language-independent scheme for writing DSP and measurement science algorithms and entire measurement applications. Rather than expressing how code should run in parallel, it provides a means for expressing the constraints on parallelism. Parallel execution may proceed in any way not prohibited by the constraints. The programmer writes serial code, never parallel code, that is invoked by the TokenNets implementation. In other words, TokenNets separates the concerns of (1) specifying computations and (2) initiating and coordinating computations. The TokenNets implementation hides from the programmer the messy and error-prone details of parallel programming such as processes, threads, and locks or other mutual exclusion mechanisms.

We will first present the construction and behavior of a TokenNet without reference to the language the computations are written in. First, what a TokenNet is and how it functions is described fairly informally by describing two small example algorithms in TokenNets. Then, a more formal description is given. In the following section we describe the first implementation of TokenNets, called TokenPilot, and discuss some of the language- and system-dependent features of that implementation.


[Figure: (a) an array divided into grains 1 through m; (b) the TokenNet: the environment places tokens <g>; vertex v1 ("min/max of each grain", can run in parallel and in any order) emits tokens <g,min,max>; vertex v2 ("overall min/max", only one grain min/max at a time, outputs one token <min,max> when all grains have been processed).]

Figure 4: Example TokenNet that computes the minimum and maximum of an array. (a) Array broken up into grains or contiguous segments. (b) The TokenNet.

2.1 Informal Description

We will describe TokenNets informally by discussing two small example parallel algorithms expressed using TokenNets. These examples serve to show how considerable parallelism can be obtained using simple TokenNets. (Neither represents the parallel algorithm with the highest parallel speedup for its task. But those other algorithms from the literature are rather more complicated. Presenting these more complicated algorithms would put attention on them rather than on TokenNets.)

Figure 4 shows a small TokenNet that is used to control the parallel computation of the minimum and maximum of a (presumably large) array. The array is divided into a number m of contiguous segments called grains, illustrated in Figure 4(a). m typically will be somewhat more than the number of processors or cores P. Figure 4(b) shows the TokenNet. It consists of two nodes v1 and v2.

A program, referred to as the environment, that calls the minimum/maximum algorithm places tokens that contain indices of the grains, g = 0, 1, ..., m − 1. This creation and placing of tokens is performed through an application program interface (API) provided by the TokenNets implementation.

For each token received, vertex v1 computes the local minimum and maximum within the grain with the token’s grain index g. Then, the function creates and returns a token containing the data g, the minimum, and the maximum. v1 places no constraint on parallel execution. An unlimited number of computations of the grainwise minimum and maximum can take place in parallel and in any order.


[Timeline: activities "put grain tokens", "compute min/max of each grain", and "compute overall min/max from grain mins/maxes" plotted against time.]

Figure 5: Hypothetical timeline of execution of the TokenNet in Figure 4 with number of cores P = 4 and number of grains m = 4.

Tokens output by v1 are sent to v2 along the edge that connects them. Vertex v2 computes the minimum and maximum of the entire array from the grain-wise minima and maxima. v2 has internal state: the minimum and maximum of the local minima and maxima from the tokens it has processed so far, and the number of tokens processed so far. When applied to a token (g, min, max), v2’s execute function f_v2 compares the running minimum and maximum to min and max and updates its internal state. When all m local minimum/maximum tokens have been processed, f_v2 outputs a token containing the overall minimum and maximum of the array. In order that the internal state of v2 be updated consistently, only one token can be “running” at a time in v2. The parallelism in v2 must be constrained: execution must be limited to one token at a time.

Figure 5 shows a hypothetical timeline of the execution of the TokenNet in Figure 4. An execution of a local min-max finding routine begins as soon as the proper token is deposited into v1. Those executions take much longer to run than the m executions of f_v2 because the number of array elements in each grain is much larger than 1. Executions of f_v2 can begin as soon as partial results, the local minima and maxima, are available.

Figure 6 shows another example TokenNet that does transition localization (sometimes called edge finding) in a one-dimensional signal. The purpose of the net is to find the times (or sample indices) at which the signal transitions between its low and high states. Three voltages are given, called the low, medium, and high reference levels. Essentially, a transition is recorded when the voltage crosses the mid reference level, subject to criteria that make it so that glitches are not considered transitions. A complete description of what constitutes a transition is found in the relevant IEEE standard [1].

The TokenNet in Figure 6 localizes all the transitions in a signal stored as samples in a set of grains. The top vertex in the figure takes two tokens at a time:

• one that contains the grain (indexed by acquisition a, channel c, and grain g), and

• the other that contains the reference levels that have been computed previously, presumably by another subgraph of the TokenNet.

Code associated with the vertex is executed when its edges have tokens with equal values of a, c, and g. The vertex’s code localizes the transitions within one grain specified by its input


[Figure: vertex "edge find one grain" (can run in parallel and in any order) takes tokens <chan a,c,g> and <referenceLevels a,c,g> on its in-edges and emits <grainEdges a,c,g>; vertex "patch beginning" (processes grains in order, from beginning to end of the waveform) emits <edges a,c,g>.]

Figure 6: TokenNet that performs transition localization (edge finding) in a one-dimensional signal.

tokens. It outputs a token containing the a, c, and g indices along with the localizations. The token also contains the logical state (e.g., low, high, or indeterminate) of the signal at the end of the grain. The vertex’s parallelism is unconstrained, like vertex v1 in the first example.

Now these localizations may not be complete. It is possible to miss the first edge at the beginning of each grain because it may not be possible to determine correctly the state of the signal at the boundary between grains. The purpose of the second vertex is to identify when that occurs by examining the end of each grain and the beginning of the succeeding grain, then incorporate any missed transitions into its output tokens. The second vertex receives the tokens produced by the first vertex. However, the second vertex needs to examine the logical state of the signal at the end of the i-th grain in order to determine whether a transition is missed near the start of the (i+1)-st grain. So, the second vertex is also constrained in its parallelism and in its execution order. That vertex can only be “running” one token at a time. Furthermore, it has to run the tokens in the order g = 0, 1, 2, ....

We have seen that TokenNets are directed graphs on which tokens flow. Tokens both contain (or contain references to) read-only data and control execution. A vertex has an


associated function that runs when the right tokens are sent to it. Constraints on parallel execution of a vertex are sometimes necessary. A vertex can specify that it can run on at most one token at a time. A vertex can also specify that it must run on tokens in a certain order. We now turn to making these notions more precise.

2.2 Formal Description

A TokenNet is a directed acyclic graph (DAG) G = (V, E) with vertices V and edges E ⊂ V × V. Let in(v) = {(u, v) ∈ E | u ∈ V} be the in-edges of vertex v. There is a total order on the in-edges of v, so they can be indexed

in(v) = {in(v)_1, in(v)_2, ...}.

Let out(v) = {(v, u) ∈ E | u ∈ V} be the out-edges of v. Let the source vertices of G be written S(G) = {v ∈ V | in(v) = ∅}.
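As a concrete instance, the minimum/maximum TokenNet of Figure 4 has V = {v1, v2} and E = {(v1, v2)}, so S(G) = {v1} and in(v2) = {(v1, v2)}.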

Each edge e is associated with a token type T_e which is a finite cross product of types from the underlying language,

T_e = T_{e,1} × T_{e,2} × ....

We abbreviate the token type of in(v)_i as IT(v, i). The overall token input type of a vertex, written IT(v), depends on whether or not the vertex is a source. When v ∉ S(G), then IT(v) = IT(v, 1) × IT(v, 2) × ... × IT(v, |in(v)|). On the other hand, when v ∈ S(G), then IT(v) must be given.

Each edge e = (v1, v2) is also associated with a projection function Π_e : T_e → U(v2), where U(v2) is a tuple type. Note that Π_{e1} and Π_{e2} have the same codomain U(v) for e1, e2 ∈ in(v).

The token types of all out-edges of a vertex must be the same. That is, T_{e1} = T_{e2} for e1, e2 ∈ out(v). This token type is written OT(v) for short. When v is a sink, that is, out(v) = ∅, without loss of generality OT(v) is void.

Each vertex v is associated with an execute function f_v. This function has type IT(v) → OT(v). The functions f_v are written in the underlying language. Each vertex v has an associated parallelism constraint C(v) and a matching function m_v of type IT(v) → boolean. C(v) is an element of {unconstrained, exclusive, sequential}. When C(v) = sequential, a function s(v) : U(v) → N, called the sequence function of v, must also be given.
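As an example grounded in Figure 6: the first vertex there would have C(v) = unconstrained, while the second ("patch beginning") vertex would have C(v) = sequential with a sequence function that simply returns the grain index, s(v)((a, c, g)) = g, so that grains are processed in the order g = 0, 1, 2, ....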

A token of token type T is an object from the underlying language of type T.

The environment is an external body of code that initiates a TokenNets computation. The environment does so by creating and placing tokens t into sources v ∈ S(G). Once a token t is placed onto a source v, the function f_v(t) may begin to execute.

When a function f_v(t) finishes executing, it returns a token t′ of type OT(v). Token t′ is logically copied as necessary and a copy of t′ is placed on each out-edge in out(v).

The determination of when f_v of a non-source vertex may execute depends on C(v). If C(v) = unconstrained, then whenever there exist a token t1 on in(v)_1, a token t2 on in(v)_2, ..., such that the Π_{in(v)_i}(t_i) are all equal tuples, the following happen:

• t1 is removed from in(v)_1, t2 is removed from in(v)_2, ..., and

• f_v(t1, t2, ...) begins execution.

The test of the matching function m_v, the removal of tokens, and the initiation (but not completion) of the execution of f_v(t1, t2, ...) happen atomically.


The case C(v) = exclusive is similar except that only one invocation of f_v(t1, t2, ...) may be executing at a time. In other words, the same three steps happen when C(v) = exclusive as when C(v) = unconstrained, but all three steps, including the running of f_v(t1, t2, ...) to completion, happen atomically. Vertices v with C(v) = exclusive are used to perform operations that cannot be done in parallel.

A vertex v with C(v) = sequential is used to handle the cases where an operation must be done in a particular order. If C(v) = sequential, then v behaves as if C(v) = exclusive except that there is an additional constraint on the order of execution. For any value of t′ = Π_{(v′,v)}(t), f_v cannot be executed on token arguments with s(t′) = i + 1 until it has been executed to completion on token arguments with s(t′) = i.

Whatever the value of C(v), when f_v(t1, t2, ...) is complete, its value (a token) is placed onto every out-edge of v.

Once the environment has placed a token onto a source of the TokenNet graph, execution of the TokenNet terminates when both

• No f_v is executing for any v, and

• No token is waiting on any edge.

As a practical matter, a TokenNets implementation must provide a means of notifying the environment once termination has occurred.

3 TokenPilot: The Initial Implementation of TokenNets

The first implementation of TokenNets is TokenPilot. It is written as a set of C# 4.0 classes. TokenPilot uses the lightweight task creation and monitoring facilities provided with .NET 4.0’s Task Parallel Library. These facilities are, in turn, implemented on top of a work-stealing scheduler. Such schedulers distribute work dynamically and greedily. Work-stealing schedulers approach optimal load balancing asymptotically [9]. Intuitively, near-optimal load balancing should lead to good parallel speedup. (However, intuition can be wrong in the presence of complex memory hierarchies, cache coherency hardware, and other complications. So this is a plausibility argument for, not a proof of, good performance.) One way of viewing the TokenNets approach is as a way of generating a large number of tasks that keep a near-optimal scheduler well supplied with tasks to schedule.

The TokenNet concepts of graph, vertex, and token are implemented as objects. In what follows, symbols written in italics like v refer to the TokenNet abstractions discussed in the previous subsection. Symbols written in typewriter font like v refer to C# objects in the TokenPilot implementation. A graph can be built up out of vertices and edges dynamically, tokens created and inserted into the graph, graph termination monitored, and the graph destroyed, all under program control. Thus, C# programs can write and execute TokenNets. This ability is useful in many instrumentation contexts because the measurement algorithms to be run can change dynamically upon user input or commands delivered to an instrument via a computer network.

A token may be any object that provides an Equals() method and a GetHashCode() method that are consistent in the sense that if two tokens t1 and t2 have t1.Equals(t2), then t2.Equals(t1) and t1.GetHashCode()==t2.GetHashCode(). However, for convenience a hierarchy of token classes is provided that are tuples of low arity over C# types or classes.
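As a sketch of this contract only, the following shows a custom token type whose Equals() and GetHashCode() agree. The class name and fields are hypothetical and are not part of the TokenPilot library.

class GrainIndexToken
{
    public readonly int Acquisition;
    public readonly int Channel;
    public readonly int Grain;

    public GrainIndexToken(int acquisition, int channel, int grain)
    {
        Acquisition = acquisition;
        Channel = channel;
        Grain = grain;
    }

    public override bool Equals(object obj)
    {
        GrainIndexToken other = obj as GrainIndexToken;
        return other != null
            && Acquisition == other.Acquisition
            && Channel == other.Channel
            && Grain == other.Grain;
    }

    public override int GetHashCode()
    {
        // Equal tokens must yield equal hash codes.
        int hash = 17;
        hash = hash * 31 + Acquisition;
        hash = hash * 31 + Channel;
        hash = hash * 31 + Grain;
        return hash;
    }
}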


A vertex v has a method Execute() which is overridden to provide the execute function f_v. v also has a method that provides the projection function Π_e for each of v’s in-edges e. These projection methods default to identity functions but may be overridden. Base vertex classes are provided with in-degree 1, 2, and so on. The base vertex classes have template arguments for the types of tuples expected on the in-edges. This guarantees that only type-correct graphs can be created, in the sense that only tokens of the expected types can be placed on edges. The base vertex classes are unconstrained. Subclasses provide exclusive and sequential vertices. The constructor of a vertex v takes the vertices at the other end of v’s in-edges as arguments. So, only acyclic graphs can be created.

The vertex base classes provide a method that subclasses can use to send a token out the out-edges of a vertex. In order to maximize available parallelism, that method immediately creates a task for each out-edge. Each out-edge task determines whether the placement of the token on that edge causes the Execute() method of the destination vertex to run or not. If the Execute() method is to run, it is run by that task.

As in other programming paradigms, in TokenNets there are idioms that are often repeated. In TokenNets, rather than a textual form, such an idiom is a commonly used subgraph. An example is the subgraph that corresponds to the map/reduce [6] parallel programming construct. Figure 4 is an example of a map/reduce subgraph in TokenNets. v1 provides the map computation, mapping grains to local minima and maxima. v2 is the reduce step, reducing the grain-wise minima to the overall minimum and the grain-wise maxima to the overall maximum. TokenPilot is designed to provide functions that build such common subgraphs. Currently, a function is provided that builds map/reduce subgraphs. The function takes the vertex providing the in-edge as an argument and the mapping function and the reduce function as function-valued arguments.

The TokenPilot program corresponding to the TokenNet in Figure 4 consists of two class definitions, one for each vertex, and a main program. There is also a simple struct that contains minima and maxima:

struct MinMax

{

public double min;

public double max;

}

The class of the vertex that finds the grain-wise minima and maxima (v1) is declared

class GrainMinimizerVertex :

Vertex<TokenIndices1, VectorGrain<double>, TokenIndices1, MinMax>

{

...

The base class Vertex<TokenIndices1, VectorGrain<double>, TokenIndices1, MinMax>

is the class of vertices that have one in-edge that can carry tokens containing one data item of class VectorGrain<double> and out-edges that can carry tokens containing one data item of class MinMax. A VectorGrain<double> contains a reference to one grain from a vector (array) of doubles. The Execute() function uses .NET extension methods to compute the grain-wise minimum and maximum without any explicit loops. The method this.SendToken(), provided by the Vertex base class, sends copies of the resulting token to any vertex with an in-edge originating in this vertex:


protected override

void Execute(Token<TokenIndices1, VectorGrain<double>> tok1)

{

VectorGrain<double> grain = tok1.Data;

MinMax mm;

mm.min = grain.Min();

mm.max = grain.Max();

this.SendToken(

new Token<TokenIndices1, MinMax>(tok1.Indices, mm));

}

The other vertex (v2) needs to have the exclusive constraint on parallelism. So its class derives from a TokenPilot-provided class that ensures that only one invocation of the Execute() method can be executing at a time.

class OverallMinMax :

MutuallyExclusiveVertex<TokenIndices1, MinMax, TokenIndices0, MinMax>

{

...

The Execute() method tracks the number of tokens seen and outputs a token when all the necessary tokens have been received.

protected override

void Execute(Token<TokenIndices1,MinMax> tok1)

{

MinMax tokenMinMax = tok1.Data;

this.mm.min = Math.Min(this.mm.min, tokenMinMax.min);

this.mm.max = Math.Max(this.mm.max, tokenMinMax.max);

grainsSeen++;

if (grainsSeen == numInputTokens)

{

Console.Write("Overall Min="); Console.Write(this.mm.min);

Console.Write(" Overall Max="); Console.WriteLine(this.mm.max);

this.SendToken(

new Token<TokenIndices0, MinMax>(new TokenIndices0(), mm));

}

}
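The fields mm, grainsSeen, and numInputTokens used above, and the constructor of OverallMinMax, are not shown in this report. A plausible sketch, under the assumption that the MutuallyExclusiveVertex base class takes the upstream vertex and the graph in its constructor (consistent with the Main program below, which calls new OverallMinMax(v1, v.Grains.Count, graph)), is:

private MinMax mm = new MinMax { min = double.MaxValue, max = double.MinValue };
private int grainsSeen = 0;          // tokens processed so far
private readonly int numInputTokens; // grain tokens expected in total

public OverallMinMax(GrainMinimizerVertex inVertex, int numInputTokens, Graph graph)
    : base(inVertex, graph)
{
    this.numInputTokens = numInputTokens;
}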

The main program creates and fills a vector with data, builds a graph out of the two vertices, inserts tokens into the graph, and waits for the computation to complete.

static void Main()

{

// Create vector containing data to min & max

int size = 1<<16;

int grainSize = size>>4;

GrainedVector<double> v = new GrainedVector<double>(size, grainSize,

(int i) => Math.Sin(i));

// Create the graph

Graph graph = new Graph();


GrainSource<double> source = new GrainSource<double>(graph);

GrainMinimizerVertex v1 = new GrainMinimizerVertex(source, graph);

OverallMinMax v2 = new OverallMinMax(v1, v.Grains.Count, graph);

// Run the graph and wait for it to terminate

source.PutGrains(v);

graph.WaitAll();

}

The output of the program is shown in Figure 7.

4 Example

4.1 Introduction

The example is based on the IEEE 181-2003 standard [1] for the description of waveforms. We will analyze a pulse waveform and determine the time interval error. For the sake of the example, some simplifications have been made to the actual algorithm. For a given waveform, the time interval error can be determined as illustrated in Figure 8.

The first step in the process determines the state levels, or reference levels, using a histogram; for simplicity, a bimodal amplitude distribution is assumed. To generate the histogram, the amplitude range must be divided into M unique amplitude intervals. The amplitude interval is called the histogram bin width, Δy. The amplitude range, y_R, is determined by the minimum and maximum amplitude values: y_R = y_max − y_min. For equal-sized bins, Δy is found by dividing y_R by M:

Δy = y_R / M = (y_max − y_min) / M.

Next the state level boundaries are determined based on the histogram bin counts. The histogram is split into an upper and a lower histogram. The low state level is given by the mode of the lower histogram, and the high state level is given by the mode of the upper histogram.
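A minimal serial sketch of this state-level estimation, written outside the TokenNet framework, is shown below. The method name, the binning convention, and the use of bin centres are assumptions made for illustration; they are not taken from the report’s code.

static void FindStateLevels(double[] samples, int M,
                            out double lowLevel, out double highLevel)
{
    // Amplitude range y_R = ymax - ymin.
    double ymin = samples[0], ymax = samples[0];
    foreach (double y in samples)
    {
        if (y < ymin) ymin = y;
        if (y > ymax) ymax = y;
    }
    double dy = (ymax - ymin) / M;            // histogram bin width

    int[] bins = new int[M];
    foreach (double y in samples)
    {
        int b = (int)((y - ymin) / dy);
        if (b == M) b = M - 1;                // put the top edge in the last bin
        bins[b]++;
    }

    // Mode of the lower half histogram gives the low state level,
    // mode of the upper half gives the high state level.
    int lowBin = 0, highBin = M / 2;
    for (int i = 1; i < M / 2; i++)
        if (bins[i] > bins[lowBin]) lowBin = i;
    for (int i = M / 2 + 1; i < M; i++)
        if (bins[i] > bins[highBin]) highBin = i;

    lowLevel = ymin + (lowBin + 0.5) * dy;    // report the bin centre
    highLevel = ymin + (highBin + 0.5) * dy;
}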

Once the low and high state levels have been determined, the reference levels can be calculated directly from those parameters.

A = level(s2) − level(s1)

y_10% = level(s1) + (|A| / 100) · 10

y_50% = level(s1) + (|A| / 100) · 50

y_90% = level(s1) + (|A| / 100) · 90

where

A is the amplitude of the waveform

level(s1) is the low state level


[Screenshot; the trace shown was taken on a dual-core machine using two threads.]

Figure 7: Screenshot of code and execution trace of the TokenPilot implementation of the minimum/maximum example of Figure 4.


[Block diagram with blocks: Data Source, Find Min/Max, Make Histogram, Calculate Reference Levels, Determine sample state, Find Epochs / Edges, Calculate Intervals, Estimate Rough Unit Interval, Count Unit Intervals, Calculate Fine Unit Interval, Calculate Time Interval Error.]

Figure 8: Jitter Analysis Block Diagram

level(s2) is the high state level

yx% is the value of the reference level.
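As a hypothetical numeric illustration: with level(s1) = 0 V and level(s2) = 1 V, the amplitude is A = 1 V, giving y_10% = 0.1 V, y_50% = 0.5 V, and y_90% = 0.9 V.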

Next the waveform is scanned for subepochs, edges, and transitions; this is called the parsing process. The first step in the parsing process assigns state levels to each amplitude value by comparing the waveform value to the upper and lower boundaries of the states defined by the reference levels. If the waveform value is contained within the boundaries of a state, the waveform value is assigned that state level. If a waveform value is not within the boundaries of any state, the value is assigned the undefined state.

state(t_i) = low state        if y_min ≤ y(t_i) ≤ y_10%,
             undefined state  if y_10% < y(t_i) < y_90%,
             high state       if y_90% ≤ y(t_i) ≤ y_max.

The waveform will be decomposed into subepochs based on the assigned state levels. A subepoch is defined by the state value and the start time. To filter out false subepoch states, the length of each subepoch defining an assigned state is compared against a predefined minimum state duration. If the requirement is not met, the subepoch is assigned the undefined state and merged together with adjacent subepochs with an undefined state level.

∀t ∈ [t_x, t_y] : state(y_t) = S,    t_y − t_x > d_min

After the subepochs have been merged, we will look for state transitions. A transition is defined either by two adjacent subepochs with different state values, or alternatively by three adjacent subepochs where the state level of the second subepoch is undefined while the state level of the first subepoch is different from the state level of the third subepoch. Each transition or edge is defined by its interpolated y_50% crossing time.
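One way to obtain that crossing time, assuming linear interpolation between the two samples y_k and y_{k+1} that straddle y_50% (an assumption of this sketch, not stated explicitly in the report), is

t_cross = t_k + ((y_50% − y_k) / (y_{k+1} − y_k)) · (t_{k+1} − t_k).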


[Diagram of the state levels y_max, y_90%, y_50%, y_10%, y_min and the valid transitions between them.]

Figure 9: Valid transitions between state levels

The edge times are used to determine the unit time interval of the composite waveform. This is a two-step process: first a rough estimate of the unit time interval is determined. One way to define the rough estimate is by looking for the minimum time interval between two edges. A probably better method would be to use the 25th percentile of the histogram of the time intervals, which would filter out possible outliers. For the sake of simplicity we will use the first method in this example.

Using the rough estimate of the unit time interval we can determine the total number of unit time intervals between the first and last edge in the composite waveform by summing the rounded number of estimated unit intervals between each pair of adjacent edges. Now that the total number of unit time intervals is known, we can calculate the precise unit time interval. Assuming the first and last edge in the composite waveform have no time error, the unit time interval is the time between the first and last edge divided by the number of unit time intervals.

Δt_i = t(E_{i+1}) − t(E_i)

θ = min_i(Δt_i)

N = Σ_i ⌊Δt_i/θ + 0.5⌋

Θ = (t(E_n) − t(E_0)) / N

where

t(E_i) is the time of the i-th edge

Δt_i is the time interval between two adjacent edges in the composite waveform

θ is the rough estimate of the unit time interval

N is the total number of unit time intervals

Θ is the unit time interval.

Finally, the time error for each edge can be determined by comparing its measured time against the expected time, which is an integer multiple of the unit time interval calculated in the previous step.
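Written explicitly, and assuming the first edge is taken as the zero-error reference (an assumption made for this illustration), the time interval error of the i-th edge would be

TIE_i = t(E_i) − (t(E_0) + n_i Θ),    where n_i = ⌊(t(E_i) − t(E_0))/Θ + 0.5⌋.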


class GrainSource<T> : Source<TokenIndices2, GrainedVector<T>>
{
    private int _FrameId;

    public GrainSource(
        Graph graph,
        VertexDescriptionAttribute description = null)
        : base(graph, description)
    {
        _FrameId = 0;
    }

    public void PutGrains(GrainedVector<T> input, int Channel)
    {
        this.SendToken(new Token<TokenIndices2, GrainedVector<T>>
            (new TokenIndices2(_FrameId, Channel), input));
        _FrameId++;
    }
}

Figure 10: GrainSource definition

4.2 Implementation

The example will be implemented using the TokenPilot library. The basic concept of the library is to break down the algorithm into smaller tasks, which can then be executed in parallel. The implementation examples below will highlight different implementation options and challenges. The code below is by no means complete. For the complete source code please contact the authors.

As a first step the input data, referred to as the input vector, will be split into smaller chunks or grains. The number of grains is chosen based on the available CPU cores and the nature of the algorithm. For this case the number of grains is set to twice the number of available cores. In the graphical representation, the indices are marked by <x, y, z> and the data is marked as {d}.
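A one-line sketch of choosing the grain count as twice the core count (Environment.ProcessorCount is the standard .NET way to query the core count; the rounded-up grain size is an assumption made for illustration, not the report’s code):

int numberOfGrains = 2 * Environment.ProcessorCount;
int grainSize = (data.Length + numberOfGrains - 1) / numberOfGrains;  // round up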

A class GrainedVector manages a vector of data and provides for breaking it up and accessing it as a set of grains. A data source type will be implemented that will feed tokens containing data of type GrainedVector into the graph. For the sake of the example, first a data source vertex is created that sends out tokens of type GrainedVector (Figure 10). Next a vertex is added that will split the GrainedVectors into their grains (Figure 11). In a real-world situation, it would be preferable to combine both operations in a single vertex to reduce overhead. This example illustrates how custom types are used to define a vertex. Later on we’ll illustrate how a vertex can be created without the need for an extra custom class.

At the highest level of granularity, the GrainedVector, token indices mark the identity of the vector (say, as a sequence number indexing data acquisitions) and the identity of the measurement channel the vector came from. This channel ID illustrates how extra identifiers can be added to a token.

After the vector is split into grains, additional indices mark the identity of the grain and communicate the total number of grains in the vector. Using the GrainSource and VectorSplitter classes, an initial graph is created (Figure 12). As illustrated in the code below, a GrainedVector is initialized from an array of numbers, which are samples. Notice the indexing of the array using a lambda function, a feature that provides a flexible means of defining the data in a GrainedVector.


class VectorSplitter : Vertex<TokenIndices2, GrainedVector<double>,
                              TokenIndices4, VectorGrain<double>>
{
    public VectorSplitter(
        TokenSender<TokenIndices2, GrainedVector<double>> inVertex,
        Graph g, VertexDescriptionAttribute d = null)
        : base(inVertex, g, d)
    { }

    protected override
    void Execute(Token<TokenIndices2, GrainedVector<double>> tok1)
    {
        GrainedVector<double> data = tok1.Data;
        TokenIndices2 ind = tok1.Indices;
        int numGrains = data.Grains.Count;
        for (int i = 0; i != numGrains; i++)
        {
            this.SendToken(new Token<TokenIndices4, VectorGrain<double>>
                (new TokenIndices4(i, numGrains, ind[0], ind[1]),
                 data.Grains[i]));
        }
    }
}

Figure 11: VectorSplitter definition

Double[] data;
int numberOfGrains = 16;
...
GrainedVector<double> v = new GrainedVector<double>(numberOfGrains, grainSize,
    (int i) => data[i]);

// Create the graph
Graph graph = new Graph();

// A data source is added to the graph
GrainSource<double> source = new GrainSource<double>
    (graph, new VertexDescriptionAttribute("GrainedVector Source"));

// Split the Vector into its grains
VectorSplitter GrainSource = new VectorSplitter
    (source, graph, new VertexDescriptionAttribute("Grain Source"));

[Graph: Data Source emits <vec#,chan#> {GrainedVector<double>} tokens to Split into Grains, which emits <gr#,numGr,vec#,chan#> {VectorGrain<double>} tokens.]

Figure 12: Initial Graph containing a data source


4.2.1 Reference Level Calculation

To calculate the reference levels, the overall minimum and maximum amplitude values have to be determined. This is done in a way similar to that described in Section 2.1.

The implementation will make use of the MapToken and ReduceToken classes that are provided in the TokenPilot library. These types allow a user to create functional vertices without the need to define a custom class, by making use of lambda functions as parameters. To improve readability of the code, this method is only used for simple vertices.

The MapToken type will apply the provided lambda function to every token presented on its in-edge. The result of this specific MapToken vertex is the MinMax of the input grain, where MinMax is a type containing a minimum and a maximum value. Since MinMax has the addition operator defined, two values can easily be combined by adding them. To calculate the MinMax of the enumerable VectorGrain, the C# Aggregate function is used.
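The report does not show these operator definitions. A plausible extension of the MinMax struct from Section 3, consistent with the calls in Figure 13 (acc + d folds a sample into a partial result, acc + inp merges two partial results, and MinMax.Default is the identity element), is sketched below; it is an assumption, not the report’s code.

struct MinMax
{
    public double min;
    public double max;

    public static readonly MinMax Default =
        new MinMax { min = double.MaxValue, max = double.MinValue };

    // Fold one sample into an accumulated MinMax.
    public static MinMax operator +(MinMax acc, double d)
    {
        return new MinMax { min = Math.Min(acc.min, d),
                            max = Math.Max(acc.max, d) };
    }

    // Merge two partial results.
    public static MinMax operator +(MinMax a, MinMax b)
    {
        return new MinMax { min = Math.Min(a.min, b.min),
                            max = Math.Max(a.max, b.max) };
    }
}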

The ReduceToken class combines a limited number of different tokens based on a common feature. In this case ReduceToken is used to combine the partial MinMax results generated by the previous step. To define the common feature we rely on the indices: only tokens whose last two indices match will be combined. In this example the last two indices represent the vector ID and channel ID. The second index, the grain count, is used to limit the number of tokens that are combined. The output of this vertex will be a token containing the MinMax value of an input vector, identified by the vector ID and the channel ID. Internally the ReduceToken vertex will define vertex-local storage to accumulate the different input tokens. Once all the required tokens are available, an output token is generated, after which the input tokens are discarded.

Finding the histogram and the reference levels is handled similarly to how the minima and maxima are computed. To generate a histogram we first define the edges and bin width of the histogram; these values are based on the MinMax of the vector and are again combined in a data type called HistogramDescription. A single token containing a HistogramDescription will be generated per vector. To generate the histogram itself, a vertex will calculate the partial histograms for each grain, which will then be combined into the overall histogram for the vector, similar to how the minimum/maximum is computed. However, to calculate a partial histogram we’ll need two inputs: the grain containing the samples, and the HistogramDescription containing the information on how to create each histogram. Since for each vector only one HistogramDescription is available while there will be a number of grains per vector, we repeat the HistogramDescription. Every grain token will now have a HistogramDescription token with matching indices. Based on the histogram, a single token per vector will be generated containing a data type defining the reference levels for that specific vector.

4.2.2 Waveform Parsing

The waveform parsing step identifies edges in a vector based on the reference levels. Again, we will start by parsing each grain individually, similar to the way we constructed the histogram. For each grain we’ll start by defining epochs, each of which may contain one edge or glitch. Using the epoch definitions we’ll look for edges. Every set of three epochs can have one of two edge locations, see Figure 14.

However, the boundary epochs cannot be defined, as they might be the continuation of an epoch defined in the previous or next grain. In the worst case, after combining the


// Find the MinMax for each Grain
MapToken<TokenIndices4, VectorGrain<double>, MinMax> MinMaxInGrain =
    new MapToken<TokenIndices4, VectorGrain<double>, MinMax>
        (GrainSource, graph,
         token => token.Data.Aggregate<double, MinMax>(MinMax.Default,
                                                       (acc, d) => acc + d),
         new VertexDescriptionAttribute("Grain Min/Max"));

// Combine the Partial Results
ReduceToken<TokenIndices4, MinMax, TokenIndices2, MinMax> OverallMinMax =
    new TokenPilot.Utilities.ReduceToken<TokenIndices4, MinMax, TokenIndices2, MinMax>
        (MinMaxInGrain, graph,
         ind => new TokenIndices2(ind[2], ind[3]), // Matcher
         ind => ind[1],                            // Size
         (acc, inp) => acc + inp,
         MinMax.Default,
         new VertexDescriptionAttribute("Vector Min/Max"));

[Graph: Min/Max Per Grain takes <gr#,numGr,vec#,chan#> {VectorGrain<double>} tokens and emits <gr#,numGr,vec#,chan#> {MinMax}; Overall Min/Max Of Vector emits <vec#,chan#> {MinMax}.]

Figure 13: Find Minimum, Maximum

[Diagram: three adjacent epochs with the possible edge locations between them.]

Figure 14: Detected edge locations in a set of three epochs

grains, an edge might even be thrown away because it doesn’t meet the minimum length criterion. Because of this, the first and last four possible edge locations cannot be detected by making use of a single grain, see Figure 15.

For each grain, a token will be generated containing the detected edges for that grain plus the information on the boundary epochs to allow overlap processing in the next vertex. Combining the grains to detect the remaining edges imposes a new issue: not only will we have to combine the grains of a specific vector, we will also have to process them in an ordered way. TokenPilot provides a helper class to facilitate this: the OrderedAggregateVertex will combine the data in a set of matching tokens into a token containing an ordered array of the data.

Next we implement a vertex that combines the elements of the ordered array. Rather than creating a single output token containing all the edges in the vector, we complete the individual grains with the missing edges. In addition, in order to make processing in the next stages easier, we repeat the last edge of a specific grain as the first edge in the next grain. The generated tokens will contain an ordered array of edge times.

4.2.3 Unit Time Interval Calculation

Finally we can start calculating the unit time interval. The vertices that do so are created using the techniques explained above. First we calculate the time between edges in each grain using MapToken. Next we look for the minimum interval using a combination of MapToken and ReduceToken, similar to the example above. The output of this ReduceToken vertex is a rough estimate of the unit interval. Next a vertex is implemented to determine the number of intervals in each grain, based on the rough estimate of the unit time interval. As usual, a second ReduceToken reduces the partial results to compute the total number of unit


[Diagram: a grain containing epochs Ep0, Ep1, ..., EpN, with detected edge locations in the interior and undetected edge locations at both boundaries.]

Figure 15: Detected Edge locations in a Grain

// Aggregate all partial ParseResults --> output is ParseResult[]
TokenPilot.Utilities.OrderedAggregateVertex
    <TokenIndices4, ParseWaveForm.ParseResult, TokenIndices2> parseResultAggregate =
    new TokenPilot.Utilities.OrderedAggregateVertex
        <TokenIndices4, ParseWaveForm.ParseResult, TokenIndices2>(
            grainEpochParser,
            (TokenIndices4 ind) => new TokenIndices2(ind[2], ind[3]), // Match
            (TokenIndices4 ind) => ind[0],                            // Order
            (TokenIndices4 ind) => ind[1],                            // Size
            graph,
            new VertexDescriptionAttribute("Combine Grain Epochs"));

Figure 16: Using the OrderedAggregateVertex to combine Tokens


// Find the first and last edge in the vector
ReduceToken<TokenIndices4, double[], TokenIndices2, MinMax> firstLastEdge =
    new ReduceToken<TokenIndices4, double[], TokenIndices2, MinMax>(
        GrainStitcher, graph,
        ind => new TokenIndices2(ind[2], ind[3]),
        ind => ind[1],
        (acc, d) => acc + new MinMax() { min = d[0], max = d[d.Length - 1] },
        MinMax.Default,
        new VertexDescriptionAttribute("First/Last Edgetime"));

Figure 17: Find the first and last edge using a MinMax data type

// Based on the first and last edge, calculate the UI
JoinTokens<TokenIndices2, int, MinMax, double[]> calculatePreciseUI =
    new JoinTokens<TokenIndices2, int, MinMax, double[]>(
        sumGrainUIs, firstLastEdge, graph,
        (T1, T2) => new double[] { T2.Data.Range / T1.Data, T2.Data.min },
        new VertexDescriptionAttribute("Precise Unit Interval"));

Figure 18: Calculate the precise Unit Interval using the JoinTokens type

intervals within the vector. At the same time, we look for the first and last edge in the vector. We do this using a ReduceToken vertex, by making use of the addition operation of the MinMax type (Figure 17). The resulting token contains a MinMax with the time of the first edge as the minimum and the time of the last edge as the maximum.

Combining the number of unit intervals and the time between the first and last edge provides a precise unit interval, under the assumption that the clock frequency is constant. TokenPilot provides a helper class to facilitate simple operations based on two tokens: the JoinTokens class, see Figure 18. The complete TokenNet is illustrated in Figures 19 and 20.

4.3 Performance

The example was benchmarked both for scalability and raw performance. To benchmark the scalability, speed tests were run using a variable degree of parallelism. To allow this, a custom scheduler was implemented which controls the number of cores used through a programmable number of worker threads. Forty repetitions were performed for each number of worker threads. A box plot showing the number of worker threads versus various percentile levels of speedup T_1/T_P is shown in Figure 21. Approximately 7x speedup was obtained on eight cores using nine worker threads. The raw performance was benchmarked by comparing the TokenPilot implementation against a traditional serial implementation. This experiment exposes the overhead in the current implementation. On the eight-core test machine we measured a speedup of about 4x when compared to the serial implementation.


[Graph, first half, with vertices: Data Source, Vector Splitter, Min/Max Per Grain, Overall Min/Max Of Vector, Histogram Setup, Repeat Histogram Setup, Grain Histogram, Overall Histogram, and Find Reference Levels. Vector-level tokens carry indices <vec#,chan#>; grain-level tokens carry <gr#,numGr,vec#,chan#>; data types include GrainedVector<double>, VectorGrain<double>, MinMax, HistogramDescription, uint[], and ReferenceLevels.]

Figure 19: First half of the complete graph


[Graph, second half, with vertices: Repeat Reference Levels, Grain Parser, Ordered Aggregate Vertex, Vector Parser (Stitcher), Calculate Grain Intervals, Minimum Grain Interval, Overall Minimum Interval, Repeat Minimum Interval, Grain Interval Count, Overall Interval Count, First/Last Edge, and Precise Unit Time Interval. Token data types include ReferenceLevels, ParseResult, ParseResult[], VectorGrain<double>, GrainedVector<double>, double[], double, int, and MinMax, with the same <vec#,chan#> and <gr#,numGr,vec#,chan#> indexing as in Figure 19.]

Figure 20: Second half of the complete graph


[Box plot "TokenPilot Jitter 8 cores 2009-10-28": speedup versus number of worker threads (1 through 10).]

Figure 21: Speedup as a function of the number of worker threads. The red line indicates ideal speedup.

5 Related Work

The problem of safely permitting as much work as possible to be carried out using a shared resource is an old one. Perhaps its earliest occurrence was in railroading. By the 1840s, railroads had developed to the point where a railroad would typically have multiple trains that needed to transit a particular route, often in opposite directions. However, few railroads had multiple parallel tracks. The practice of separating trains simply by assigning each train a static schedule was, literally, disastrous. Many railroads developed schemes to ensure that only one train occupied a segment of track at a time by requiring that a physical object called a token be carried in the cab of the locomotive. Originally, there was only one token for each segment of track. Due to the need to send the token back to the starting point if more than one train in a row needed to go the same direction, and other complications, often the token was a person, a railroad employee. This person would wear distinctive clothing, often an arm band, and a title such as Token Man, Pilot Man, or (more rarely) Token Pilot. By the latter nineteenth century, semi-automated schemes using electrically signaled devices dispensing wooden or metal tokens had largely replaced the human tokens. The term “token” and the name “TokenPilot” pay homage to this history.


Petri [11] introduced the idea of modeling concurrency by tokens that move around a directed graph. Petri nets are extremely well known and have given rise to an enormous literature and a number of related schemes, so they will not be discussed further here.

Gelernter’s Linda [8] was the most notable coordination language for concurrent and distributed systems. When using a coordination language, computation is done in a traditional, serial language. Issues of concurrency and communication are handled as separate code written in the coordination language. In Linda, there is a logically shared tuple space that is available to all threads of execution. The tuple space is a bag containing data items, each of which consists of ordered data fields. Communication and coordination are performed by operations such as (1) writing a tuple to the tuple space and (2) pattern matching, reading, and removing a tuple from the tuple space.

Intel Concurrent Collections [5] (abbreviated CnC) is the closest direct ancestor of TokenNets. As in TokenNets, code is written for CnC by writing functions in an implementation language. Each function is associated with a vertex in a graph. CnC uses tags to manage data flow in a manner analogous to tokens in TokenNets. Step tags are special token-like objects that also manage control flow. A vertex’s function can be executed when both the required tags and a matching step tag are present in the vertex. CnC graphs can have cycles, so arbitrary control structures can be created using CnC graphs. CnC graphs are specified statically, at compile time, using a textual language. A program reads the textual description of the graph and then generates C++ code that implements the parallel control and data flow functionality described by the graph. The CnC programmer writes C++ code containing the functions that the generated code calls to implement the vertices’ functionality. CnC makes use of TBB’s parallel data structures and work-stealing scheduler but not TBB’s parallel control structures.

6 Future Directions

6.1 Structured Tokens

In the current TokenNet implementation the token indices are represented by a flat tuple of integers. However, it could be argued that this approach is too simplistic, and forces the user to write code in the application that would be better supported by the library. For example, consider the situation where we are processing a data stream representing a sampled signal. It would be natural to use one index to hold the sample number. However, if the signal is represented by a large array of values we might split the array into a number of smaller grains to increase the potential for parallel processing. In such a case we would use a second index to record the grain number.

Now consider the situation where a graph vertex V needs to process all the grains within a sample before emitting a result token. The number of grains may vary from sample to sample, particularly if the sample size itself can vary. So the first problem is to know how many grains make up the ith sample. The only way the vertex responsible for splitting up the sample can communicate this information to vertex V is via the tokens themselves; there should be no other form of communication between vertices in a TokenNet graph. A token has two components, an index tuple and a data value. We must therefore either add the number of grains as an additional index to the index tuple, or create a composite data value containing this information along with the “real” data value. Neither of these alternatives is particularly attractive.
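
The following is a minimal sketch of the two alternatives. The GrainPayload type and the helper methods are hypothetical illustrations, not part of the current TokenPilot API.

using System;

// Hypothetical illustration only.
class GrainPayload
{
    public int GrainCount;     // total number of grains making up this sample
    public double[] Samples;   // the “real” data value for one grain
}

static class SplitterSketch
{
    // Alternative 1: carry the grain count as an extra index component, so the
    // index tuple becomes (sample, grainCount, grain).
    public static Tuple<int, int, int> IndexWithCount(int sample, int grainCount, int grain)
    {
        return Tuple.Create(sample, grainCount, grain);
    }

    // Alternative 2: keep the index as (sample, grain) but wrap the data in a
    // composite value that also carries the count.
    public static GrainPayload DataWithCount(int grainCount, double[] samples)
    {
        return new GrainPayload { GrainCount = grainCount, Samples = samples };
    }
}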

For a given sample index the vertex V is presumably maintaining some application-specific state that is updated as each grain for this sample is processed. Once the last grain for the sample has been analyzed the vertex will typically output a token whose data component is built from this sample state, and the storage can then be relinquished. Ideally the book-keeping involved in maintaining sample-specific storage should be delegated to the TokenNet infrastructure, not reimplemented in each new application.

Many operating systems support the notion of thread-local storage: areas of memory that are thread-specific and are automatically reclaimed when the thread expires. The concept frees the developer from maintaining explicit thread-to-memory maps and from recycling this memory. Ideally a TokenNet developer should be able to rely on a similar level of abstraction, i.e. a notion of token(set)-local storage. What is the lifetime of such storage? In our previous example the lifetime was from when vertex V first saw a grain for a particular sample until all the grains for the sample had been processed by V. More generally, it becomes clear that modeling token indices by a flat tuple is too simplistic; the indices should have some form of hierarchical structure.
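
For comparison, the sketch below shows the thread-local analogue in .NET (ThreadLocal<T> is available from .NET Framework 4 onwards); the token(set)-local storage proposed here would play the corresponding role for token index sets.

using System.Threading;

static class PerThreadScratch
{
    // Each thread that touches Buffer lazily receives its own array; the runtime
    // maintains the thread-to-memory map and reclaims the storage for us.
    static readonly ThreadLocal<double[]> buffer =
        new ThreadLocal<double[]>(() => new double[4096]);

    public static double[] Buffer
    {
        get { return buffer.Value; }
    }
}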

Currently a token index tuple such as (i1, i2, ..., in) can be viewed as an element of the index set I1 × I2 × ... × In. But there is another way of constructing such indices. The concept is perhaps easiest to describe using an example. Let us generalize our previous example to also include the concept of a channel. For example, an N-channel oscilloscope would produce multiple streams of samples, one for each channel, and these could all be interleaved on a single graph edge. In the current version of the TokenNet API we might then attach the token index (c0, s0, g#, i) to the ith grain for sample s0 on channel c0, where g# is the total number of grains in this sample.

Let us now replace the use of a Cartesian product by a mutually recursive definition of token index tuples and token index sets. Each token index set S will have a defined cardinality, possibly infinite, denoted by |S|. Given an index i, where 0 ≤ i ≤ |S|, we can construct a new token index (S, i). A token index is therefore formed from a token index set and a number. To complete the mutually recursive definition we also allow a token index set to be constructed from a token index by specifying a cardinality for the new set. The term 〈(S, i), N〉 therefore represents a new token index set, with cardinality N. To bootstrap the definitions we also define the distinguished element ⊥ as a token index. Returning to our original example, we can now form the channel index set as 〈⊥, N〉, where N is the number of channels. A particular channel index, for example for channel 1, would then be represented by (〈⊥, N〉, 1). This index could be used to build a sample index set. In this case the number of samples is potentially unbounded, and so the set would be represented by 〈(〈⊥, N〉, 1), ∞〉. A token index for sample j would then be represented by (〈(〈⊥, N〉, 1), ∞〉, j). If this sample is broken down into g# grains then the corresponding token set would have g# as its cardinality, and so on. Of course, writing tokens in this fashion is rather tedious, and so where the cardinalities of the sets can be deduced from the context we can use the original notation, simply writing the indices as an unstructured tuple.
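
The following is a minimal C# sketch of this mutually recursive definition. The class and member names are illustrative only; they are not part of the current TokenPilot API, and an unbounded cardinality is crudely approximated by long.MaxValue.

public sealed class TokenIndexSet
{
    public readonly TokenIndex Parent;   // the token index the set was built from
    public readonly long Cardinality;    // |S|; long.MaxValue stands in for “unbounded”

    public TokenIndexSet(TokenIndex parent, long cardinality)
    {
        Parent = parent;
        Cardinality = cardinality;
    }
}

public sealed class TokenIndex
{
    public static readonly TokenIndex Bottom = new TokenIndex(null, 0);   // the distinguished element ⊥

    public readonly TokenIndexSet Set;   // the set this index is drawn from (null for ⊥)
    public readonly long Value;          // the numeric index i

    public TokenIndex(TokenIndexSet set, long value)
    {
        Set = set;
        Value = value;
    }
}

static class OscilloscopeIndices
{
    // Rebuilding the oscilloscope example: N channels, channel 1, sample j,
    // a sample split into grainCount grains.
    public static TokenIndexSet GrainSet(long N, long j, long grainCount)
    {
        var channels = new TokenIndexSet(TokenIndex.Bottom, N);    // 〈⊥, N〉
        var channel1 = new TokenIndex(channels, 1);                 // (〈⊥, N〉, 1)
        var samples  = new TokenIndexSet(channel1, long.MaxValue);  // 〈(〈⊥, N〉, 1), ∞〉
        var sampleJ  = new TokenIndex(samples, j);                  // (〈(〈⊥, N〉, 1), ∞〉, j)
        return new TokenIndexSet(sampleJ, grainCount);              // cardinality g#
    }
}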

The primary advantage of the structured approach is that it makes the cardinalities of the underlying sets explicitly available to us and to other library code. We no longer have to view the grain count, g#, as an additional index. We can simply determine this value by retrieving the set that generated the token index and then querying for the cardinality of this set. The lifetime of token-local storage is also easy to define. In our previous example we wanted vertex V to maintain some local state for each token index set of the form 〈(〈(〈⊥, N〉, ci), ∞〉, sj), g#〉. When a token is received whose token index is generated from a set that does not match any of the sets currently being processed, a new storage element is created. The vertex then expects to see g# tokens built from this set, including the current one; the set, and its associated storage, can then be reclaimed. The structure imposed on the token indices allows us to automate this process. The application programmer would simply need to indicate when token-local storage is required, and the type of the data required to be stored.
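
One purely hypothetical shape for that bookkeeping, built on the TokenIndexSet sketch above, is shown below; nothing like this exists in the current TokenPilot library, and the sketch is a serial illustration (a real implementation would need to be thread-safe).

using System.Collections.Generic;

class TokenSetLocal<TState> where TState : new()
{
    readonly Dictionary<TokenIndexSet, TState> states = new Dictionary<TokenIndexSet, TState>();
    readonly Dictionary<TokenIndexSet, long> seen = new Dictionary<TokenIndexSet, long>();

    // Returns the state for the given set, creating it on first use. 'finished'
    // becomes true once Cardinality tokens drawn from the set have been handed
    // over, at which point the state and its bookkeeping are reclaimed.
    public TState For(TokenIndexSet set, out bool finished)
    {
        if (!states.ContainsKey(set))
        {
            states[set] = new TState();
            seen[set] = 0;
        }
        finished = ++seen[set] == set.Cardinality;
        TState state = states[set];
        if (finished)
        {
            states.Remove(set);
            seen.Remove(set);
        }
        return state;
    }
}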

The developments described in this section are simply a proposal at this stage, and are not supported by the current version of the API. However, if the TokenNet approach were to be developed further then it would be worth exploring in more detail the merits of such a proposal, and its ramifications on the implementation of the library. From the perspective of the debugger the changes would be relatively minor. At present the maintenance of any token-local storage would be hidden inside the application code, and therefore invisible to the graph viewer. If this state were directly supported by the TokenNet library itself then it would be natural to expose the mapping in the graph view. A developer would be able to inspect the current set of active token sets in a vertex, drilling down into the state associated with each set when required.

6.2 Typed Ports

The current implementation of the TokenNet API defines different vertex base classes based on the number of incoming edges. Maintaining this code is tedious, as all the classes follow the same basic pattern. Furthermore, it imposes an artificial upper limit on the number of edges supported by the library; it is unlikely, for example, that a developer wanting ten incoming edges to a vertex will find that the library defines a vertex class supporting this number of edges. Each edge is associated with its own token type. But tokens, as we have seen earlier, have two components: a token index tuple and a data type. The library uses generic types to abstract away from specific types for these components, and so each edge is associated with two type parameters, one for the tuple index and one for the data. Whilst such an approach works well for vertices with a single incoming edge, as the number of edges rises the syntactic clutter arising from all these type arguments starts to overwhelm the code.
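
The sketch below illustrates the problem. Vertex1In and Vertex3In are hypothetical stand-ins for the per-arity vertex base classes, not the actual TokenPilot class names.

using System;

class Vertex1In<TI1, TD1, TOI, TOD> { }
class Vertex3In<TI1, TD1, TI2, TD2, TI3, TD3, TOI, TOD> { }

// With one incoming edge the declaration stays readable:
class Scale : Vertex1In<Tuple<int, int>, double[],       // in:  index, data
                        Tuple<int, int>, double[]> { }   // out: index, data

// With three incoming edges the type arguments start to dominate the code:
class Combine : Vertex3In<Tuple<int, int>, double[],     // edge 1
                          Tuple<int, int>, float[],      // edge 2
                          Tuple<int>,      int,          // edge 3
                          Tuple<int, int>, double[]> { } // output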

In the context of a graph viewer there is a second problem. The API orders the incoming edges to a vertex, whereas a typical graph layout algorithm assumes the edges are unordered. Forcing a layout algorithm to respect the ordering, for example from left-to-right or top-to-bottom depending on the graph orientation, would severely constrain the degrees of freedom of the layout, potentially leading to greatly increased layout times and poor layouts. Unconstraining the layout algorithm and then annotating the edges with their index might also produce confusing results. The current graph viewer assumes that the context provides enough information to allow the user to deduce edge indexes from the graph view. For example, if the source of the first edge is a vertex of type T1 and the source of the second edge is a vertex of type T2, then we rely on the user being able to distinguish vertices of these types in the graph view to determine which edge in the view corresponds to the first edge in the API.

A TokenNet graph exchanges data over typed edges. It can be argued that the focus of the current API is at too high a level, and that we should really focus on typed input and output ports, and the edges that connect them. In this view a vertex contains a set of input ports and output ports. The “work” function associated with the vertex retrieves tokens from the input ports, processes them, and then emits one or more tokens on the output ports. Given such a starting point it is then possible to define the existing API on top of this model if needed, but we can also package up the functionality in other ways as well. Unfortunately there is a complication. The code we need to execute when receiving a token depends on the number and types of all the input ports. This is due to the TokenNet token matching semantics in the case of multiple edges. The current implementation handles this complexity by defining different code to handle each possibility, hence the need for multiple vertex classes. We do not want the user to have to write all this boilerplate code within the application itself.
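
One possible shape for such ports is sketched below. These interfaces are illustrative only and are not the existing TokenPilot API.

public interface IInputPort<TIndex, TData>
{
    // The matching code delivers matched tokens here for the vertex’s work
    // function to consume.
    bool TryTake(out TIndex index, out TData data);
}

public interface IOutputPort<TIndex, TData>
{
    // The work function emits result tokens through output ports.
    void Post(TIndex index, TData data);

    // An edge is formed by linking an output port to a type-compatible input port.
    void LinkTo(IInputPort<TIndex, TData> input);
}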

One approach to supporting typed ports as the basic building block uses partial classes and a two-pass compilation strategy. Each vertex is defined using a partial class unique to the vertex. Using a partial class it is possible to split the definition of a class, a struct, or an interface over two or more source files. Each source file contains a section of the class definition, and all parts are combined when the application is compiled. The class corresponding to each vertex defines a set of input and output ports, and an execute method. The input and output ports are simply interfaces with no supporting implementation. This allows us to compile our application, but it will produce a run-time error if we attempt to execute it. After compilation we run a separate TokenNet code generator that uses .NET reflection to find all the vertex classes within the TokenNet application, identifies the input ports within these classes, and then generates the matching part of the partial class. The generated code defines the implementation for each input port, including the code that performs the token matching across all the input ports. The files that are generated by this process are added to the original solution and then the C# compiler is used once again to recompile the application. The result is an application that is completely defined, and so can be executed. The application of the TokenNet code generator, and the second compilation phase, can be driven by a post-build step within Visual Studio, and so the user is largely unaware of it other than the increased compilation time.
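
A sketch of the hand-written half of such a vertex is shown below, reusing the hypothetical port interfaces introduced above; the generated half of the partial class would supply the port implementations and the token-matching code.

using System;

public partial class Mixer
{
    public IInputPort<Tuple<int, int>, double[]> Signal;      // incoming edge 1
    public IInputPort<Tuple<int, int>, double[]> Reference;   // incoming edge 2
    public IOutputPort<Tuple<int, int>, double[]> Product;    // outgoing edge

    // Invoked by the generated matching code once a token is available on every
    // input port for the same index.
    void Execute(Tuple<int, int> index, double[] signal, double[] reference)
    {
        var result = new double[signal.Length];
        for (int k = 0; k < signal.Length; ++k)
            result[k] = signal[k] * reference[k];
        Product.Post(index, result);
    }
}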

The post-processor ensures that the arguments of the execute method match the names and carrier types of the input ports. Although the arguments are still ordered in the work method, this ordering is irrelevant; reordering the parameters would simply result in a reordering of the generated code. The edges are wired up by name, not position, as each output port is explicitly linked to the appropriate named input port.

For vertex classes with no explicitly defined constructors, the code generator can define a nullary constructor in the generated partial class that initializes the input and output ports. It would typically do this by assigning them instances of auxiliary port classes, also defined by the generator, each one capturing the required matching semantics. The situation becomes more complex where there are explicit constructors defined in the user code for the vertex class. In this case we need a little help from the user, in the form of a call to a method to initialize the ports from within the constructor. There are two choices here. If it is acceptable to derive all vertex classes from a common vertex base class, then we can define an empty virtual method to initialize the ports within this class. The generated code would then override this method for each vertex class, initializing the ports contained within the class. If, on the other hand, requiring a common base class is too restrictive, noting that in C# a class can only be derived from a single class, an alternative strategy is to rely on conditional compilation. We cannot simply call a method defined in the generated code, as this will not exist at the time of the first compilation pass. However, we can define a compilation symbol that is only defined during the second compilation. We can then require users to call the code to initialize the ports from their constructors, but wrap the call inside an #if/#endif section so that it only gets processed during the second compilation, at which point the corresponding method will be defined.
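
The sketch below shows how the user’s half of the partial class might look under this strategy. The compilation symbol TOKENNET_PASS2 and the InitializePorts method are hypothetical; InitializePorts is assumed to be emitted into the generated part of the class during the second pass.

public partial class Mixer
{
    readonly double gain;

    public Mixer(double gain)
    {
        this.gain = gain;
#if TOKENNET_PASS2
        // Only compiled during the second pass, when the generated part of the
        // partial class (and hence InitializePorts) exists.
        InitializePorts();
#endif
    }
}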

It is a simple task to implement the existing TokenNet API on top of a port-based alternative. But there are other ways of packaging up this raw functionality that provide interesting alternatives to the current approach. For example, we could define abstractions to support structured graphs in the StreamIt style [13], using pipelines, split-joins and feedback loops, and then call a link method to connect up all the ports in the resulting graph. This allows us to simplify the construction of the graph, when compared to the existing TokenNet approach, but delays the recognition of typing errors until the call to the link method. For example, we might define

public interface StreamGraph { void Link(); }

public interface TypedStreamGraph<I, O> : StreamGraph, ... {}

and then define a PipelineStreamGraph<I,O> class derived from TypedStreamGraph. The PipelineStreamGraph would maintain a list of StreamGraph instances, and so instantiating a PipelineStreamGraph instance would be simple using C#’s collection initializer syntax, e.g.

new PipelineStreamGraph<TI, TO> {
    new Vertex1(),
    new Vertex2(),
    ...,
    new VertexN()
}

Of course, at this point there is no guarantee that the output port of a vertex in the pipeline matches the input port type of the next vertex in the pipeline. It is only at the point when the ports are linked together that such errors will be exposed. Defining an appropriate link method in a type-safe fashion can be challenging. Fortunately, a combination of generic methods and dynamic types is sufficient for this task. For example, we might define

public static void Link<I, T, O>(TypedStreamGraph<I, T> upStream,
                                 TypedStreamGraph<T, O> downStream) {
    upStream.Output.LinkTo(downStream.Input);
}

and then implement the Link method within the PipelineStreamGraph by

public override void Link() {
    foreach (StreamGraph subgraph in subgraphs) subgraph.Link();
    for (int i = 1; i < subgraphs.Count; ++i) {
        dynamic upstream = subgraphs[i - 1];
        dynamic downstream = subgraphs[i];
        Link(upstream, downstream);
    }
    ...
}

Note that the use of dynamic should not suggest that any form of dynamic type-checking is used during the execution of the TokenNet graph. All ports, and hence edges, are strongly typed. We simply use the dynamic mechanism to establish the links. If an erroneous pipeline is constructed, where the typing is not consistent, this will result in the C# runtime being unable to determine the appropriate Link method to call, and an exception will be generated.

7 Conclusion

TokenNets is a scheme for expressing parallel programs in which, instead of saying how the program is to run in parallel, the programmer says how the parallelism needs to be constrained in order to achieve correct results. We have built TokenPilot, a first implementation of TokenNets. Experiments so far indicate that this style of programming can lead to scalably parallel programs.

TokenPilot only partially achieved one of our goals, namely making all other synchronization mechanisms superfluous. However, we propose to add structured tokens to TokenNets, which would make the use of other synchronization mechanisms unnecessary in the use cases we have identified so far.

We have written a significant measurement computation in TokenPilot, demonstrating that TokenPilot graphs can be composed into full computations while retaining reasonable performance.

References

[1] –, IEEE Standard on Transitions, Pulses, and Related Waveforms, IEEE Standard 181-2003, 2003.

[2] –, “Intel Threading Building Blocks for Open Source,” Intel Corporation, http://www.threadingbuildingblocks.org. Accessed 8 December 2009.

[3] –, Parallel Computing Toolbox 4 User’s Guide, The MathWorks, Inc., 2009.

[4] Blumofe, R. D., et al., “Cilk: an efficient multithreaded runtime system” in ACM SIGPLAN Notices, vol. 30, no. 8, pp. 207-216, August 1995.

[5] Budimlic, Z., et al., “Multi-core implementations of the concurrent collections programming model” in CPC ’09: 14th International Workshop on Compilers for Parallel Computers, Springer, January 2009.

[6] Dean, J. and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters” in Proc. OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.

[7] Chandra, R., et al., Parallel Programming in OpenMP, Morgan Kaufmann, San Francisco, 2001.

[8] Gelernter, D., “Generative communication in Linda” in ACM Trans. Programming Languages and Systems, vol. 7, no. 1, pp. 80-112, January 1985.

[9] Herlihy, M. and Shavit, N., “Work Distribution” in The Art of Multiprocessor Programming, Elsevier, San Francisco, pp. 381-392, 2008.

[10] Karp, A. H. and Flatt, H. P., “Measuring Parallel Processor Performance” in Comm. ACM, vol. 33, no. 5, May 1990.

[11] Petri, C. A., Kommunikation mit Automaten, doctoral dissertation, Universität Bonn, 1962.

[12] Reinders, J., Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism, O’Reilly Media, Sebastopol, California, 2007.

[13] Thies, W., Language and Compiler Support for Stream Programs, doctoral dissertation, Massachusetts Institute of Technology, 2009.

[14] Toub, S. and H. Shafi, “Improved Support For Parallelism In The Next Version Of Visual Studio,” in MSDN Magazine, October 2008.
