Raspberry Pi Application & Monitoring Agathangelos Stylianidis August 26, 2016



Abstract

The presented project produced a framework for the A* search algorithm that can utilise multiple processing elements. To implement the A* search, state-of-the-art approaches to parallel priority queues have been investigated. The framework was tested using a test case of the Travelling Salesman Problem on a Raspberry Pi cluster developed by EPCC (Wee Archie). During the framework evaluation the Wee Archie performance was monitored and the resulting conclusions are presented.


Contents

1 Introduction
2 Literature Review
  2.1 A* search algorithm
    2.1.1 Implementation of the A* search algorithm
    2.1.2 A* Search Algorithm Prior Work
  2.2 Priority Queues
    2.2.1 Skip-List
    2.2.2 Spray-List priority queue
    2.2.3 Multiple, Relaxed Concurrent Priority Queues (MultiQueues)
    2.2.4 Lock-free k-LSM Relaxed Priority Queues
  2.3 A* Search Problems
    2.3.1 The Art Gallery Problem (AGP)
    2.3.2 The Travelling Salesman Problem (TSP)
3 Implementation
  3.0.1 Raspberry Pi & Wee Archie Cluster
  3.1 Proposed Framework
    3.1.1 The main algorithm
    3.1.2 Framework Interface
    3.1.3 Framework test cases
4 Monitoring
  4.1 Methodology
  4.2 Monitoring Sessions Analysis
    4.2.1 Initial data from idle state
    4.2.2 Monitoring test case 1
    4.2.3 Second monitoring session
    4.2.4 Monitoring Conclusions
5 Conclusions
  5.1 Summary
  5.2 Further Work
  5.3 Conclusions

Appendices
A AGP Test case
  A.0.1 Why AGP is not suitable as an A* test case


List of Figures

2.1 A* search algorithm example [22]
2.2 A* search algorithm example showing the adjacent nodes of the starting node
2.3 A* search algorithm example showing the solution of the minimum cost path from the starting node to the target node
2.4 Binary Search Tree Implementation
2.5 Skip List representation of the binary tree in Figure 2.4, [1]
2.6 Example of a search operation in a Skip List
2.7 Insert operation using merging in LSM queues [27] (this figure presents the insert operation of LSM queues, but this specific example may never occur in real situations)
2.8 Example of a y-monotone polygon; the dashed line shows the intersection of the polygon with the line ℓv
2.9 Example of an x-monotone polygon; the dashed line shows the intersection of the polygon with the line ℓv
2.10 Example of a triangulated polygon

3.1 Example of the framework reduce operation
3.2 Example of a situation where PE1 has locked access to Q1
3.3 Example of a situation where PE1 has locked access to Q1 and Q2
3.4 Priority Queue Interface
3.5 Support Class interface
3.6 Support Class interface cont.

4.1
4.2
4.3
4.4 Raspberry Pi 1 CPU core 1
4.5 Raspberry Pi 1 CPU core 2
4.6 Raspberry Pi 1 CPU core 3
4.7 Raspberry Pi 1 CPU core 4
4.8 Memory usage of Raspberry Pi 01
4.9 Wee Archie network traffic
4.10 Raspberry Pi 1 CPU core 1
4.11 Raspberry Pi 1 CPU core 2
4.12 Raspberry Pi 1 CPU core 3
4.13 Raspberry Pi 1 CPU core 4
4.14 Wee Archie Memory usage
4.15 Memory usage of Raspberry Pi 1, at monitoring session 2
4.16 Network usage of Raspberry Pi 1, at monitoring session 2

A.1 The initial polygon
A.2 The initial polygon split into y-monotone polygons
A.3 The triangulation of the initial polygon
A.4 The A* search is ready to start from node µ
A.5 Colours set at the vertices of the triangle containing µ
A.6 A* search is ready to explore node ν and colour vertex v1


Glossary

9-puzzle A sliding puzzle consisting of a square divided into 9 (3 × 3) smaller squares, one of which is missing. The goal is to place the existing squares in a particular order.

Best-First Search A category of algorithms that try to solve a search problem by using a heuristic function to guide the search towards a solution.

Close list of A* The close list keeps track of all the states that have been explored, to avoid repeating the same operations again. A data structure such as a hash-map can be used for the close list.

Minimum Spanning Tree A subset of the edges of a connected, weighted graph that connects all the vertices with the minimum possible total edge weight.

Open list of A* A data structure that A* uses to track states that have been reached but not yet searched. It should combine the characteristics of a priority queue with fast access to the stored elements. It needs the characteristics of a priority queue because it must be able to insert items and perform the DeleteMin operation, which returns the highest-priority element. It also needs fast access to the stored elements because it must respond quickly to queries such as whether an element exists in the open list.

Search Space The search space contains all the possible instances of a problem.

TSP Travelling Salesman Problem.


Chapter 1

Introduction

Search algorithms are widely used in artificial intelligence, navigation, and game playing (for example, in chess). This dissertation describes the development of a framework for the A* search algorithm that enables users with minimal knowledge of parallel programming to apply a parallel A* search to their own problems. The framework provides the interface to implement a search problem while hiding the parallel A* implementation from the user. However, the framework internals are documented so that users can better understand what is happening inside and customise the framework to their needs.

A Raspberry Pi cluster was used as the implementation platform: more specifically, a Raspberry Pi cluster called Wee Archie, developed by EPCC, was used in this project. The Raspberry Pi is a platform developed for educational purposes, combining accessible programming environments with low cost. At the time of writing,1 a Raspberry Pi board costs approximately 30 British Pounds, making it attractive to researchers and enthusiasts with limited budgets. Its power consumption of 4 Watts gives a low running cost even when it is used in cluster environments. Because of its low cost it is an attractive platform for experimental and educational clusters.

Wee Archie is used by EPCC at science festivals and in schools [10] to demonstrate the concepts of parallel programming in an accessible form. The current work could be used during exhibitions to give the audience the opportunity to solve a problem using Wee Archie.

During this work a parallel framework for the A* algorithm was developed that runs on a Raspberry Pi cluster. The implementation deals with real parallel programming problems, such as implementing a parallel priority queue. Afterwards, the framework was used to implement a solution to the TSP. While testing the TSP implementation, the cluster was monitored and the monitoring data have been analysed.

The developed framework can be used to solve simple problems in parallel, such as Sudoku [2] or Survo puzzles [4], or to tackle more complex problems like the

1 19/08/2016


Travelling Salesman Problem, which has applications in route planning and electronic circuit design.

To cope with the previously mentioned problems (Sudoku, Survo, TSP), the developed framework uses an A* variant whose main underlying data structure is a priority queue. Queues are one of the real challenges in parallel computing: it is difficult to implement an efficient queue that uses multiple PEs. If it is hard to implement a simple queue (considered a trivial data structure in single-threaded programming), it is much harder to implement a parallel priority queue. Section 2.2 discusses the latest approaches to parallel implementations of priority queues, and Section 3.1 discusses and justifies the implemented approach.

When the implementation was ready, it was tested on Wee Archie with a monitoring application collecting data on Wee Archie's operation, in order to better understand how Wee Archie deals with the implemented application.


Chapter 2

Literature Review

The core of the project concerns the implementation of a parallel search framework. To this end, the theory of A* search and supporting structures such as priority queues are discussed in this chapter. The chapter concludes with an overview of the problems used to evaluate the search framework.

This thesis proposes a variation of the A* search algorithm based on MultiQueues, multiple relaxed concurrent priority queues. The goal of the proposed algorithm is to provide a parallel implementation based on A* search. The implementation is evaluated using the Travelling Salesman Problem (TSP).

2.1 A* search algorithm

A* is a member of the Best-First Search family of algorithms [17], which utilises greedy and/or heuristic search to find the lowest-cost solution in a search space. The A* algorithm can be used in problems that search for low-cost paths, such as telecom traffic routing, network traffic routing, and maze navigation [17]. A classic problem it has been applied to is the TSP, which will be discussed later in more detail.

The central concept of the A* search algorithm is to keep track of the minimum cost path after each step. Consider two points (a start and a target point) as in Figure 2.1: the blue node is the starting point and the red node is the target point. Between those two nodes there is a wall (black square). The goal of this example is to find the lowest-cost path (fewest nodes) from the starting node to the target node. To accomplish this, the A* search algorithm uses two lists, referred to as the Open list of A* and the Close list of A*. The open list holds all the nodes that are candidates to visit during the next steps, while the close list holds all the nodes that have been visited and are probably included in the final path. The nodes in the open list are all the nodes that have been reached but not yet explored. Figure 2.2 shows the adjacent nodes of the starting node (candidates for the next step) in grey.


Figure 2.1: A* search algorithm example [22]

Figure 2.2: A* search algorithm example showing the adjacent nodes of the starting node

The search ends when A* tries to explore a target node. The A* algorithm uses Equation 2.1 as the heuristic function to evaluate each state. The value returned by the heuristic function is used to prioritise the states in the open list.

f(n) = g(n) + h(n) (2.1)

Where

• n is the candidate node to be added to the close list

• f(n) is the minimum estimated cost from the starting node, passing through node n, to the target node

• g(n) is the cost to reach node n from the starting node

• h(n) is the minimum estimated cost from node n to the target node based on someheuristics.

Finally, Figure 2.3 shows, in green, the minimum cost path from the starting node to the target node as estimated by the A* search algorithm. The pseudocode for the A* algorithm is shown in Algorithm 1.


Figure 2.3: A* search algorithm example showing the solution of the minimum cost path from the starting node to the target node

Algorithm 1 A* search algorithm pseudocode
Input: Problem Initial State (Node root)
Output: Problem Solution

1:  Node currentNode ← root
2:  openList ← ∅
3:  closeList ← ∅
4:  openList.add(currentNode)
5:  while (currentNode ← openList.pop()) ≠ NULL do
6:      if isSolution(currentNode) then
7:          return currentNode
8:      else
9:          closeList.add(currentNode)
10:         successorStates[] ← expand(currentNode)
11:         for each succ in successorStates do
12:             if !openList.contains(succ) AND !closeList.contains(succ) then
13:                 openList.add(succ)
14:             else if openList.contains(succ) with lower f(n) then
15:                 replace old instance with succ
16:             end if
17:         end for each
18:     end if
19: end while
20: return NULL    // No solution was found
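The pseudocode above can be sketched as a short runnable example. This is a minimal Python illustration, not the framework's implementation: the standard-library heapq module stands in for the open-list priority queue, and the grid, wall positions, and Manhattan heuristic are invented to mirror the figure's example.

```python
import heapq

def a_star(start, goal, neighbours, h):
    """Minimal A* sketch: neighbours(n) yields (next_node, step_cost),
    h(n) is an admissible heuristic estimate of the cost to the goal."""
    open_list = [(h(start), start)]          # priority queue keyed on f(n)
    g = {start: 0}                           # best known cost to reach each node
    parent = {start: None}
    closed = set()                           # the close list
    while open_list:
        _, current = heapq.heappop(open_list)
        if current == goal:                  # reconstruct the path
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        if current in closed:
            continue                         # stale queue entry, skip it
        closed.add(current)
        for succ, cost in neighbours(current):
            new_g = g[current] + cost
            if succ not in g or new_g < g[succ]:
                g[succ] = new_g
                parent[succ] = current
                heapq.heappush(open_list, (new_g + h(succ), succ))
    return None                              # no solution was found

# 4x4 grid with a small wall, loosely echoing the example figure
wall = {(1, 1), (1, 2)}
def neighbours(n):
    x, y = n
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < 4 and 0 <= ny < 4 and (nx, ny) not in wall:
            yield (nx, ny), 1

goal = (3, 3)
h = lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])  # Manhattan distance
print(a_star((0, 0), goal, neighbours, h))
```

Because the Manhattan heuristic is admissible here, the returned path is optimal (7 nodes for this grid).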

2.1.1 Implementation of the A* search algorithm

The A* search algorithm can be implemented using an abstract tree representation of the whole search space. A main loop iterates until it finds the target node. For the purpose of this explanation, assume that the open list is implemented using a priority queue. Priority queues provide low-cost queue operations based on a priority assigned to each element


that is added to the queue. Every time the f(n) function is estimated, node n is removed from the priority queue and re-inserted with its new g(n) value. Priority queues are described in Section 2.2. The time complexity of the A* search algorithm depends on the heuristic function h(n) and the g(n) function [20]. More specifically, the time and space complexity is O(b^d), where d is the number of nodes in the shortest path and b is the number of adjacent nodes at each step. In the worst case, the space grows exponentially.

2.1.2 A* Search Algorithm Prior Work

The A* search algorithm initially had a sequential implementation. Subsequently, different sequential and parallel implementations have been proposed. This thesis focuses on parallel implementations. There are many parallel variations of A*; this section discusses the implementations most relevant to this work.

One of the earliest works [32], referred to as PRA*, suggests parallelising the processing of the open and close lists by assigning each list to a separate processor. When a node is added to the close list, the processors synchronise.

PLA* [13] also parallelises the processing, but in this case each processor has its own open and close lists. The authors propose a novel parallel startup phase and efficient dynamic work distribution strategies. They use work transfer as a means to tackle the problem of duplicate searching by different processors. The time complexity of the A* algorithm is reduced to O(log(x)). They evaluate this parallel implementation of the A* algorithm using the TSP. The TSP is detailed in Section 2.3.2 and will be used as a test case.

Another work, called Transposition-table driven work scheduling (TDS) [30], integrates a parallel search algorithm with a transposition table. TDS is used to distribute the work for the A* search algorithm. TDS makes all the communication asynchronous, overlaps communication with computation, and reduces search overhead. To accomplish this, TDS sends the work to the processor that manages the associated transposition-table entry. When this is done, the sending processor can process additional work without having to wait for any results to come back. This results in asynchronous communication and reduces the overheads.

Another paper [18], referred to as Hash Distributed A* (HDA*), combines the ideas of PRA* and TDS. HDA* was implemented and tested using the Message Passing Interface (MPI) [6]. It has an Open list of A* and a Close list of A* that are distributed across all of the available processors. Each processor is responsible for a portion of the search space. The algorithm is based on a hash function that splits the search space and divides it across the processors. Each processor runs a loop until the optimal solution is found. This loop is split into two phases. During the first phase, each processor checks whether a new state has been received in its message queue. If a state has been received, the processor checks whether the same state exists in its own close list and, if not, pushes it into the open list. During the second phase, the processor gets the highest-priority element from its private open list and expands it. With the


expansion of a state, new states are generated and a hash value for each of them is computed. Each new state is then sent to the processor that is responsible for its hash value.
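The hash-based ownership idea behind HDA* can be sketched as follows. This is an illustrative sketch, not HDA* itself (which uses MPI): the hash choice and state encoding are assumptions, and the "rank" is just an index standing in for a processor.

```python
import hashlib

def owner_rank(state, num_ranks):
    """Hypothetical sketch: a stable hash of the state decides which
    processor (rank) owns it, partitioning the search space."""
    digest = hashlib.md5(repr(state).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_ranks

# Three invented TSP-style partial tours, distributed over 4 ranks.
states = [("A", "B", "C"), ("A", "C", "B"), ("B", "A", "C")]
owners = [owner_rank(s, 4) for s in states]
print(owners)  # each state maps deterministically to one of the 4 ranks
```

Because the hash is deterministic, every processor that generates the same state sends it to the same owner, which is what lets the owner deduplicate against its local close list.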

These A* implementations were developed for CPU platforms. There are also implementations of the A* search algorithm that can use graphics processing unit (GPU) platforms. The work in [38] proposes a parallel variant of the A* search algorithm, called GA*, able to run on a GPU in a massively parallel fashion. This work shows that the GPU-based parallel implementation of A* can provide a 45x speed-up over the sequential implementation of A* on a CPU or GPU platform.

2.2 Priority Queues

To represent the open list, the A* search algorithm uses a data structure that supports priority queue operations and provides fast responses to queries, such as whether an element is in the priority queue or not.

Priority queues store elements based on a priority assigned to each element added to the queue. There are various works related to priority queues. In [19] Knuth describes the priority queue as a data structure where each element is added with a key (priority value). According to Knuth's definition, priority queues have to provide two operations, insert and deleteMin. The insert operation inserts an element into the queue, and deleteMin extracts the element with the highest priority. The key is an attribute related to the inserted element; in this case, the key is the value of the f(n) function.

The first element to be removed from a priority queue is the one with the lowest key value. Similarly, when priority queues are used to implement the A* search algorithm, each time a state is added to the priority queue an f(n) value is estimated and used as the associated key to store the state in the queue. The first state removed from the queue is the one with the minimum f(n) value (associated key). The parallel implementations of the A* search algorithm described above face the problem that some sequential parts cannot be parallelised [38]. One of these parts is the set of basic operations (search, insert, delete) of the priority queue. For applications in which the computation of the heuristic function is relatively cheap, the priority queue operations become the most time-consuming part of A* search. There are several solutions to this problem. One of them is to use lock-free concurrent priority queues. Lock-free implementations guarantee that, regardless of the contention caused by concurrent operations (search, insert, delete) and the interleaving of their sub-operations, at least one operation always makes progress [34].
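Knuth's two operations, keyed on f(n), can be sketched with Python's standard-library heapq, a binary-heap priority queue (the keys and state names here are invented for the example):

```python
import heapq

# Minimal sketch of the two priority-queue operations: insert and deleteMin.
queue = []
for key, state in [(7, "A"), (3, "B"), (5, "C")]:
    heapq.heappush(queue, (key, state))   # insert: store the state under its f(n) key

key, state = heapq.heappop(queue)         # deleteMin: lowest key comes out first
print(key, state)                         # → 3 B
```

In the sequential case this is all A* needs; the difficulty discussed in this section is making these two operations efficient when many PEs use the queue at once.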

Four of the most commonly used concurrent priority queues are described below.


2.2.1 Skip-List

Concurrent priority queues can be designed using a skip-list rather than a heap data structure. In [33], the authors propose a concurrent priority queue based on a skip-list. Skip-lists are linked lists with the characteristic that each node can have references to multiple nodes ahead instead of one. Skip-lists keep the elements in an ordered list, but each element in the list is part of a number of sub-lists.

Assume we want to represent the binary search tree illustrated in Figure 2.4 using a skip-list. As the figure shows, the binary search tree has a minimum value of 1 and a maximum value of 17. A binary search tree keeps the values in each node in sorted order (from left to right). When looking for a specific value in the tree, the tree has to be traversed from the root towards a leaf, making comparisons with the sorted values of each node. Each comparison specifies the sub-tree in which the search continues.

Figure 2.4: Binary Search Tree Implementation (node values: 8, 5, 3, 1, 7, 6, 12, 11, 9, 17, 15)

Figure 2.5 shows the skip-list representation of the binary search tree in Figure 2.4. As can be seen from the figure, the lowest level consists of all the nodes in sorted order [1]. Moving to the upper layers, a skipping function is applied; in this example the skip function skips two elements at each layer.

Search Operation in the Skip List

A search operation in a skip-list starts from the head and proceeds horizontally. If the element being searched for is equal to the current element, the search operation ends successfully. If the next element is smaller than or equal to the search element, the search continues horizontally to it. If the next element in the list is greater than the search element, the procedure is repeated after returning to the previously checked element and dropping down (moving vertically) to the next lower list (lower layer). Assume the search element is 6 in the previous example of Figure 2.5. The search operation starts from the top layer, layer 5. In this layer the first node has the value −∞ and the last has the value +∞, so the search continues vertically to the lower layer, layer 4, where the first node has the value 1 and


Figure 2.5: Skip List representation of the binary tree in Figure 2.4, [1]

the last +∞. Because there is no node with a value greater than 1 and smaller than or equal to the search element (6), the search continues to the lower layer from the node with value 1. In this layer the search element is smaller than 12, so the search continues to the lower layer, layer 2. In this layer 6 is smaller than 7 and bigger than 1, so the search moves vertically from the node with value 1 to layer 1, where the search element is larger than the node with value 5 and smaller than the node with value 7. The search therefore continues vertically down to layer 0 from the node with value 5 and finally reaches the node with value 6. The final search path is illustrated in Figure 2.6.
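The layered search just described can be sketched compactly. This is an illustrative sketch, not the concurrent skip-list of [33]: it assumes, as in the skipping-by-two example above, that each layer keeps every second element of the layer below, so an index j in one layer corresponds to index 2j in the layer beneath it.

```python
def build_layers(sorted_values, top=4):
    """Build skip-list layers bottom-up: each layer keeps every second
    element of the layer below. Returns layers with the top layer first."""
    layers = [sorted_values]
    while len(layers) < top and len(layers[-1]) > 1:
        layers.append(layers[-1][::2])       # the skipping-by-two function
    return layers[::-1]

def skip_search(layers, target):
    """Descend from the top layer, moving right while the next element
    does not overshoot the target, then dropping down a layer."""
    pos = 0
    for layer in layers:
        while pos + 1 < len(layer) and layer[pos + 1] <= target:
            pos += 1                         # move horizontally
        if layer[pos] == target:
            return True                      # found the element
        pos *= 2                             # same element, one layer down
    return False                             # not present in the list

# The node values of Figure 2.4 in sorted order.
values = [1, 3, 5, 6, 7, 8, 9, 11, 12, 15, 17]
layers = build_layers(values)
print(skip_search(layers, 6), skip_search(layers, 4))  # → True False
```

Searching for 6 follows essentially the path walked through in the text: right along the sparse upper layers, then down whenever the next element would exceed 6.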

Figure 2.6: Example of a search operation in a Skip List


Insert and Delete Operations in the Skip List

Insert and delete operations are implemented similarly to those of linked lists, with the difference that elements must be inserted into or deleted from more than one layer [1]. During the insert operation the following steps have to be performed:

1. Search for the appropriate position to insert the new element.

2. Insert the new element into the lowest layer.

3. Insert the new element into the upper layers if necessary.

Similarly, during the delete operation the following steps have to be performed:

1. Search for the element that has to be deleted.

2. Delete the element from the current layer.

3. Delete all the occurrences of this element from the skip-list.

Skip-lists can be accessed either sequentially or in parallel. In a multithreaded implementation each thread can traverse more than one layer to perform an operation (search, delete, insert). The main bottleneck of the skip-list implementation is the DeleteMin operation, in which a delete takes place on the minimum value of the list. In a skip-list, all the running threads have to decide who gets the node with the minimum value. This results in contention and limited scalability [23]. The next subsection explains how the SprayList overcomes the DeleteMin problem of the skip-list.

2.2.2 Spray-List priority queue

SprayLists [7] are an improvement on skip-lists, using an alternative algorithm for performing the DeleteMin operation. The DeleteMin operation avoids the sequential bottleneck of the skip-list by "spraying" the threads across the elements near the head of the skip-list. Spraying is implemented using a carefully designed random walk (random access to the nodes), so that DeleteMin returns an element among the first in the list with high probability. This work shows good scalability with the number of threads [7].

2.2.3 Multiple, Relaxed Concurrent Priority Queues (MultiQueues)

Another implementation of concurrent priority queues is the Multiple, Relaxed Concurrent Priority Queues implementation, referred to as MultiQueues [29]. In this implementation the DeleteMin operation reduces the performance overhead by using multiple sequential priority queues.

More specifically, this implementation uses multiple sequential priority queues. The idea behind this implementation is to have c × p sequential priority queues, where c is a


tuning parameter (c > 1) and p is the number of parallel threads or processing elements (PEs). Having more queues than threads ensures that the contention during the operations (delete or insert) remains small. In this implementation, each queue holds a representative sample of all the elements.

Each queue is protected by a lock. When a processing element wants to access a queue, it checks among the available queues and accesses the first available one (an available queue is a queue that is not locked). However, locking can induce serious performance overheads. For this reason the MultiQueues implementation takes advantage of relaxed operations: the procedure is relaxed by allowing it to delete an element with a non-minimal value when it cannot delete the minimum-value element.

It is important to note here that the operation is characterised as wait-free because it never waits for a locked queue. Each operation is described in more detail below.

Insert Operation:

1. The PE chooses the first available queue that is not locked.

2. Lock queue.

3. Insert element.

4. Unlock queue.

Delete Minimum Operation:

1. The PE chooses the first two available queues that are not locked.

2. Lock queues.

3. Compare minimum key value items of the two queues.

4. Execute delete minimum operation on the queue with the minimum key value.

5. Unlock queues.

The delete operation chooses two queues instead of one to avoid major fluctuations in the distribution of the queue elements.

Our proposed framework is based on the MultiQueues implementation.
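The insert and delete-minimum procedures above can be sketched in a single process. This is a minimal illustration, not the framework's implementation: the class and method names are invented, Python's threading.Lock stands in for the per-queue locks, and each sequential queue is a binary heap.

```python
import heapq
import random
import threading

class MultiQueue:
    """Sketch of the MultiQueues scheme: c * p sequential binary heaps,
    each guarded by its own lock."""
    def __init__(self, num_threads, c=2):
        self.queues = [[] for _ in range(c * num_threads)]
        self.locks = [threading.Lock() for _ in self.queues]

    def _acquire_random_unlocked(self):
        # Never block on a held lock: retry random queues until one is free.
        while True:
            i = random.randrange(len(self.queues))
            if self.locks[i].acquire(blocking=False):
                return i

    def insert(self, key, item):
        i = self._acquire_random_unlocked()   # first available (unlocked) queue
        heapq.heappush(self.queues[i], (key, item))
        self.locks[i].release()

    def delete_min(self):
        i = self._acquire_random_unlocked()   # two queues, to smooth the
        j = self._acquire_random_unlocked()   # distribution of elements
        candidates = [q for q in (self.queues[i], self.queues[j]) if q]
        best = min(candidates, key=lambda q: q[0]) if candidates else None
        result = heapq.heappop(best) if best is not None else None
        self.locks[i].release()
        self.locks[j].release()
        return result

# With num_threads=1 there are exactly two queues, so delete_min locks
# both and the result happens to be the global minimum.
mq = MultiQueue(num_threads=1)
for k in [5, 1, 9, 3]:
    mq.insert(k, f"state-{k}")
print(mq.delete_min())  # → (1, 'state-1')
```

With more queues than locked samples, delete_min only sees two queues' minima, which is exactly the relaxation described above: it may return a non-minimal element, but it never waits on a held lock.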

2.2.4 Lock-free k-LSM Relaxed Priority Queues

Another implementation of the concurrent priority queue is the relaxed, lock-free priority queue [37]. This implementation uses a logarithmic number of sorted arrays containing keys, similar to the log-structured merge trees (LSM) that are commonly used


in databases [27]. The main idea of this implementation is that it relaxes the basic operations such as insert and DeleteMin. For example, the DeleteMin operation is allowed to delete any of the k+1 smallest key values visible to all the threads, where k is a parameter determined at run time. This implementation is referred to as the k-LSM priority queue; it combines the concepts of relaxation and LSM tree arrays.

LSM trees maintain the data in two or more separate structures (trees) [27]. The data in the separate structures can be synchronised efficiently. Typically, this implementation stores the smaller trees in memory (DRAM) and the larger trees on disk. The insert operation is applied to the small tree, and when the size of that tree exceeds a certain threshold, some elements are removed from it and merged into the larger tree.

Usually, LSM trees are represented with multiple levels (l) and blocks (b). Each level has one block, and the block of level l can store n elements, where 2^(l−1) < n < 2^l. When two levels have the same number of elements, because of deletions or insertions, they are merged together, creating a new level. Also, if a level contains a small number of elements, it is shrunk to a lower level. An example of an insert and merge operation is shown in Figure 2.7. The blocks in Figure 2.7 are shown as rectangles; different levels can be distinguished by the lines joining the blocks. Level 0 consists of the block with value 13, Level 1 consists of the blocks with values 11 and 4, and so on. As can be seen from the figure, a single block with value 8 at level 0 is appended to the LSM. Merges are then performed from the end until no two blocks of the same level exist.

Figure 2.7: Insert operation using merging in LSM queues [27] (this figure presents the insert operation of LSM queues, but this specific example may never occur in real situations)

The advantage of LSM trees is that each level can be stored as an array, exposing a higher number of sequential memory accesses and therefore good cache efficiency. In addition, because each array is sorted, the merge can be done in linear time, more specifically O(n), where n is the number of elements of the larger array participating in the merge operation.
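The linear-time block merge can be sketched directly: a single pass over two sorted arrays of the same level produces one sorted block one level up. The values below are illustrative (ascending order is assumed for simplicity).

```python
def merge_blocks(a, b):
    """Merge two sorted blocks into one sorted block in O(n) time,
    where n is the total number of elements."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):      # single pass over both blocks
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])                     # at most one tail is non-empty
    out.extend(b[j:])
    return out

print(merge_blocks([4, 11], [8, 13]))    # → [4, 8, 11, 13]
```

This is the step repeated "from the end" after an insert: whenever two blocks of the same level exist, they are merged into a block of the next level.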

The authors of [37] propose three versions of k-LSM priority queues. Two of them are the shared k-LSM priority queue and the distributed k-LSM priority queue. The third one is a combination of the first two. In the shared k-LSM priority queue, all the threads can access the array of pointers to blocks in the LSM through a single pointer. Each update to the LSM queue atomically replaces the pointer to the array by a pointer to a new array containing all the modifications.


On the other hand, in the distributed k-LSM priority queue each thread has its own local k-LSM priority queue for applying the different operations. When the priority queue that belongs to a thread is empty, the thread peeks an element from another thread's queue without removing that element. This technique is referred to as spying.

Shared k-LSM queues and distributed k-LSM queues can be combined into a “hybrid priority queue”. This new, hybrid priority queue consists of a distributed LSM priority queue per thread, a single shared k-LSM priority queue and an array used for spying. During the insert operation, the new element is added to the distributed LSM queue. If, after performing all the necessary merges, the resulting block is bigger than a certain size, then the block is added to the shared k-LSM priority queue.

On the other hand, during the DeleteMin operation both the shared and the distributed priority queues are checked for the minimum value. The smaller of the two minima is marked as deleted. In the case that both queues are empty, spy is called.

2.3 A* Search Problems

This thesis implements a variation of the A* search algorithm using the MultiQueues algorithm, and evaluates this implementation according to the performance gains on different applications.

This Section describes the two problems that are used for the evaluation of our implementation. The first problem, referred to as the Art Gallery Problem (AGP) [21], is to find the minimum number of guards needed to protect an art gallery. The other problem, referred to as TSP [8, 24], is to find the shortest path from one city to another by visiting each city exactly once.

After our implementation we found that the AGP is not efficient to implement with our variation of A*, and for this reason we do not go into detail about this problem.

2.3.1 The Art Gallery Problem (AGP)

The Art Gallery Problem is to determine the minimum number of guards that are needed to protect an art gallery. There are various variations of the original problem, which was proposed by Victor Klee in 1973.

The problem takes as input a 2D map (floor plan) of the art gallery and returns a 2D map with the positions of the guards on it. The guards have to protect the whole gallery. Besides the theoretical interest, there are some practical problems that can be solved using an AGP solution, such as guarding a place with the minimum number of security cameras or illuminating a place with few lights.


The work in [12] summarizes the effort that has been done on the art gallery problem.

The solution of the Art Gallery Problem is described in the following steps [11]:

1. The place has to be modelled as a monotone polygon. A polygon is monotone with respect to a line ℓ if its intersection with every line perpendicular to ℓ is a line segment, a point, or empty. For the purpose of this work we are interested in y-monotone and x-monotone polygons. For y-monotone polygons the line ℓ is the y-axis, and for x-monotone polygons the line ℓ is the x-axis. Figure 2.8 shows a y-monotone polygon and Figure 2.9 shows an x-monotone polygon.

In the case that the polygon is not monotone, it is split into multiple monotone polygons. This can be done in O(n) space complexity, where n is the number of vertices in the polygon.

Figure 2.8: Example of a y-monotone polygon; the dashed line shows the intersection of the polygon with line ℓv

Figure 2.9: Example of an x-monotone polygon; the dashed line shows the intersection of the polygon with line ℓv

2. All the monotone polygons have to be triangulated. This can be done by drawing diagonals between pairs of vertices only in the interior of each polygon. Based


on a theorem presented in [11], every simple polygon admits a triangulation, and any triangulation of a simple polygon with n vertices consists of exactly n − 2 triangles. A monotone polygon can be triangulated in O(n) time, so the overall time complexity starting from a simple polygon to a triangulation is O(n log n), where n is the number of vertices. Figure 2.10 shows a monotone polygon that is triangulated.

Figure 2.10: Example of a triangulated polygon

3. As soon as the triangulation of the polygon is done, the minimum number of guards can be computed. This can be done using the 3-colouring method. This method uses only three colours to colour all the vertices. The colours are chosen in such a way that no individual triangle has two vertices of the same colour. Finally, the guards are placed on the vertices of one particular colour only. Since each triangle has exactly 3 different colours, choosing one of them gives us one guard for each triangle. This placement ensures that the guards can see the whole gallery.

A solution of the AGP can be found using the Breadth-First Search (BFS) algorithm. Although the A* search algorithm can be used to solve this problem, it cannot reach a better solution than the BFS algorithm.

2.3.2 The Travelling Salesman Problem (TSP)

TSP is one of the most well-studied problems in computer science and operational research. Since 1970, a large amount of research has focussed on solving the travelling salesman problem in more efficient and optimal ways.

TSP problem statement: given n cities, find the optimal path that passes through all n cities exactly once and returns back to the start city with the minimum cost. Cost is defined as the cost to travel from city n1 to city n2 (where n1 and n2 can be any combination of cities).

The computational complexity of the TSP is NP-Hard [14, 15]. Problems in the NP class (NP-complete problems also belong to the NP class) can have a solution verified in polynomial time. For example, as mentioned before, the 3-coloring problem belongs to the NP-Complete class. Given a candidate solution of the 3-coloring problem it can be verified immediately whether it is a correct solution or not: one has to check that there is no edge connecting 2 vertices of the same color.

For the TSP, by contrast, to decide whether a provided solution is the optimum the verifier needs to check all of the possible solutions.

The Travelling Salesman Problem can be solved with the A* algorithm for small instances. The problem is to determine the shortest path from a starting city to a destination city by passing through all the cities in the path exactly once. TSP can be modelled as a weighted graph, such that the cities are the vertices, paths are the graph's edges, and a path's distance is the edge's length. Each vertex has a connecting edge with all the other vertices.

There are two kinds of TSP problems, the asymmetric and the symmetric. In the symmetric TSP, the distance between two vertices (cities) is the same in each opposite direction, and all the cities are directly connected with each other. On the other hand, in the asymmetric TSP, the distance might be different in the two directions between two cities, and it is not necessary that all the cities are directly connected with each other.

The TSP is an NP-Hard problem, which means that an algorithm computing a solution of this problem in polynomial time is known only for a non-deterministic Turing machine [14, 15].

Our intention in this thesis is to use the TSP as a case study for the proposed variation of the A* search algorithm.


Chapter 3

Implementation

This Chapter presents the framework developed in this work. The main goal of the framework is to provide an interface for an A* implementation that runs on Wee Archie. A second goal is to monitor Wee Archie's resources while the framework is in use. The framework can also serve educational purposes related to the Raspberry Pi, illustrating the use of parallel programming across a cluster in a way that is accessible for education and training.

In summary, this chapter first introduces the Raspberry Pi and the Wee Archie cluster. Secondly, it introduces the A* framework. It then presents the application that will be used for the monitoring and, finally, it introduces the different monitoring techniques used in this work.

3.0.1 Raspberry Pi & Wee Archie Cluster

In this implementation we use the Wee Archie cluster developed by EPCC. Wee Archie consists of 18 Raspberry Pi 2 boards. Each Raspberry Pi has a CPU with 4 cores running at 900 MHz, a GPU, 1 GB of RAM and an 8 GB MicroSD card for storage. Wee Archie's Raspberry Pi boards are connected by Ethernet and 3 switches. To operate, the cluster needs 61 watts with the LEDs and 40 watts without them. One of the most interesting features of the Raspberry Pi 2 processor is that it follows the SoC (System on Chip) architecture and gives the opportunity to experiment with SoC systems.

MPI can run on any system with a supporting implementation, from commodity personal computers to cluster computers. Certainly, it is preferable to run MPI on a cluster computer, with many cores, to take more advantage of the parallelization. Getting access to a cluster computer is relatively expensive; building a relatively good cluster of 18 CPUs (like Wee Archie) will cost more than 10000 British pounds.


3.1 Proposed Framework

In this Section we introduce the proposed framework, which provides an efficient parallel variation of the A* algorithm on a specific platform. This implementation can reduce the time needed to execute the main operations of the A* algorithm. The framework also gives new programmers the opportunity to exercise their programming skills.

3.1.1 The main algorithm

The implemented algorithm is a variation of the A* algorithm. Algorithm 2 shows simplified pseudocode of our proposed algorithm.

Algorithm 2 Framework main algorithm pseudocode (simplified version)
Input: Problem, initialized open list (openList)
Output: Problem solution

 1: Node currentState
 2: while (currentState ← openList.pop()) ≠ NULL do
 3:   if isSolution(currentState) then
 4:     return currentState
 5:   else
 6:     newStates[] ← expand currentState
 7:     for each new in newStates do
 8:       openList.add(new)
 9:     end for each
10:   end if
11: end while
12: return NULL   // No solution was found

This implementation behaves like a best-first A* search but without using the closed list. The closed list consists of all the nodes that have been investigated and visited. In this implementation we decided not to keep a closed list, in order to minimize the memory overheads on the platform. To evaluate our framework implementation we use a Raspberry Pi platform, which is a memory-limited system. Maintaining a closed list on a device like the Raspberry Pi would be a large memory bottleneck, since the Raspberry Pi has only 1 GB of RAM. So, by not keeping the closed list we reduce the memory overhead.

Although not keeping a closed list can reduce the memory overheads, the missing closed list makes the framework unsuitable for some problems. For example, problems with repeated states (such as the 9-puzzle) need a closed list to record the states that have been visited, to avoid any repetition of the same state.

The design goal of the proposed algorithm is to make the available processes converge towards the solution. The work sharing between the processes has 2 phases.


1. The processes communicate to decide which process has the most promising state¹.

2. The process with the most promising state sends states derived from it to the other processes.

The decision of which process will send states to the other processes is made by executing a non-blocking all-reduce (MPI_Iallreduce()) operation. MPI_Iallreduce() is an expensive communication operation [16]; its cost is presented in Equation 3.1.

⌈log₂ p⌉ α + 2 ((p − 1)/p) n β + ((p − 1)/p) n γ    (3.1)

p: the number of processes.

α: the start-up time (latency) of sending a message.

β: the inverse of the bandwidth (transfer time per byte).

γ: the computation cost per data item of the reduction operation.

n: the amount of data items.

Each processor is responsible for sharing its rank and the heuristic value of its best state. This is used to find the processor with the best heuristic value. To do this the processors use a reduction function. The reduction function compares the heuristic values of all the processors and returns the rank of the processor with the best heuristic value. Then the processor with the best heuristic value sends its states to all the other processors. Figure 3.1 is an example of the reduce operation performed by the framework. In the example, four processes perform a reduce operation. Each of them sends its rank and the heuristic value of the best state picked from its queues. The reduction function compares the heuristic values and returns the rank and the best heuristic value; in the example it returns rank 0 and the value 5. Afterwards rank 0 will pick some of its states and send them to the other processes.

Figure 3.1 shows four (rank, value) pairs, (0, 5), (1, 14), (2, 15) and (3, 37), entering the reduce function, after which every rank holds (0, 5).

Figure 3.1: Example of the framework reduce operation

1promising state = the highest priority state


Figure 3.2: This example shows a situation where PE1 has locked the access to Q1

The framework implements the open list of A* using a MultiQueue algorithm that assumes a set of relaxed operations. Because of the relaxed operations, the reduce-all operation is also a relaxed operation.

The framework not only takes advantage of the available processes but also takes advantage of multi-threading. To share work among threads the algorithm uses the MultiQueue data structure presented in Section 2.2.3. There are several reasons why we chose MultiQueues for the implementation among multiple threads and multiple processes. One reason which led to the choice of MultiQueues is its independence from the underlying priority queue data structure.

Figures 3.2 and 3.3 show examples of situations that can occur at framework run time.

Figure 3.2 shows that PE1 has locked the access to Q1. Assuming that PE2 wants to push an element to the queues and chooses to push to Q1, it will try to obtain the lock of Q1 but it will fail. After its failed attempt it will choose another queue and try to lock it. It will choose queues at random until it successfully locks one of the available queues.

Figure 3.3 shows that PE1 has locked the access to Q1 and Q2. If PE2 wants to get an element from the queues, its operation will complete when it manages to lock two queues. In the situation of Figure 3.3, its operation will complete when it manages to lock Q3 and Q4.

In other words, the MultiQueue can use any serial priority queue implementation. This gives users the flexibility to implement their own priority queue data structure and supply it to the framework. To invoke the necessary functions, the priority queues have to be consistent with the priority queue interface described in Figure 3.4.

Other important reasons that led to choosing the MultiQueue data structure over the other data structures (k-LSM, SkipList, SprayList) are:

SprayList: SprayLists, as shown in [7], are not as scalable as MultiQueues. Although the proposed implementation is not intended to use more than 3 threads (because the Raspberry Pi CPU has only 4 cores), we need to have scalability for usage


Figure 3.3: This example shows a situation where PE1 has locked the access to Q1 and Q2

on other platforms. This makes our implementation reusable, customizable and expandable. Also, SprayLists do not free the memory of removed elements (a scalable cleanup for the SprayList is an open problem), and since the memory requirements of A* grow exponentially (O(b^d)), this data structure would cause the Raspberry Pi to run out of memory very early.

k-LSM: Although the k-LSM priority queue algorithm performs better than MultiQueues on certain tests, the k-LSM implementation is much more difficult² than that of MultiQueues. Also, part of the k-LSM performance relies on the LSM data structure. The LSM data structure, described in Section 2.2.4, provides cache efficiency. This characteristic can also be added to the MultiQueue algorithm if the underlying data structure is a serial cache-oblivious priority queue [9] or a priority queue based on Van Emde Boas trees [36].

Algorithm 3 presents the main functionality of the implemented algorithm. The input arguments of the function are provided by the class. The function stores the best found solution in a class field. The processItem function accesses the MultiQueue to get items and then processes them. The processing of an item consists of two stages: the first stage checks whether the item is a solution and, if so, returns it; otherwise the second stage generates the successor states of that item and adds them to the queues. Algorithm 3 presents the version where the master thread executes only the communication part of the program. There is also the option to set the master thread to execute both communication and processing.

²There are code metrics to measure an implementation's difficulty. The Open Web Application Security Project (OWASP) [26] describes some of these development metrics from the software security perspective; some of them, like Lines of Code (LOC), Function Points and Cyclomatic Complexity [25], can be used to measure implementation complexity. The current work does not apply any of these metrics, but the pseudocode provided by the papers proposing MultiQueues and k-LSM makes it obvious that the MultiQueue implementation is easier.


/**
 * Priority Queue functions are consistent with the equivalent
 * functions from C++ Standard Library Templates for priority queue
 */

/**
 * Adds an item to the priority queue
 */
void push(Type data) { ... }

/**
 * Removes the highest priority item of the priority queue, and
 * deallocates that item's memory
 */
void pop() { ... }

/**
 * Returns the highest priority item of the priority queue
 */
Type top() { ... }

/**
 * Removes the highest priority item from the priority queue and
 * returns it to the caller
 */
Type move() { ... }

/**
 * Returns true if there are no items in the priority queue,
 * otherwise returns false
 */
bool empty() { ... }

/**
 * Returns the amount of items in the priority queue
 */
int size() { ... }

Figure 3.4: Priority Queue Interface


Algorithm 3 Framework main algorithm pseudocode (simplified version)

 1: if tid is not the main thread then
 2:   while no stop signal received do
 3:     processItem()
 4:   end while
 5: else
 6:   while no stop signal received do
 7:     if master thread is processing then
 8:       processItem()
 9:     end if
10:     communicate with the other processes
11:   end while
12: end if

3.1.2 Framework Interface

The framework was implemented as a C++ template class called Astar_Framework. The template class takes as generic arguments a class that describes the problem (the problem class), a class that supports the problem class (the support class), a comparator for the problem class, the type of the priority queues, and an optional class argument for sharing space between threads in the user scope.

Through the Astar_Framework constructor the user supplies the class with a priority queue that contains the initial states, the MPI_Datatype that will be used to send/receive problem states, the initial instance of the support class, the constant number c required by the MultiQueue algorithm, the number of threads intended to be used beyond the master thread (consumer threads), and a boolean value that states whether the master thread will also consume states. The framework starts running the A* algorithm when the user calls the function void AF_Run(), and returns the best state found through the Type AF_Solution() function.

Problem class The Problem class describes the problem states. It must have only data members, defined by the user to describe the different states of the problem, and must follow the rule of three [5] of C++. The user has to create an MPI_Datatype to describe objects of this class, because those objects will be used by the framework's send and receive operations.

Support class The support class has to provide the framework with the functions necessary to handle the Problem class objects. The descriptions of those functions can be found in Figures 3.5 and 3.6.


Comparator class The comparator class follows the C++ concept for comparators. It is used to compare the problem states.

/**
 * Empty constructor of the class with empty body
 */
Support() { }

/**
 *
 */
Support& operator=(const Support& other);

/**
 * Class destructor to deallocate memory allocated by the
 * Support class. The framework implicitly creates new Support
 * objects that will be used by the different threads (one for
 * each thread).
 */
~Support();

/**
 * The framework informs each thread about its thread id.
 */
void setTid(const int id);

/**
 * Checks if 'state' is a problem solution and returns
 * true if it is, otherwise returns false. The stop variable
 * gives the user the flexibility to continue searching for
 * solutions after the first solution.
 * (This can be applied to NP-Hard problems)
 */
bool isSolution(const State_class *state, bool &stop) const;

/**
 * Generates the successors of the state and returns
 * them using the successorsArray. The length of the
 * array is returned using the successors variable.
 * It is entirely up to the user how the new states
 * are generated.
 */
void generateSuccessors(const State_class *state,
                        State_class **successorsArray,
                        int &successors) const;

Figure 3.5: Support Class interface


/**
 * Informs the framework about the expected max number
 * of states that can be generated from a single state.
 */
int getMaxExpand();

/**
 * Returns the heuristic value of the state.
 */
int getHeuristic(const State_class *state) const;

/**
 * Returns the cost of the state.
 * The heuristic function is:
 *   f(n) = g(n) + h(n)
 * The cost is g(n).
 */
int getCost(const State_class *state) const;

Figure 3.6: Support Class interface cont.

3.1.3 Framework test cases

The first attempt to test the implemented framework was made using the AGP. However, the AGP was not a suitable problem for our implementation of the A* algorithm, because the A* algorithm cannot give better results than a Breadth-First Search. The second attempt, and the actual evaluation, was made using the TSP. The work done on the AGP is presented in Section A. The TSP problem itself was presented in Section 2.3.2.

TSP Test case

The TSP problem was described in Section 2.3.2. The current section describes our implementation approach to the TSP.

Some important definitions:

g(n): The distance from the start city to the current city.

h(n): The estimated cost to complete the TSP walk. It is calculated as the distance to the nearest city which is not visited yet, plus the cost of a Minimum Spanning Tree (MST) starting from that unvisited city and containing all the unvisited cities, plus the distance from the unvisited city closest to the start city back to the start city.

f(n): g(n) + h(n)

All the parameters of the proposed classes are described in detail below:


The Problem class contains one array of size equal to the number of cities, one field to store the number of visited cities, and an array to store the heuristic values. The first n array indexes (where n is the number of visited cities) contain the visited cities, and the indexes beyond n store the unvisited cities. This is done in order to minimize the calculations needed to identify the unvisited cities for the MST.

The support class contains all the functions necessary to support the calculations needed to solve the TSP, like calculating the MST and reading the initialization data, as well as the functions needed by the framework, such as checking for solutions and finding the successors of a state.


Chapter 4

Monitoring

To monitor system resource performance, the nmon (Nigel's Monitor) [3] tool was used. nmon is a performance monitoring tool that can be used by system administrators. It was developed by Nigel Griffiths while he was working for IBM.

nmon offers two modes for collecting data: it can capture the data and display them live on screen, or it can store the data in a CSV (Comma Separated Values) file for later analysis. It can capture a variety of data from the system; for the purpose of this work, we are particularly interested in CPU utilization, memory usage and network traffic.

nmon was chosen because it covers our monitoring needs (as mentioned, this work focuses on CPU utilization, memory usage and network traffic) and because nmon has a specific version which supports ARM processors.

4.1 Methodology

For the current work we used nmon to monitor the cluster while it was running the TSP test case presented in Section 3.1.3. This chapter presents monitoring data collected from the cluster, first while the cluster is in idle mode and then while the TSP test case is running under different configurations.

The framework presented in Chapter 3 achieves the parallelization by splitting the workload across different processes, and the workload of each process is shared across the process's threads. The framework user can configure the number of threads in use, and the operation mode of the master thread of each process in the search: either as a communicator, or as a communicator and worker. The MultiQueue approach to implementing parallel priority queues across an individual process's threads offers relaxed operations, and the number of queues can affect the execution time and the resources used to reach a solution. The number of queues used by MultiQueues is calculated as the number of process elements accessing the MultiQueue × a constant number. The constant number can be set by the user. In the different monitoring sessions the application was also tested with different numbers of priority queues.


For each monitoring session the application was run 30 times. After the first run, each subsequent run started 10 seconds after the previous one. nmon was set to collect data every second.

The presented data were collected for offline analysis. The cluster was monitored while it was running the TSP solver, with nmon collecting data. To analyse the data, the NMONVisualizer [4] software was used.

4.2 Monitoring sessions Analysis

4.2.1 Initial data from idle state

Figures 4.1, 4.2 and 4.3 show the CPU usage, memory usage and network activity of Wee Archie's Raspberry Pis respectively. As can be seen, when the Raspberry Pis are in idle mode they use less than 5 percent of the CPU, they need around 50 MB for the running processes, and the average observed network traffic is 0.8 KB/s for read operations and 1.1 KB/s for write operations. The standard deviation of the average observed traffic is

Figure 4.1


Figure 4.2

Figure 4.3

4.2.2 Monitoring test case 1

The first monitoring test case was obtained using 2 threads to process TSP states and one more thread to perform the program's communications.


First monitoring session

For the first monitoring session two different scenarios were tested: in the first, the c of MultiQueues was set to 5, and in the second it was set to 8. The average run time of the 30 runs (for each scenario) was 5.8927 seconds and 6.3551 seconds respectively. There is a difference of approximately 0.5 seconds between the two scenarios, but other than that the monitoring data showed many similarities, so only the monitoring data from the first scenario are presented.

Because the CPU usage graph for the whole cluster does not reflect what is really happening on each individual Raspberry Pi, only CPU usage charts for the cores of "raspberrypi01" are provided.

From Figures 4.4, 4.5, 4.6 and 4.7 it can be seen that each time the test case runs, the CPU usage of each core increases up to 100 percent overall. For the first run it can be seen that the usage of cores 2 and 3 runs up to 100 percent and they are only running user jobs. It can be assumed that cores 2 and 3 are running the consumer threads. Core 1 usage is slightly over 65 percent, running jobs from user and system. Core 4 usage runs up to 100 percent, running jobs from user and system. It can be assumed that cores 1 and 4 are running the jobs related to the system operations and executing the MPI system calls from the communication thread.

Figure 4.4: Raspberry Pi 1 CPU core 1


Figure 4.5: Raspberry Pi 1 CPU core 2

Figure 4.6: Raspberry Pi 1 CPU core 3


Figure 4.7: Raspberry Pi 1 CPU core 4

Figure 4.8 shows the memory usage of Raspberry Pi 1. It can be seen that each time the test case runs the memory usage increases, and when the application stops the memory usage decreases. The memory usage differs between runs, although the same application is running. This can be explained by the MultiQueue algorithm and the MPI message exchange. The MultiQueue algorithm has relaxed operations and cannot guarantee that a pop operation returns the minimum item across the queues. Also, it cannot guarantee that running a program that uses the MultiQueue twice will repeat the pop and push operations on the same queues, because the queue where each operation takes place is selected randomly. Nor is it guaranteed that MPI messages will be sent and received in the same order twice.


Figure 4.8: Memory usage of Raspberry Pi 01

Figure 4.9 shows the network traffic of the Wee Archie nodes. It can be seen that the read and write volumes are not the same across the whole cluster. The data presented in Figure 4.9 contain outliers: because the cluster is almost inactive for 10 seconds between runs, the average and standard deviation are biased. Indicatively, the highest read standard deviation was obtained from Raspberry Pi 10, at 54.845, and the lowest from Raspberry Pi 13, at 28.876. The highest write standard deviation was obtained from Raspberry Pi 04, at 23.326, and the lowest from Raspberry Pi 02, at 8.209. The standard deviation shows the divergence in network traffic for each Wee Archie node. This divergence arises because the collected data contain many outliers, created by the 10-second pause between runs; the nodes with the highest network activity (read or write) are affected the most.


Figure 4.9: Wee Archie network traffic

4.2.3 Second monitoring session

For the second monitoring session 3 threads were used: two of them were only consuming states, and the third was consuming states and executing the program's MPI calls. Three different configurations of the MultiQueue constant c were tested: c = 3, c = 4, and c = 5. The average execution time over 30 runs for each was 5.1134, 5.1325 and 5.6532 seconds respectively. Because the monitoring data obtained are similar, only the data from the first scenario are presented.

Figures 4.10, 4.11, 4.12 and 4.13 make it more obvious than before that three of the four processor cores are executing user jobs, and that one of those three is also executing the necessary system calls. From these figures it appears that there is room to execute the application using four threads instead of three. However, when this was tested with 4 threads (4 consumers, one of which also handled the communication), it ran 4 times and after that the MPI Process Manager stopped responding, probably because the 4 threads blocked the MPI Process Manager from getting processor time, so its requests failed with time-outs.


Figure 4.10: Raspberry Pi 1 CPU core 1

Figure 4.11: Raspberry Pi 1 CPU core 2


Figure 4.12: Raspberry Pi 1 CPU core 3

Figure 4.13: Raspberry Pi 1 CPU core 4

Figure 4.14 shows the memory usage on each cluster node. It can be seen that there are small differences in memory consumption between nodes. Figure 4.15 shows the memory consumption of Raspberry Pi 1, which follows the same pattern as in the previous monitoring session.


Figure 4.14: Wee Archie Memory usage

Figure 4.15: Memory usage of Raspberry Pi 1, at monitoring session 2

Figure 4.16 shows the network traffic of Wee Archie. It can be seen that the read operations are balanced across the cluster, but the write operations show significant differences between nodes. The imbalance of the write operations is reasonable: after the framework executes MPI_Iallreduce(), the reduce function indicates which node has to send states to another node, and as Figure 4.16 shows, some of the processes send more data than others; which process sends data, and how much, changes unpredictably between runs. Table 4.1 shows the operations performed by each process rank for the first 6 runs. It can be seen that the number of send operations performed by each process diverges strongly between ranks. The number of receive operations per process also differs, but the differences are not as big as for the sends. These differences in read/write traffic can explain the different network traffic observed on different nodes.

Indicatively, the highest and lowest standard deviations observed are:

highest read standard deviation    Raspberry Pi 05   144.800
lowest read standard deviation     Raspberry Pi 15   131.476
highest write standard deviation   Raspberry Pi 04   345.619
lowest write standard deviation    Raspberry Pi 15    99.722

Figure 4.16: Network usage of Raspberry Pi 1, at monitoring session 2


Run  Operation             1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16  std dev
 1   MPI_Isend          2545    244   8337  12280   7825   1903   6784   4292    339   2942   6692   3801    922   1306   1592    280  3544.05
 1   MPI_Iallreduce     1431   1431   1431   1431   1431   1431   1432   1431   1431   1431   1432   1431   1431   1431   1431   1431     0.34
 1   State Receive      3962   4119   3549   3311   3646   4018   3664   3836   4121   3983   3703   3894   4080   4061   4025   4113   240.53
 2   MPI_Isend           865    138    329   5760   4480     34   4503   1375    425   1009   6937   1669    934    728    120    938  2213.99
 2   MPI_Iallreduce      719    719    719    720    719    719    719    719    720    720    719    720    720    720    720    720     0.51
 2   State Receive      1950   1993   1975   1618   1722   2028   1717   1935   1992   1931   1547   1908   1982   1985   2008   1956   150.65
 3   MPI_Isend          2145    510  10746  10319   9337    278   8876   6021   1015   2579  11835   4548   3522   5416    215    476  4170.39
 3   MPI_Iallreduce     1703   1703   1704   1703   1703   1703   1704   1703   1703   1703   1703   1703   1703   1703   1703   1703     0.34
 3   State Receive      5040   5111   4411   4477   4565   5135   4600   4813   5167   5040   4430   4904   4974   4834   5175   5135   280.23
 4   MPI_Isend          1932   5663   8248  17788  11737    156   5936   4624   1199   1503   8001   2635   2108   2236    603    640  4815.73
 4   MPI_Iallreduce     1800   1800   1800   1800   1800   1800   1800   1800   1800   1800   1800   1800   1800   1800   1800   1800     0.00
 4   State Receive      4845   4600   4445   3851   4305   5024   4646   4702   4922   4910   4423   4776   4828   4867   4963   4915   307.06
 5   MPI_Isend          1506    663   8106  20026   3969   2126   2151    222    605   1344   4125   2375    925    610    184    403  4956.50
 5   MPI_Iallreduce     1700   1700   1700   1700   1700   1700   1700   1700   1700   1700   1701   1701   1700   1700   1700   1701     0.40
 5   State Receive      3157   3214   2748   1952   3098   3172   3181   3272   3239   3189   2992   3167   3250   3261   3262   3208   329.19
 6   MPI_Isend          1537    552  14752   6380  10581   6615  10663   6074   1017   4603  15458   5801   2122   1959    651    591  4981.35
 6   MPI_Iallreduce     1915   1915   1916   1916   1915   1915   1916   1916   1915   1915   1916   1915   1916   1915   1916   1916     0.51
 6   State Receive      5848   5856   4891   5484   5277   5547   5280   5609   5927   5619   4934   5580   5854   5825   5909   5921   338.04

Table 4.1: The number of MPI_Isend, MPI_Iallreduce, and received-state operations performed by each process rank (columns 1-16) for the first 6 runs out of 30. The last column shows the standard deviation of each row.


4.2.4 Monitoring Conclusions

As the tests and the monitoring data showed, the used test case is suitable for an operational test of the machine, but not for a long-running test. It is necessary to monitor the cluster with a program that uses MPI and can utilise all of the cluster resources, but that runs for a longer time and produces fewer outliers than the current test case. The current test case produces many outliers because it does not run continuously, as mentioned in the methodology section. There was a trade-off between monitoring the cluster for one run at a time or for thirty consecutive runs, as in the performance tests, and the second approach was chosen. The collected data show that there is room to run four threads on each Raspberry Pi, but when this was tested it could not complete 30 runs like the other tests. Clearly there is room for a fourth thread, but its workload should be chosen carefully so that it leaves the Raspberry Pi operating system room to perform its system calls.

There is a divergence between the standard deviations of the two monitoring scenarios. This is because in the first scenario MPI_Iallreduce was called too often and the MPI Process Manager was not able to process the requests, causing request time-outs. To allow the test case to run with the master only consuming states, a time limit was set for MPI_Iallreduce that separates consecutive MPI_Iallreduce calls by at least 0.05 seconds. As mentioned earlier, the standard deviation is affected by the 10-second slot between consecutive runs; thus an increasing rate of messages sent by the application increases the standard deviation.


Chapter 5

Conclusions

5.1 Summary

During this work a framework was developed that provides users the flexibility to implement solutions to their own problems and run them on a distributed-memory machine with minimal parallelisation effort. The framework was tested using the TSP problem as a test case. It was evaluated on the Raspberry Pi cluster, with system performance monitored using the nmon tool; the collected data were visualised and analysed using the NMON Visualizer provided by IBM.

5.2 Further Work

To increase the functionality of the framework, a shared space between the process threads, accessible to the user, can be added. The user will be able to declare a struct with the members necessary to cover her needs, and the framework will provide access to it through the framework interface function arguments. This functionality has already been tested but is not included in the final version of the code yet.

During its development the framework was tested on a shared-memory machine of 64 cores, and the implementation showed better scaling when it ran using only a large number of threads. This is reasonable, because the synchronisation of processes adds an overhead to the whole program. If the implementation were executed using threads only, the state-sharing part of the algorithm would be done entirely by the MultiQueue algorithm.

As the test case showed, it is necessary to monitor the cluster using an application that can run for a longer time. Because the TSP test case ran for only 5 to 6 seconds and each run started 10 seconds after the previous one, the cluster network was idle between runs and the network traffic dropped significantly, so the standard deviation of network traffic was affected and diverges from the average value.


As mentioned in Section 2.2.4, the advantage of the k-LSM priority queue is the LSM data structure, which provides cache efficiency. It would be interesting to test the framework using a cache-oblivious priority queue and see its effect on performance and memory usage.

As the TSP test case showed, it is necessary to investigate approaches that can reduce the memory usage of the algorithm. Currently the memory growth of the test case does not allow us to solve problems bigger than 24 cities.

Another possibility for gaining performance is to move part of the computation to the Raspberry Pi's GPU. For example, in the TSP test case an MST is calculated before each new state generation, and there are MST algorithms that can run on GPU processors [28]. It would be interesting to measure the performance of the test case in that configuration, because the Raspberry Pi's processor is a SoC, so memory transfer from CPU to GPU is cheaper than on systems that use buses outside the CPU to communicate with the GPU.

5.3 Conclusions

The project succeeded in delivering an A* framework, running on Wee Archie, which can be used to solve a search problem using multiple processes and threads. Because Wee Archie nodes have only 1 GB of RAM, the platform is memory sensitive, and this exposes the need to develop a memory-aware A* search if we want to utilise the processing capabilities of memory-limited devices.

The work done on monitoring Wee Archie shows that there is scope for further investigation of its capabilities. The program (the TSP test case) used to exercise Wee Archie resources produced numerous outliers that affect the resulting analysis. To carry out performance monitoring of Wee Archie it is necessary to have an application that is not highly dependent on the available memory and that can use 100 percent of at least 3 cores of each Raspberry Pi.


Appendices


Appendix A

AGP Test case

This section describes a solution, using the proposed implementation, for an important part of the AGP: the 3-colouring problem of the AGP.

To develop the necessary software, a package was needed that can display an AGP instance and operate on it. For this, CGAL [2] (the Computational Geometry Algorithms Library) was used combined with the Qt Framework: CGAL performs the triangulation of the art gallery, and the Qt Framework provides the means to display the problem.

Figure A.1 shows the map of a church (the data were obtained from [35]); we will refer to this map as the initial polygon. Because the initial polygon is neither x- nor y-monotone, it had to be divided into smaller monotone polygons. The division of the initial polygon into monotone polygons is shown in Figure A.2; the resulting polygons are y-monotone. Monotone polygons were needed because they can be triangulated in linear time. The triangulation of Figure A.2 is shown in Figure A.3, which can also be presented as a graph.


Figure A.1: The initial polygon


Figure A.2: The initial polygon split into y-monotone polygons


Figure A.3: The triangulation of the initial polygon

A.0.1 Why AGP is not suitable as an A* test case

Assume that an A* search starts from node µ, as shown in Figure A.4. In the triangle of node µ there are three vertices that can be coloured, and assume that they have been coloured as in Figure A.5. The A* search expands node µ, produces the new node ν, adds ν to the open list and µ to the closed list. The algorithm then takes ν from the open list and colours the vertices associated with ν that are not yet coloured. The only uncoloured vertex associated with ν is v1, as Figure A.6 shows, and the only available colour for v1 is grey. This pattern continues until the whole AGP instance is coloured.

If the above example were run using the A* search, the heuristic value would be the order in which the nodes are visited, which reduces the A* search to a BFS search [31].


Figure A.4: The A* search is ready to start from node µ

Figure A.5: Colours set for the vertices of the triangle that contains µ

Figure A.6: A* search is ready to explore node ν and colour vertex v1


Bibliography

[1] Binary search trees and skip lists.

[2] The computational geometry algorithms library. Accessed: 2016-08-24.

[3] nmon. http://nmon.sourceforge.net/pmwiki.php. Accessed: 2016-08-24.

[4] NMONVisualizer. http://nmonvisualizer.github.io/nmonvisualizer/. Accessed: 2016-08-24.

[5] The rule of three c++. Accessed: 2016-08-24.

[6] MPI: A Message-Passing Interface Standard, Version 3.1. Message Passing Interface Forum, 2015.

[7] Dan Alistarh, Justin Kopinsky, Jerry Li, and Nir Shavit. The SprayList: A scalable relaxed priority queue. ACM SIGPLAN Notices, 50(8):11–20, 2015.

[8] David L Applegate, Robert E Bixby, Vasek Chvatal, and William J Cook. The traveling salesman problem: a computational study. Princeton University Press, 2011.

[9] Lars Arge, Michael A Bender, Erik D Demaine, Bryan Holland-Minkley, and J Ian Munro. Cache-oblivious priority queue and graph algorithm applications. In Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing, pages 268–276. ACM, 2002.

[10] Nick Brown and Alistair Grant, EPCC. Introducing Wee Archie, November 2015.

[11] Mark De Berg, Marc Van Kreveld, Mark Overmars, and Otfried Cheong Schwarzkopf. Computational geometry. In Computational Geometry, pages 1–17. Springer, 2000.

[12] Pedro Jussieu de Rezende, Cid C. de Souza, Stephan Friedrichs, Michael Hemmer, Alexander Kröller, and Davi C. Tozoni. Engineering art galleries. CoRR, abs/1410.8720, 2014.

[13] Shantanu Dutt and Nihar R. Mahapatra. Parallel A* algorithms and their performance on hypercube multiprocessors, 1993.

49

Page 57: Raspberry Pi Application & Monitoring

[14] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.

[15] John Grefenstette, Rajeev Gopal, Brian Rosmaita, and Dirk Van Gucht. Genetic algorithms for the traveling salesman problem. In Proceedings of the First International Conference on Genetic Algorithms and their Applications, pages 160–168. Lawrence Erlbaum, New Jersey, 1985.

[16] Georg Hager and Gerhard Wellein. Introduction to High Performance Computing for Scientists and Engineers. CRC Press, Inc., Boca Raton, FL, USA, 1st edition, 2010.

[17] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2):100–107, 1968.

[18] Akihiro Kishimoto, Alex Fukunaga, and Adi Botea. Scalable, parallel best-first search for optimal sequential planning, 2009.

[19] Donald Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, 1998.

[20] Richard E Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial Intelligence, 27(1):97–109, 1985.

[21] D. T. Lee and A. K. Lin. Computational complexity of art gallery problems. IEEE Transactions on Information Theory, 32(2):276–282, 1986.

[22] Patrick Lester. A* pathfinding for beginners. 2005.

[23] Jonatan Lindén and Bengt Jonsson. A skiplist-based concurrent priority queue with minimal memory contention. In International Conference on Principles of Distributed Systems, pages 206–220. Springer, 2013.

[24] John DC Little, Katta G Murty, Dura W Sweeney, and Caroline Karel. An algorithm for the traveling salesman problem. Operations Research, 11(6):972–989, 1963.

[25] Thomas J McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308–320, 1976.

[26] OWASP. Code review metrics, 1999.

[27] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.

[28] Gopal Pandurangan, Peter Robinson, and Michele Scquizzato. Fast distributed algorithms for connectivity and MST in large graphs. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '16, pages 429–438, New York, NY, USA, 2016. ACM.

50

Page 58: Raspberry Pi Application & Monitoring

[29] Hamza Rihani, Peter Sanders, and Roman Dementiev. Brief announcement: MultiQueues: Simple relaxed concurrent priority queues. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pages 80–82. ACM, 2015.

[30] John W. Romein, Aske Plaat, Henri E. Bal, and Jonathan Schaeffer. Transposition table driven work scheduling in distributed search. In 16th National Conference on Artificial Intelligence (AAAI'99), pages 725–731, 1999.

[31] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009.

[32] John E. Savage and Markus G. Wloka. Parallel graph-embedding and the mob heuristic. Technical report, Providence, RI, USA, 1991.

[33] Nir Shavit and Itay Lotan. Skiplist-based concurrent priority queues. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International, pages 263–268. IEEE, 2000.

[34] H. Sundell and P. Tsigas. Fast and lock-free concurrent priority queues for multi-thread systems. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International, pages 11 pp., April 2003.

[35] Davi C. Tozoni, Pedro J. de Rezende, and Cid C. de Souza. The Art Gallery Problem project, 2013. www.ic.unicamp.br/~cid/Problem-instances/Art-Gallery/AGPPG.

[36] Peter van Emde Boas, Robert Kaas, and Erik Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 10(1):99–127, 1976.

[37] Martin Wimmer, Jakob Gruber, Jesper Larsson Träff, and Philippas Tsigas. The lock-free k-LSM relaxed priority queue. In ACM SIGPLAN Notices, volume 50, pages 277–278. ACM, 2015.

[38] Yichao Zhou and Jianyang Zeng. Massively parallel A* search on a GPU. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 1248–1254. AAAI Press, 2015.
