
Machine Learning on Cell Processor
Paper on experimental setup for verifying "Slow Learners are Fast"

Submitted By: Robin Srivastava (Uni ID: U4700252)
Supervisor: Dr. Eric McCreath
Course: COMP8740


Abstract

The technique of delayed stochastic gradient descent given in the paper titled "Slow Learners are Fast" shows theoretically how the online learning process can be parallelized. However, with the real experimental setup given in that paper, the parallelization did not improve performance. In this project we implement and evaluate the algorithm on the Cell processor and on an Intel dual core processor, with the target of obtaining speedup under the outlined real experimental setup. We also discuss the limitations of the Cell processor pertaining to this algorithm, along with a suggestion of CPU architectures for which it is better suited.


1. Introduction
2. Background
    2.1 Machine Learning
    2.2 Algorithm (referenced from [Langford, Smola and Zinkevich, 2009])
    2.3 Possible Templates for Implementation
        a) Asynchronous Optimization
        b) Pipelined Optimization
        c) Randomization
    2.4 Cell Processor
    2.5 Experimental Setup
3. Design and Implementation
    3.1 Pre-processing TREC Dataset
        3.1.1 Intel Dual Core
        3.1.2 Cell Processor
        3.1.3 Representation of Emails and Labels
    3.2 Implementation of Logistic Regression
    3.3 Implementation of Logistic Regression with Delayed Update
        3.3.1 Implementation on a Dual Core Intel Pentium Processor
        3.3.2 Implementation on Cell Broadband Engine
4. Results
5. Conclusion and Future Work
Appendix I: Bag of Words Representation
Appendix II: Hashing
References


1. Introduction

The inherent properties of online learning algorithms make them an excellent way of making machines learn. This type of learning uses observations either one at a time or in small batches and discards them before the next set of observations is considered. Online algorithms are a suitable candidate for real-time learning, where data arrives as a stream and predictions must be made before the whole dataset has been seen. They are also useful for large datasets because they do not require the whole dataset to be loaded into memory at once.

On the flip side, this very property of sequentiality turns out to be a curse for performance. The algorithm is inherently sequential, and with the advent of multi-core processors this leads to severe under-utilization of the resources offered by these machines.

In Langford et al. [1], the authors give a parallel version of an online learning algorithm along with its performance data when run on a machine with eight cores and 32 GB of memory. They implemented the algorithm in Java. The simulation results were promising, and they obtained speedup as the number of threads increased, as shown in Figure 1. However, their attempt to parallelize the exact experiment failed because the serial implementation was already fast enough to handle over 150,000 examples/second. Since the mathematical calculations involved in this algorithm can be accelerated with SIMD operations, and Java has no programming support for SIMD, we have implemented and evaluated the algorithm on the Cell processor to exploit the SIMD capabilities of its specialized co-processors, with the aim of obtaining speedup for the real experimental setup. An implementation was also produced for a machine with an Intel dual core processor and 1.86 GB of RAM.

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), having a primary processor of 64-bit IBM PowerPC architecture and eight specialized SIMD-capable co-processors. Communication amongst these processors, their dedicated local stores and main memory takes place over a very high-speed channel with a theoretical peak transfer rate of 96 B/cycle. Data communication plays a very crucial role in the implementation of this algorithm on Cell, the primary reason being the large gap between the amount of data to be processed (approx. 76 MB) and the memory available to each of Cell's co-processors (256 KB). An efficient approach to bridging this gap is discussed in the design and implementation section, which also describes how the data was pre-processed for implementation on the Intel dual core and the Cell processor. The background section discusses the gradient descent and delayed stochastic gradient descent algorithms, the possible templates for the latter's implementation, an overview of the Cell processor, and the real experimental setup suggested by the designers of the algorithm. The results section presents a comparative study of the algorithm on both machines, and we conclude in the final section on conclusion and future work, which also suggests a CPU architecture for which this algorithm would be better suited and might be expected to give better speedup with reduced coding complexity.

Figure 1: From Langford et al. [1]


2. Background

2.1 Machine Learning

Machine learning is a technique by which a machine modifies its own behaviour on the basis of past experience and performance. The collection of data describing past experience and performance is called the training set. One method of making a machine learn is to pass the entire training set in one go. This method is known as batch learning. The generic steps for batch learning are as follows:

Step 1: Initialize the weights.
Step 2: For each batch of training data
    Step 2a: Process all the training data
    Step 2b: Update the weights

A popular batch learning algorithm is gradient descent, in which after every step the weight vector of the function moves in the direction of greatest decrease of the error function. Mathematically this is justified by the observation that if a real-valued function $F(x)$ is defined and differentiable in a neighbourhood of a point $a$, then $F(x)$ decreases fastest in the direction of the negative gradient of $F$ at $a$, i.e. $-\nabla F(a)$. Therefore, if $b = a - \eta \nabla F(a)$ for a small $\eta > 0$, then $F(a) \geq F(b)$. To perform the actual steps, the algorithm goes as follows:

Step 1: Initialize the weight vector $w^{(0)}$ with some arbitrary values.
Step 2: Update the weight vector as follows:
    $w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})$
    where $\nabla E$ is the gradient of the error function and $\eta$ is the learning rate.
Step 3: Repeat step 2 for all the batches of data.

This algorithm, however, does not prove to be very efficient (discussed in Bishop and Nabney, 2008). Two major weaknesses of gradient descent are:

1. The algorithm can take many iterations to converge towards a local minimum if the curvature in different directions is very different.

2. Finding the optimal $\eta$ per step can be time-consuming. Conversely, using a fixed $\eta$ can yield poor results.


Other more robust and faster batch learning algorithms include conjugate gradients and quasi-Newton methods. Gradient-based methods must be run multiple times to obtain an optimal solution, which proves computationally very costly for large datasets. Yet another way of making machines learn is to pass records from the training set one at a time (online learning). To overcome the aforementioned weaknesses of gradient-based methods, there is an online gradient descent algorithm that has proved useful in practice for training neural networks on large data sets (Le Cun et al. 1989). Also called sequential or stochastic gradient descent, it updates the weight vector of the function based on one record at a time; the update is done for each record either in consecutive order or randomly. The steps of stochastic gradient descent are similar to those outlined above for batch gradient descent, with the difference that one data point is considered per iteration.

The algorithm given in Section 2.2 is a parallel version of stochastic gradient descent based on the concept of delayed updates.
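To make the distinction concrete, the following is a minimal C sketch (illustrative only, not code from this project) contrasting a batch gradient-descent update with a stochastic one. DIM and gradient_of_example are assumed placeholders for the weight-vector dimension and the per-example error gradient.

#define DIM 100  /* dimensionality of the weight vector (illustrative) */

/* Placeholder: gradient of the error on a single training example (assumed provided elsewhere). */
extern void gradient_of_example(const float *w, int example, float *grad);

/* Batch gradient descent: accumulate the gradient over all N examples, then update once. */
void batch_step(float *w, int N, float eta)
{
    float g[DIM] = {0}, tmp[DIM];
    for (int n = 0; n < N; n++) {
        gradient_of_example(w, n, tmp);
        for (int j = 0; j < DIM; j++) g[j] += tmp[j];
    }
    for (int j = 0; j < DIM; j++) w[j] -= eta * g[j];
}

/* Stochastic (online) gradient descent: update the weights after every single example. */
void stochastic_pass(float *w, int N, float eta)
{
    float g[DIM];
    for (int n = 0; n < N; n++) {
        gradient_of_example(w, n, g);
        for (int j = 0; j < DIM; j++) w[j] -= eta * g[j];
    }
}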

2.2 Algorithm (referenced from [Langford, Smola and Zinkevich, 2009])

The goal is to find some parameter vector $w$ such that the sum over functions $f_i$ takes the smallest possible value. Instead of updating the parameter vector $w_t$ with the current gradient $g_t$, it is updated with a delayed gradient $g_{t-\tau}$. If $\tau = 0$, the algorithm reduces to standard stochastic gradient descent.

Input: feasible space $W \subseteq R^n$, annealing schedule $\eta_t$ and delay $\tau \in \mathbb{N}$
Initialization: set $w_1 = \dots = w_\tau = 0$ and compute the corresponding $g_t = \nabla f_t(w_t)$
For $t = \tau + 1$ to $T + \tau$ do
    Obtain $f_t$ and incur loss $f_t(w_t)$
    Compute $g_t = \nabla f_t(w_t)$
    Update $w_{t+1} = \operatorname{argmin}_{w \in W} \lVert w - (w_t - \eta_t g_{t-\tau}) \rVert$
End for
where $f_i : \chi \to R$ is a convex function and $\chi$ is a Banach space.

2.3 Possible templates for implementation

There are three suggested implementation models for delayed stochastic gradient descent. Following any of these three models leads to an effective implementation of the algorithm. Each model makes some assumptions about the dataset being used, so a model can be chosen by matching the constraints of the problem against those assumptions.

a) Asynchronous Optimization

Assume a machine with n cores, and further assume that the time taken to compute the gradient of $f_t$ is at least n times higher than the time taken to update the weight vector. We run stochastic gradient descent on all n cores of the machine on different instances of $f_t$ while sharing a common instance of the weight vector. Each core is allowed to update the shared copy of the weight vector in a round-robin fashion. This results in a delay of $\tau = n - 1$ between when a core sees $f_t$ and when it gets to update the shared copy of the weight vector. This template is primarily suitable when the computation on $f_t$ takes a long time. The implementation requires explicit synchronization because the update of the weight vector is an atomic operation; depending on the CPU architecture, a significant amount of bandwidth may be consumed solely by this synchronization.

b) Pipelined Optimization

In this form of optimization we parallelize the computation of $f_t$ itself instead of running separate instances on different cores. The delay occurs in the second stage of processing: while the second stage is still busy processing the result of the first, the first stage has already moved on to $f_{t+1}$. Even in this case the weight vector is computed with a delay of $\tau$.

c) Randomization

This form of optimization is used when there is high correlation between $f_t$ and $f_{t+\tau}$, so that the data cannot be treated as i.i.d. The observations are de-correlated by a random permutation of the instances. The delay in this case occurs during the update of the model parameters, because the range of de-correlation needs to exceed $\tau$ considerably.
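As an illustration of the delayed update rule of Section 2.2, the following is a minimal serial C sketch (not the project's code): the last tau gradients are kept in a small circular buffer and the weight vector is updated with the gradient computed tau steps earlier. DIM, TAU and compute_gradient are assumed placeholders, and the projection onto the feasible space W is omitted.

#include <string.h>

#define DIM 100   /* dimensionality of the weight vector (illustrative) */
#define TAU 5     /* delay between computing a gradient and applying it */

/* Placeholder: fills grad with the gradient of f_t at w (assumed provided elsewhere). */
extern void compute_gradient(const float *w, int t, float *grad);

void delayed_sgd(float *w, int T, float eta)
{
    /* circular buffer holding the last TAU gradients */
    float buf[TAU][DIM];
    memset(buf, 0, sizeof(buf));

    for (int t = 0; t < T; t++) {
        float g_t[DIM];
        compute_gradient(w, t, g_t);          /* g_t = grad f_t(w_t) */

        int slot = t % TAU;
        if (t >= TAU)                         /* buf[slot] holds g_{t-TAU} */
            for (int j = 0; j < DIM; j++)
                w[j] -= eta * buf[slot][j];   /* w_{t+1} = w_t - eta * g_{t-tau} */

        memcpy(buf[slot], g_t, sizeof(g_t));  /* remember g_t for step t + TAU */
    }
}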

2.4 Cell Processor

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA) (Figure 2), which emerged from a joint venture of IBM, Sony and Toshiba. It is a fully compatible extension of the 64-bit PowerPC Architecture. The design of CBEA was based on an analysis of workloads in a wide variety of areas such as cryptography, graphics transform and lighting, physics, fast Fourier transforms (FFT), matrix operations, and scientific workloads.

The Cell processor is a multicore, heterogeneous chip carrying one 64-bit power processor element (PPE), eight specialized single-instruction multiple-data (SIMD) co-processors called synergistic processing elements (SPEs) and a high-bandwidth bus interface (the Element Interconnect Bus), all integrated on-chip.

The PPE consists of a power processing unit (PPU) connected to 512 KB of L2 cache. It is the main processor of the Cell and is responsible for running the OS as well as managing the workload amongst the SPEs. The PPU is a dual-issue, in-order processor with dual-thread support; it can fetch four instructions at a time and issue two. To improve the performance of in-order issue, the PPE utilizes delayed-execution pipelines and allows limited out-of-order execution.

An SPE (Figure 4) consists of a synergistic processing unit (SPU) and a synergistic memory flow controller (SMF). It is used for the data-intensive applications found readily in cryptography, media and high-performance scientific computing. Each SPE runs an independent application thread, and the SPE design is optimized for computation-intensive applications. It has SIMD support, as mentioned above, and 256 KB of local store. The memory flow controller contains a DMA controller, a memory management unit (MMU) and an atomic unit to facilitate synchronization with the other SPEs and with the PPE. Like the PPU, the SPU is a dual-issue, in-order processor.

The SPU works on data that resides in its dedicated local store, and in turn depends on the channel interface for access to main memory and to the local stores of other SPEs. The channel interface runs independently of the SPU and resides in the MFC. In parallel, an SPU can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6 GFLOPS in single precision.

The PPE and SPEs communicate through an internal high-speed element interconnect bus (EIB) [2] (Figure 3). Apart from these processors, the EIB also allows communication with off-chip memory and external IO.

The EIB is implemented as a circular ring consisting of four 16-byte-wide unidirectional channels, two rotating clockwise and two anti-clockwise. Each channel can support up to three concurrent transactions. The EIB runs at half the system clock rate and thus has an effective channel rate of 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96 B per clock (12 concurrent transactions * 16 bytes wide / 2 system clocks per transfer). The theoretical peak of the EIB at 3.2 GHz is 204.8 GB/s.

Figure 2: Cell Broadband Engine Architecture

Figure 3: Element Interconnect Bus, from [3]

The memory interface controller (MIC) in the Cell BE chip is connected to the external RAMBUS XDR memory through two XIO channels operating at a maximum effective frequency of 3.2GHz. The MIC has separate read and write request queues for each XIO channel operating independently. For each channel, the MIC arbiter alternates the dispatch between read and write queues after a minimum of every eight dispatches from each queue or until the queue becomes empty, whichever is shorter. High-priority read requests are given priority over reads and writes. With both XIO channels operating at 3.2GHz, the peak raw memory bandwidth is 25.6GB/s. However, normal memory operations such as refresh, scrubbing, and so on, typically reduce the bandwidth by about 1GB/s.

2.5 Experimental Setup

The experiment uses asynchronous optimization (Section 2.3). Figure 5 schematically describes this optimization. Each core computes its own error gradient and updates a single copy of the weight vector shared amongst all the cores. The update is carried out in a round-robin fashion, and the delay between the computation of a gradient and the corresponding update of the weight vector is $\tau = n - 1$. Explicit synchronization is required for the atomic update of the weight vector. The experiment is run on the complete dataset using all the available cores.

Figure 4: SPE, from [4]

Figure 5: Asynchronous optimization — several cores compute error gradients on different data in parallel and update a shared weight vector.

Page 12: Paper  on experimental setup for verifying  - "Slow Learners are Fast"

11

3. Design and Implementation

There were three stages in the implementation of the project:

1. Pre-processing of the TREC dataset
2. Implementation of the logistic regression algorithm
3. Implementation of logistic regression in accordance with the methodologies suggested in the delayed stochastic gradient technique

3.1 Pre-processing TREC Dataset

3.1.1 Intel Dual Core

The dataset contains 75,419 emails. These emails were tokenized using a list of delimiter symbols: white space, comma (,), backslash (\), period (.), semi-colon (;), colon (:), single (') and double (") inverted commas, open and close parentheses, braces and brackets, greater-than (>) and less-than (<) signs, hyphen (-), at symbol (@), equals (=), new line (\n), carriage return (\r) and tab (\t). Tokenization with this symbol list yielded 2,218,878 different tokens. A dictionary of tokens, containing the token name along with a unique index for each token, was created and stored in a file (the dictionary).

Figure 6: Pre-processing the TREC dataset — the raw dataset yields a complete dictionary and a condensed dictionary; mails are converted to vectors and saved to disk as file sets F1 and F2.
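The exact pre-processing code is not reproduced in this report, but a minimal sketch of how the delimiter list above could drive tokenization (using the standard C strtok) might look as follows; add_to_dictionary is an assumed placeholder for the dictionary-building step.

#include <string.h>

/* Delimiters listed above: whitespace, punctuation, quotes, brackets, etc. */
static const char *DELIMS = " ,\\.;:'\"(){}[]<>-@=\n\r\t";

/* Placeholder: records the token in the dictionary and returns its index (assumed elsewhere). */
extern unsigned int add_to_dictionary(const char *token);

void tokenize_email(char *body)
{
    char *tok = strtok(body, DELIMS);
    while (tok != NULL) {
        add_to_dictionary(tok);    /* token name + unique index are stored in the dictionary file */
        tok = strtok(NULL, DELIMS);
    }
}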

3.1.2 Cell Processor

Due to memory limitations on the Cell processor a condensed form of the dictionary was used, containing the first hundred features of the complete dictionary. On one hand the reduced size affected the accuracy of the algorithm; on the other it made the implementation more suitable for Cell. With the condensed form, 32 mail vectors (the vector form of mail representation is discussed below) could be transferred per MFC operation, as opposed to MFC operations on the order of tens for the transfer of a single mail vector when the complete dictionary is used.

3.1.3 Representation of Emails and Labels

The emails were represented as linear vectors using a simple bag of words representation (Appendix I). Each email was stored in a struct data type holding an unsigned int for the index value and a short for the weight of the respective index. Since the dimensionality of the complete dataset is very high, hashing (Appendix II) was used with 2^18 bins. While constructing the dictionary it initially took approximately 3 hours to process ~6,000 emails; with hashing this was drastically reduced, and it finally took approximately half an hour to process all the emails in the dataset. Once the dictionary was in place, along with a working framework for hashing, a second pass over the entire dataset was carried out. In this pass each email was converted to a bag of words representation and stored in a separate file, with the following format:

Figure 7: Email files after pre-processing

The labels were provided separately in an array of short type. A label '1' signified that an email is 'ham' and a label '-1' that it is 'spam'.

Since each mail was stored in vector form in its own file, it took on average only 0.03 ms (on the Intel dual core at 2 GHz) to parse an email and load it into memory for logistic regression.
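Based on the description above, a sketch of the sparse mail-vector representation and of parsing one pre-processed email file is given below. The field names, the per-email feature limit and the assumption that each file stores index/count pairs (the exact format of Figure 7 is not reproduced in this text) are illustrative, not the project's actual definitions.

#include <stdio.h>

/* One non-zero dimension of a mail vector: the hashed token index and its count. */
struct feature {
    unsigned int index;   /* token index after hashing into 2^18 bins */
    short        count;   /* number of occurrences of that token in the email */
};

/* A mail vector is simply the list of features present in one email. */
struct mail_vector {
    int            nfeatures;
    struct feature features[4096];   /* illustrative upper bound per email */
};

/* Sketch: read "index count" pairs from one pre-processed email file (assumed format). */
int parse_mail_file(const char *path, struct mail_vector *mv)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) return -1;

    mv->nfeatures = 0;
    unsigned int idx;
    short cnt;
    while (mv->nfeatures < 4096 && fscanf(fp, "%u %hd", &idx, &cnt) == 2) {
        mv->features[mv->nfeatures].index = idx;
        mv->features[mv->nfeatures].count = cnt;
        mv->nfeatures++;
    }
    fclose(fp);
    return mv->nfeatures;
}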

3.2 Implementation of logistic regression

For a two-class problem ($C_1$ and $C_2$), the posterior probability of class $C_1$ given the input data $x$ and a set of fixed basis functions $\phi = \phi(x)$ is defined by the softmax transformation

$$p(C_1 \mid \phi) = y_1(\phi) = \frac{\exp(a_1)}{\exp(a_1) + \exp(a_2)} \tag{3.1}$$

where the activation $a_k$ is given by

$$a_k = w_k^{T}\phi \tag{3.2}$$

with $p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)$ and $w_k$ being the weight vector for class $C_k$.

The likelihood function for the input data $x$ and target data $T$ (coded in the 1-of-K coding scheme) is then

$$p(T \mid w_1, w_2) = \prod_{n=1}^{N} p(C_1 \mid \phi_n)^{t_{n1}}\, p(C_2 \mid \phi_n)^{t_{n2}} = \prod_{n=1}^{N} y_{n1}^{t_{n1}}\, y_{n2}^{t_{n2}} \tag{3.3}$$

where $y_{nk} = y_k(\phi(x_n))$ and $T$ is the $N \times 2$ matrix of target variables with elements $t_{nk}$.

The error function is obtained by taking the negative logarithm of the likelihood, and its gradient can be written as

$$\nabla_{w_j} E(w_1, w_2) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n \tag{3.4}$$

The weight vector $w_k$ for class $C_k$ is updated as follows:

$$w_k^{(\tau+1)} = w_k^{(\tau)} - \eta\, \nabla_{w_k} E(w_1, w_2) \tag{3.5}$$

where $\eta$ is the learning rate.

In this project the first class corresponds to an email being 'ham' and the second to it being 'spam'. The feature map $\phi$ is the identity function, $\phi(x) = x$. The weight vectors are initialized to zero.

For the purpose of comparison, two versions of logistic regression were implemented: a sequential version and a parallel version. As claimed by the authors of the delayed stochastic gradient technique, the parallel version gave better performance than the sequential version without affecting the correctness of the result. The comparison of performance is given in Section 4.
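To connect equations 3.1-3.5 to the sparse mail representation, the following sketch (illustrative, not the project's exact code) computes the per-email gradient contribution and applies the update of equation 3.5, touching only the indices present in the mail vector (see Section 3.3). For brevity the two weight vectors w1 and w2 are folded into a single difference vector, which gives the same decision function, and the labels 1/-1 are mapped to targets 1/0; the struct definitions mirror the earlier sketch.

#include <math.h>

/* sparse mail representation, mirroring the earlier sketch */
struct feature     { unsigned int index; short count; };
struct mail_vector { int nfeatures; struct feature *features; };

/* a = w^T phi(x) on one sparse mail vector (w has 2^18 entries, one per hash bin) */
static float activation(const float *w, const struct mail_vector *mv)
{
    float a = 0.0f;
    int i;
    for (i = 0; i < mv->nfeatures; i++)
        a += w[mv->features[i].index] * (float) mv->features[i].count;
    return a;
}

/* one stochastic update of equation 3.5: w <- w - eta * (y - t) * phi(x),
 * where y = p(C1 | phi) from equation 3.1; the two-class softmax reduces
 * to a sigmoid when w is taken as the difference w1 - w2                 */
void update_on_mail(float *w, const struct mail_vector *mv, int is_ham, float eta)
{
    float a   = activation(w, mv);
    float y   = 1.0f / (1.0f + expf(-a));        /* p(C1 | phi) */
    float err = y - (is_ham ? 1.0f : 0.0f);      /* (y_n - t_n1) term of equation 3.4 */
    int i;

    /* only the indices present in this mail contribute (see Figure 8) */
    for (i = 0; i < mv->nfeatures; i++)
        w[mv->features[i].index] -= eta * err * (float) mv->features[i].count;
}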

3.3 Implementation of Logistic Regression with delayed update

To incorporate the concept of delayed updates, equation (3.5) above was changed according to the algorithm described in Section 2.2. This required computing the error gradient separately on divided subsets of the input. The division of the input was carried out differently for the Intel dual core and the Cell processor: for the former the division was more direct, with less programming complexity, while for the latter it had to be carried out explicitly and involved significant programming complexity. The division of data is explained in detail in the following discussion.

The representation chosen for the mails helps improve the time performance of the algorithm. Since we store the indices of the vectors, updating the weight vector with the contributions of a specific mail vector does not require iterating through the complete dimension of the weight vector and error gradient; a mail vector only affects the indices that are present in it. Figure 8 below shows this concept pictorially.

Figure 8: Sparse update — only the entries of the D-dimensional weight vector and error gradient whose indices appear in a mail vector (stored as index/count pairs) are touched.

3.3.1 Implementation on a Dual Core Intel Pentium processor

For implementation on the Intel dual core machine (2 GHz with 1.86 GB of main memory) the emails processed with the complete dictionary were used. The mail vectors were created as and when they were required. The first core processed all the odd-numbered emails and the second all the even-numbered ones. Each core computed the error gradient separately and maintained a private copy of the weight vectors; the shared copy of the weight vectors was updated atomically by both cores.

This implementation used OpenMP constructs to parallelize the algorithm. OpenMP helped with the division of emails: the thread number was combined with a counter to determine the mail number, which ensured that no two threads would access the same data. A sketch of this scheme is given below.

Figure 9: Implementation on the Intel dual core — core 1 processes the odd-numbered mails and core 2 the even-numbered ones; each computes an error gradient and the shared weight vector is updated atomically.
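A minimal OpenMP sketch of the scheme in Figure 9 is given below (illustrative assumptions: accumulate_gradient is a placeholder for the per-mail error-gradient computation, and the struct definitions mirror the earlier sketches). Each of the two threads processes alternate emails, keeps a private error gradient, and folds its contribution into the shared weight vector inside a critical section, which makes the update atomic; the delay of tau = 1 arises implicitly from the two cores interleaving their updates.

#include <omp.h>
#include <stdlib.h>

#define DIM (1 << 18)           /* 2^18 hash bins */

struct feature     { unsigned int index; short count; };
struct mail_vector { int nfeatures; struct feature *features; };

/* Placeholder: adds the error-gradient contribution of one mail into grad (assumed elsewhere). */
extern void accumulate_gradient(const float *w, const struct mail_vector *mv,
                                int label, float *grad);

void delayed_update_dual_core(float *shared_w, struct mail_vector *mails,
                              int *labels, int nmails, float eta)
{
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        float *grad = calloc(DIM, sizeof(float));   /* private error gradient per core */

        /* thread 0 takes mails 0, 2, 4, ...; thread 1 takes 1, 3, 5, ... */
        for (int n = tid; n < nmails; n += 2) {
            accumulate_gradient(shared_w, &mails[n], labels[n], grad);

            /* fold this mail's contribution into the shared weight vector;
             * the critical section makes the update atomic across the two cores */
            #pragma omp critical
            for (int i = 0; i < mails[n].nfeatures; i++) {
                unsigned int j = mails[n].features[i].index;
                shared_w[j] -= eta * grad[j];
                grad[j] = 0.0f;                      /* reset so the next mail starts fresh */
            }
        }
        free(grad);
    }
}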


3.3.2 Implementation on Cell Broadband Engine

The implementation of the algorithm on the Cell processor used the processed mails generated from the condensed dictionary. The data was divided sequentially into chunks, one per SPE. The PPE was responsible for constructing the labels and the array of mail vectors, and the data was made available to the SPEs through MFC operations. Each MFC operation transferred the data for 32 mails; this value was chosen because of the limited capacity (256 KB) of the SPE local store.

The SIMD capability of the Cell could not be exploited in the implementation model shown in Figure 10. A full-scale SIMD implementation requires converting the data into the __vector form used by the SPEs, and since the indices are stored separately, this conversion would require rearranging the data according to those indices. The rearrangement would require a large number of load operations and would cancel out the overall benefit of the SIMD operations. The time complexity of converting the data to __vector form would be O(N^2), where N is the dimension of a mail vector.

Figure 10: Implementation on Cell — the PPE reads the mail vectors (set F2) from main memory and distributes them to the SPEs; each SPE computes an error gradient and updates the weight vector.


For the parallel version of the algorithm each SPE required a maximum of four weight vectors to be stored in its local store: two owned privately by the SPE and two shared among all the SPEs. Along with the weight vectors each SPE would also be required to store two error gradients. The data type of each of these quantities is float, so for a dictionary containing 2,218,878 features the memory requirement is of the order of megabytes. The following two data structures were considered for storing these quantities:

a) Storing the complete data as an array of the required dimension. This data structure is straightforward and easy to implement, but it potentially wastes memory. For the original dimension of 2,218,878 it would require approx. 50 MB of memory for each SPE instance, which is clearly not feasible as each SPE local store is only 256 KB.

b) Using a struct holding an index and a count value for each entry. Since most of the values in the weight vector and error gradient are not required (refer to the discussion pertaining to Figure 8), this data structure significantly reduces the required memory, theoretically to the order of a few MB (approx. 3). This is still not feasible because of the limited size of the SPE local store.

With the data generated from the condensed dictionary and the latter data structure, the requirement was reduced to 2400 bytes. The rest of the memory available in the local store was used for storing the mail vectors and the target labels.

To hide the latency of transferring data from main memory to the local store of an SPE, the technique of double buffering can be used: while the SPU is performing computation on one buffer of data, the MFC brings more data from main memory into the other buffer, so the wait for data transfer is reduced and the transfer latency is hidden (either partly or completely). The algorithm for processing with double buffering is as follows:

1. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #1.
2. The SPU queues a DMA GET to pull a portion of the problem data set from main memory into buffer #2.
3. The SPU waits for buffer #1 to finish filling.
4. The SPU processes buffer #1.
5. The SPU (a) queues a DMA PUT to transmit the contents of buffer #1 and then (b) queues a DMA GETB to execute after the PUT, refilling the buffer with the next portion of data from main memory.
6. The SPU waits for buffer #2 to finish filling.
7. The SPU processes buffer #2.
8. The SPU (a) queues a DMA PUT to transmit the contents of buffer #2 and then (b) queues a DMA GETB to execute after the PUT, refilling the buffer with the next portion of data from main memory.
9. Repeat from step 3 until all data has been processed.
10. Wait for all buffers to finish.
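A condensed sketch of this double-buffering loop for the SPU side is given below, using the MFC intrinsics from spu_mfcio.h (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). The chunk size, the process_chunk placeholder and the effective-address bookkeeping are illustrative assumptions, and the DMA PUT of results (steps 5a and 8a) and the GETB barrier form are omitted for brevity.

#include <spu_mfcio.h>

#define CHUNK_BYTES (32 * 128)   /* one DMA transfer: 32 mail vectors (illustrative size) */

static volatile char buf[2][CHUNK_BYTES] __attribute__((aligned(128)));

/* Placeholder: computes the error gradient for the 32 mails held in a buffer (assumed elsewhere). */
extern void process_chunk(volatile char *chunk);

void process_all(unsigned long long ea, int nchunks)
{
    int cur = 0;

    /* steps 1-2: prefetch the first two chunks, one per buffer, using tags 0 and 1 */
    mfc_get(buf[0], ea, CHUNK_BYTES, 0, 0, 0);
    if (nchunks > 1)
        mfc_get(buf[1], ea + CHUNK_BYTES, CHUNK_BYTES, 1, 0, 0);

    for (int i = 0; i < nchunks; i++) {
        /* steps 3/6: wait only for the buffer we are about to process */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        /* steps 4/7: compute on this buffer while the other transfer is in flight */
        process_chunk(buf[cur]);

        /* steps 5/8 (GET part only): refill the just-consumed buffer with the next pending chunk */
        if (i + 2 < nchunks)
            mfc_get(buf[cur], ea + (unsigned long long)(i + 2) * CHUNK_BYTES,
                    CHUNK_BYTES, cur, 0, 0);

        cur ^= 1;   /* step 9: alternate buffers */
    }
}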



4. Results

The experiments on the Intel dual core machine were run using the mails processed with the complete dictionary. The time taken on this machine is significantly higher than on Cell. The serial implementation of logistic regression on the Intel dual core took 36.93 s and 36.45 s over two runs.

The time taken by the parallel implementation using the delayed stochastic gradient method is given in Table 1.

Number of Threads    Time in seconds (run 1)    Time in seconds (run 2)
1                    113.09                     47.09
2                    20.85                      20.92

Table 1

For a single thread in the first run the time taken is very large compared to any other time, because most of the memory load operations result in cache misses. Since all the runs were performed consecutively, the later times are drastically reduced by the lower cache miss rate. It is also observed that the algorithm performs worse with a single thread than the serial implementation. These times should theoretically be the same; however, in the delayed stochastic case extra time is spent dividing the data, which with one thread ends up not being used anywhere.

Table 2 below shows the performance of the algorithm on multiple SPEs. The time performance improves as the number of SPEs increases; these values are plotted in the graph below. Although the SPE results show better times, the accuracy suffers to a great extent (results not shown here).

Number of SPEs    Time in microseconds
1                 47398
2                 44419
3                 42407
4                 42384
5                 42144
6                 41966

Table 2

[Graph: time in microseconds vs. number of SPEs, plotted from Table 2]

The use of the condensed dictionary comes with a severe penalty in accuracy. This could be addressed by using the complete dictionary; however, the memory limitation of the Cell processor prevented its use.


5. Conclusion and Future Work

The approach of delayed updates shows better time performance as the degree of parallelization increases. The improvement was shown for the Intel dual core as well as for the Cell processor. The former machine, being SMP capable, had less data-division overhead than the latter. The Cell processor posed several limitations on the implementation of this algorithm, the primary one being the memory limitation, which caused extra communication overhead. A dataset with smaller feature vectors might be expected to achieve better speedup on this machine. For a dataset with large feature vectors, this algorithm might perform better on a symmetric multiprocessing (SMP) machine. A further study of this algorithm could be carried out on a more powerful SMP-capable machine with a large amount of main memory, since the amount of memory required to store the data doubles with each unit increase in the level of parallelization.


Appendix I

Bag of Words Representation

A bag-of-words representation is a model for representing a sentence in the form of a vector. It is frequently used in natural language processing and information retrieval. The model represents a sentence as an unordered collection of words, without any regard for grammar.

To form the vector for a sentence, all the distinct words in it are first identified. Each distinct word is given a unique identifier called an index, and each index serves as a dimension in a D-dimensional vector space, where D is the total number of unique words. The magnitude of the vector in a particular dimension is the count of words having that index. The process requires two passes over the entire dataset: in the first pass a dictionary containing the unique words along with their indices is created, and in the second pass the vectors are formed by referring to the dictionary.

For example, consider the following sentence:

What do you think you are doing?

Word     Index
what     0
do       1
you      2
think    3
are      4
doing    5

The resulting vector for the above sentence would be as follows:

1(0) + 1(1) + 2(2) + 1(3) + 1(4) + 1(5)

The vector dimension is given in parentheses and the respective magnitudes alongside. The magnitude of dimension 2 is 2 because the word "you" appears twice in the sentence; the others are one for the same reason.


Appendix II

Hashing

Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.

The hashing function used in this project is the same as the one used by Oracle's JVM (the polynomial hash of Java's String.hashCode). The code snippet performing the hashing is shown below; the SIZE constant corresponds to the 2^18 bins mentioned in Section 3.1.3.

#include <math.h>

#define SIZE (1 << 18)   /* number of hash bins (2^18, see Section 3.1.3) */

unsigned int hashCode(char *word, int n)
{
    unsigned int h = 0;
    int i;
    /* polynomial hash: word[0]*31^(n-1) + word[1]*31^(n-2) + ... + word[n-1] */
    for (i = 0; i < n; i++)
        h += word[i] * pow(31, n - i - 1);
    return h % SIZE;
}


References

[1] John Langford, Alexander J. Smola and Martin Zinkevich. Slow Learners are Fast. In Advances in Neural Information Processing Systems 22, 2009.

[2] Michael Kistler, Michael Perrone and Fabrizio Petrini. Cell Multiprocessor Communication Network: Built for Speed.

[3] Thomas Chen, Ram Raghavan, Jason Dale and Eiji Iwata. Cell Broadband Engine Architecture and its First Implementation.

[4] Jonathan Bartlett. Programming High-Performance Applications on the Cell/B.E. Processor, Part 6: Smart Buffer Management with DMA Transfers.

[5] Introduction to Statistical Machine Learning, 2010 course, Assignment 1.

[6] Christopher Bishop. Pattern Recognition and Machine Learning.