
A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems

Team: 51
Steven Chen, Los Alamos High School
Final Project Report, April 5th, 2016

Executive Summary

In this project, I studied parallel programming on multicore computing systems and investigated potential parallel processing capabilities using data encoding procedures. Data coding techniques have been used in wired and wireless data transmission for many years. These techniques involve heavy mathematical computing operations. Byte streams are typically used in traditional data transmission, and embedded processors are normally used to handle efficient byte stream data coding. The "software data coding" method was not used in very large scale storage systems until ten or twenty years ago, because CPU computing power at the time could not efficiently handle software coding processes. Today, advanced multicore computing systems are commonly used in many commercial and scientific applications. I studied the data coding problem on storage systems because I wanted to explore the task parallelism available on a multicore computing system and efficiently apply parallel processing capabilities to the data coding problem.

I implemented parallel erasure coding software called parZfec. I then conducted various workload tests on a single compute node and on a multi-node cluster, and demonstrated the advantage of using parallel processing on multicore computing systems. The parZfec software showed significant reductions in total encoding time. Furthermore, my results showed almost linear-scaling bandwidth on single multicore machines and on multicore cluster machines, in terms of both processing time reduction and bandwidth improvement. I also demonstrated that the data coding problem on storage systems can be solved effectively and efficiently by task parallelism on multicore computing systems.


1. Introduction – Investigating the Problem and Motivation

1.1 Investigating the Problem: Multicore programming and data encoding challenges in big data computing environments

Today, multicore processors (Intel or ARM based) are used in numerous commercial products, such as smartphones, tablets, laptops, desktops, workstations, and servers. Almost all computers are now parallel computers. People often ask, "Why can't my applications run on multiple CPU cores?" To take advantage of multicore chips, applications and systems must be redesigned to fully utilize the potential parallelism in applications and the available parallel computing capabilities [1]. Furthermore, to capitalize fully on the power of multicore systems, we need to adopt parallel programming models, parallel algorithms, and parallel programming libraries that are appropriate for these multicore computing systems.

Data coding techniques have been used in wired and wireless data transmission for many years. These techniques involve heavy mathematical computing operations. Byte streams are typically used in traditional data transmission, and embedded processors are normally used to handle efficient byte stream data coding. The "software data coding" method was not used in very large scale storage systems until ten or twenty years ago, because CPU computing power at the time could not efficiently handle software coding processes. Instead, hardware-based technologies have generally been used to support coding solutions, such as RAID (originally redundant array of inexpensive disks, now commonly redundant array of independent disks). RAID technologies were proposed, designed, and deployed in storage systems starting around 1990 [2]. As hard disk capacity grows ever larger, the data rebuilding time after a disk failure gets dangerously long, which also increases the chance that another disk fails before the rebuilding process completes. Hardware-based RAID technologies are running into these limitations and cannot handle very large storage systems [2].

Software coding processes have become increasingly popular in recent years. Today, with advanced hardware (HW) design technologies, such as multicore CPUs and advanced vector processing techniques (SIMD: Single Instruction, Multiple Data), individuals and corporations can effectively apply software coding techniques to storage systems. Data storage systems can greatly benefit from software coding techniques through the improved reliability of the storage media [3][4][10][11][12].

1.2 Motivation

My main motivation behind this project was to answer the question: "How can multicore computing systems be efficiently utilized for the data coding problem?" By definition, a multicore processor is a single computing component with two or more independent processing units (called "cores"). Cores are the units that read and execute program instructions. The performance improvement gained by using a multicore processor depends on the software algorithms used as well as their implementation. In particular, possible gains are limited by the portion of the software that can run in parallel simultaneously on multiple cores. The challenge of programming multicore processors is real, with tangible impacts: there is little or no hidden parallelism to be found, so parallelism must be exposed to and managed by software. Concurrent computing and parallel computing are the two most common approaches to utilizing multicore computing systems.
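That last limit is made precise by Amdahl's law (a standard result, stated here only for context): if a fraction \(p\) of a program's work can run in parallel on \(n\) cores, the achievable speedup is bounded by

\[ S(n) = \frac{1}{(1-p) + p/n} \]

For example, with \(p = 0.95\) and \(n = 16\) cores, \(S \approx 9.1\) rather than 16.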

The current data coding software approach uses a single process [3][10][11], which cannot fully utilize a multicore computing system. Most applications still use single-CPU computing models. To improve this situation and to capitalize fully on the power of multicore systems, we need to adopt programming models, parallel algorithms, and programming languages that are appropriate for the multicore world. Furthermore, we need to integrate these ideas and tools into the courses that educate the next generation of computer scientists.

In this project, I investigated the potential parallelism of the data encoding problem on shared and non-shared data objects. I leveraged an open-source erasure code software library, applied the MPI parallel programming library [6][7], and implemented parallel erasure coding software (parZfec) for multicore computing systems. For this project I chose the concurrent computing approach, implemented as parallel coding software. The focus of this project was to study the impact of concurrent coding processes on multicore computing systems, modeled with concurrent data encoding software written with MPI [8][9]. Finally, the results were evaluated by examining the parallel coding software in terms of coding bandwidth improvement and energy cost savings. As mentioned previously, this project was implemented as an MPI program that models the current software coding process. The software was developed on a multicore x86-based workstation.


2. Introduction of the Data Encoding Problem

A typical data encoding and decoding process is shown in Figure 1.

Figure 1: Data coding process – encoding and decoding (source data → encoder → communication channel or storage devices → decoder → data sink)

Reed-Solomon coding [4] illustrates the encoding process. Reed-Solomon codes are block-based error correcting codes with numerous applications in digital communication and storage. A Reed-Solomon coding scheme is normally described by a pair (K, M): it takes a message, breaks it into K pieces, adds M "parity" pieces, and can then reconstruct the original from any K of the (K+M) pieces.

The Reed-Solomon algorithm creates a coding matrix by which the data matrix is multiplied to create the coded data. The examples below use a "4+2" coding scheme, where the original file is broken into 4 pieces and 2 parity pieces are added (Figure 2A). The matrix is set up so that the first four rows of the result are the same as the first four rows of the input. This means the data is left intact, and all the multiplication really does is compute the parity [3].

Figure 2A: Reed-Solomon encoding example – coding matrix × original data = encoded data (4 data chunks + 2 code chunks)

Both Reed-Solomon encoding and decoding processes involve heavy array, matrix, and table look-up procedures [4]. They require powerful computing systems to handle these computing procedures.



Here I provide a simplified description of the coding arithmetic. Let \(D\) be the source data, \(I\) the identity matrix, and \(F\) the checksum calculation matrix. Define

\[ A = \begin{bmatrix} I \\ F \end{bmatrix}, \qquad E = \begin{bmatrix} \text{Data} \\ \text{Code} \end{bmatrix} \]

where \(F\) has entries \( f_{i,j} = j^{\,i-1} \). A simplified encoding process can then be viewed as

\[ A \cdot D = E \]

I applied the encoding process to the source data \(D\), generated encoded data chunks and code chunks, and stored the generated chunks on storage devices:

\[ \{\text{Data chunks}\} = I \cdot D, \qquad \{\text{Code chunks}\} = F \cdot D \]

Figure 2B: A brief matrix calculation of the Reed-Solomon encoding – the identity matrix \(I\) yields the N data chunks and the checksum coding matrix \(F\) yields the M data checksum code chunks.
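To make the 4+2 example from Figure 2A concrete, the encoding can be written out as follows. For readability the checksum rows below use the \( f_{i,j} = j^{\,i-1} \) definition with ordinary integer arithmetic; a production implementation evaluates the same expressions over a Galois field such as GF(2^8):

\[
A \cdot D =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 \\
1 & 2 & 3 & 4
\end{bmatrix}
\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_1+d_2+d_3+d_4 \\ d_1+2d_2+3d_3+4d_4 \end{bmatrix}
\]

The first four output rows are the unmodified data chunks; the last two are the code chunks. In this small example, any four of the six rows of \(A\) form an invertible 4×4 matrix, which is why any four surviving chunks suffice to recover \(d_1,\dots,d_4\).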


3. The Proposed parZfec Software Approach and Parallel Programming Model

parZfec can handle two data encoding process types: multiple non-shared-object encoding (an N-to-M parallel I/O model) and a single shared-object encoding process for very large files (a 1-to-N parallel I/O model). I designed four different MPI process types in parZfec: the Main Process (MPI rank 0), the SourceDataExplorer process (MPI rank 1), the FinalReport process (MPI rank 2), and the parZfec processes (MPI ranks 3 to N-1). The parZfec software system is shown in Figure 3.

Figure 3: parZfec software system diagram – the SourceDataExplorer scans the source object data sets into an object data list; multiple parZfec processes encode each object into data chunks and code chunks, which are written to the destination.


N MPI processes are launched in the parZfec software. At least four MPI processes are required to launch this parZfec program. The function of each type of MPI process is described as follows:

• mainManager (MPI process Rank 0): The function of this process is to coordinate the other three MPI process types and handle the overall administrative jobs. First, it sends the "INIT" message to all other MPI processes and starts the initialization process. It asks the SourceDataExplorer process to discover how many source objects were passed to the parZfec program and are waiting to be processed. It then checks the incoming message queue and determines which parZfec process is asking for a source object. It also inspects each message received from a parZfec process and updates the encoding result for each processed data object. It then picks an unprocessed source data object from the source data queue, packs the object's information into a message, and sends it back to the parZfec process. A demand-based workload assignment protocol is maintained between the mainManager process and the parZfec processes. When all source data objects have been processed by parZfec processes, the mainManager process issues an "ENDTAG" message to all parZfec processes and finalizes the encoding process.

• SourceDataExplorer (MPI process Rank 1): The SourceDataExplorer process performs a READDIR/file tree walk over the "source object data" location. It explores each individual source data object and records each object's information: filename/object name, object ID, size, encoding scheme ratio (K, M), and destination location. All source data objects are put into a dataQueue for later workload assignment. The SourceDataExplorer process also handles multi-part object data partitioning: if a file is too big, it invokes a partitioning process (the single shared object 1-to-N parallel I/O chunking process), divides the file into N sub-chunks, and inserts the information for each sub-chunk into the data queue. The single big shared object 1-to-N parallel I/O chunking process is still under implementation and its results are not included in this report. By default, only one SourceDataExplorer process is launched; more are used if a very large data set is processed.

• FinalReport (MPI process Rank 2): After all the source data objects have been encoded, the mainManager process asks the FinalReport process to prepare a final report. The final report includes:

A. For each parZfec process: the number of data objects assigned; each assigned object's encoding start and finish times; the start time of the first assigned object and the finish time of the last; total encoding time; total encoded data size; the average, minimum, and maximum assigned data sizes; and the encoding bandwidth
B. Accumulated encoding bandwidth across the parZfec processes
C. Parallel encoding time calculation across the parZfec processes

• parZfec (MPI process Ranks 3 to N-1; at least one parZfec process is launched): After receiving the "INIT" message from the mainManager process, each parZfec process sends a "WORKLOAD" tag message asking for the next data object to encode. When a parZfec process has finished encoding an assigned data object, it stores the encoding results, such as the start time, finish time, and parZfec process rank, in a message and sends this message with a "WORKLOADDONE" tag to the mainManager. After the mainManager receives the "WORKLOADDONE" message from a parZfec process, it checks and decides the next data object assignment. The parZfec process is the core computing procedure that realizes the coding task. In this project, I intended to exploit the task parallelism of a multicore computing system and study the benefit of applying parallel programming on multicore computing systems.
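The demand-based assignment loop just described can be summarized with a short sketch. This is a simplified illustration, not the full implementation (see Appendix A); it assumes the workloadInfo struct, the committed MPI datatype mpi_zfec_type, and the message tag constants from the Appendix A source:

/* Sketch of one parZfec worker's request/encode/report loop. */
void parZfecWorkerLoop(int myRank)
{
    workloadInfo msg;
    MPI_Status status;
    struct tms t;
    char cmd[600];

    for (;;) {
        /* ask the mainManager (rank 0) for the next data object */
        MPI_Send(&msg, 1, mpi_zfec_type, 0, WORKTAG, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, mpi_zfec_type, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == ENDTAG)   /* all source objects processed */
            break;

        msg.startTime = times(&t);      /* record encoding start time */
        /* run the single-process zfec encoder on the assigned object */
        snprintf(cmd, sizeof(cmd), "%s %s", msg.zfecCmd, msg.objName);
        system(cmd);
        msg.endTime = times(&t);        /* record encoding finish time */
        msg.myRank = myRank;

        /* report the result so the mainManager can pick the next assignment */
        MPI_Send(&msg, 1, mpi_zfec_type, 0, WORKDONETAG, MPI_COMM_WORLD);
    }
}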

4 Software and Hardware Used in Code Development and Testing

4.1 Software

• C programming language: gcc C compiler
• Python application software: Python 2.7 release
• CentOS 7.2 Linux operating system
• MPI programming library: OpenMPI 1.10.1 release
• Zfec: the "zfec" package implements an "erasure code", or "forward error correction code". It is largely based on the old "fec" library by Luigi Rizzo et al., which is a mature and optimized implementation of erasure coding. The zfec package makes several changes from the original "fec" package, including the addition of a Python API, a refactoring of the C API to support zero-copy operation, a few clean-ups and optimizations of the core code itself, and the addition of a command-line tool named "zfec" [10]. zfec runs as a single-process program; I reused it as a library function called from the parZfec software (an example invocation is shown below). I did not intend to implement new erasure coding software, because existing open-source single-process solutions have proved accurate and are already widely deployed in commercial application software. My intention in this project was to turn a single-process erasure coding procedure into a parallel erasure coding procedure on a multicore computing system.

4.2 Hardware:

I set up a three-node multicore test cluster (Figure 4):
o Compute nodes: two Supermicro 1U servers, each with dual quad-core CPUs (16 threads), 64 GB DDR3 memory, four HGST 4TB SATA disks, and one 10-Gigabit link. A BTRFS RAID0 was configured on the four local SATA disks. I launched MPI processes on one or two compute nodes.
o Storage node: one Supermicro 1U server with dual quad-core CPUs (16 threads), 16 GB DDR3 memory, four Samsung 1TB SSD devices, and dual 10-Gigabit links. A BTRFS RAID0 was configured on the local SSD devices, and dual 10-Gigabit link bonding was configured so the storage node could provide enough data communication bandwidth to support parallel erasure coding from both compute nodes.
o One 24-port Arista 10-Gigabit switch provided connectivity.

An illustrative launch command is shown below.
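The launch command follows the usage line embedded in the Appendix A source. A single-node run with 16 parZfec workers (19 MPI processes in total, as in Table 1) would look roughly like this; the hostfile contents and directory paths are illustrative:

    # hostlist (OpenMPI hostfile): one entry per compute node, e.g.
    #   node1 slots=19
    mpirun -np 19 --hostfile hostlist parZFEC -m 8 -k 5 -o /mnt/src/ -d /mnt/dest/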


Figure 4: Test bed setup – compute node-1 and compute node-2 connected to the storage node through a 10-GigE switch

5 Testing Data Generation and Performance Testing

After the software had been developed, I conducted numerous tests on a single multicore machine and on the three-node multicore cluster.

5.1 Data Set Generation

I used a Perl script to generate the testing data sets. The Perl data generation script is shown in Figure 5.

In this Perl script, I used a random number generator and the "dd" command to generate data sets. A description of the Perl data generation script:

• scopeList controls the data size distribution. I designed the following cumulative data size distribution (used for both data sets):
  <= 50MB: 58.33%, <= 100MB: 66.67%, <= 400MB: 83.33%, <= 1000MB: 91.67%, <= 4000MB: 100.00%
• The first random number selects a data size range from scopeList.
• A second random number within the selected range gives the actual data size; finally, a "dd" command generates the real data object.
• The number of data objects is passed to the Perl script on the command line.

The intent of the workload design was to simulate a real-world data size distribution [5].

I generated two data sets and conducted testing on a single-node multicore machine and on a dual-node multicore cluster. The 500-data-set has five hundred data objects and was applied to the single-node multicore machine. The 4000-data-set has four thousand data objects and was applied to the dual-node multicore cluster.

The data size distributions are shown in Figure 6A and Figure 6B. The 500-data-set represents a small data set and the 4000-data-set represents a larger data set. Due to the hardware limitations of the test-bed setup, the total testing time and total data size were taken into consideration in the workload design.


Figure 5: Perl script used to generate data set for testing

#!/usr/bin/perl
# Generate a test data set: argument 1 = number of data objects,
# argument 2 = filename prefix, argument 3 = filename suffix.
$name = "data";
printf "argument 1 $ARGV[0]\n";
$numData = $ARGV[0];
printf "argument 2 $ARGV[1]\n";
$prefix = $ARGV[1];
printf "argument 3 $ARGV[2]\n";
$suffix = $ARGV[2];
# Upper bounds (MB) of the candidate size ranges; duplicated entries
# make the smaller ranges more likely, shaping the size distribution.
@scopeList = (10, 10, 20, 20, 40, 40, 50, 100, 400, 400, 1000, 4000);
for ($i = 1; $i <= $numData; $i++) {
    # first random number: pick a size range from scopeList
    $scope = int(rand(12));
    printf "scope = $scope\n";
    printf "scopeList $scopeList[$scope]\n";
    # second random number: actual file size (MB) within that range
    $fileSize = int(rand($scopeList[$scope]));
    printf "fileSize = $fileSize\n";
    if ($fileSize == 0) { $fileSize = 1; }
    $fileName = $prefix . "-" . $name . "-" . $i . "." . $suffix;
    printf "filename=$fileName\n";
    # use dd to create the data object
    $cmd = "time dd if=/dev/zero of=" . $fileName . " bs=1M count=" . $fileSize;
    printf "cmd=$cmd\n";
    system($cmd);
}
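The script is driven from the command line: argument 1 is the number of data objects, argument 2 a filename prefix, and argument 3 a filename suffix. A run such as the following (script name illustrative) creates 500 objects named node1-data-1.dat through node1-data-500.dat:

    perl genData.pl 500 node1 dat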


Figure 6A: 500-data-set data size distribution (unit: MB)

Figure 6B: 4000-data-set data size distribution (unit: MB)


5.2 500-data-set Testing on a Single-Node Multicore Machine

In this test, I applied the 500-data-set on a single multicore machine and conducted five different test cases. Table 1 shows the MPI processes launched on the single multicore computing system. The total data size of the 500-data-set is 104.06GB; the average data size is around 208.12MB; the smallest data size was 1MB and the largest was 3763MB.

# MPI processes launched    # of parZfec processes
4                           1
5                           2
7                           4
11                          8
19                          16

(Refer to the attached source code for the role of each rank.)

Table 1: 500-data-set MPI process information

Figure 7A shows the processing time comparison. I used the processing time of the 16-parZfec case as the base and normalized the processing times; Figure 7B shows the normalized comparison. In conclusion, the total processing time is reduced by more than 6X when 16 parZfec encoding processes are used (see the calculation after Figure 7A). This is strong evidence of the benefit of task parallelism on a multicore computing system.

Figure 7A: 500-data-set processing time comparison using (1, 2, 4, 8, 16) parZfec processes. Processing times (seconds): 1260.4, 630.5, 335, 190.1, 174.9.
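From the measured times in Figure 7A, the speedup of 16 concurrent parZfec processes over a single process is

\[ S = \frac{T_1}{T_{16}} = \frac{1260.4}{174.9} \approx 7.2 \]

which matches the normalized value of 7.21 shown in Figure 7B.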


Figure 7B: 500-data-set normalized processing time comparison using (1, 2, 4, 8, 16) parZfec processes; the base case is 16 parZfec processes.

It is also shown that using parallel encoding processes can significantly improve the encoding bandwidth. The encoding bandwidth comparison is shown in Figure 8. The results clearly demonstrate the benefit of parallel programming on multicore computing systems: I was able to cut the overall processing time and gain significant encoding bandwidth when more parZfec processes were used concurrently to handle a large data set.

Figure 7B normalized processing times: 7.21, 3.60, 1.92, 1.09, 1.00 for (1, 2, 4, 8, 16) parZfec processes.

Figure 8: 500-data-set accumulated encoding bandwidth comparison using (1, 2, 4, 8, 16) parZfec processes. Encoding bandwidths (MB/sec): 82.56, 165.05, 310.68, 546.29, 589.88.


5.3 4000-data-set Testing on a Two-Compute-Node Multicore Cluster

I then conducted a cluster test using the 4000-data-set, launching MPI processes on the two compute nodes of the three-node test bed. Table 2 shows the number of MPI processes launched on this two-node cluster. The total data size of the 4000-data-set was 828.28GB; the average data size was around 207.39MB; the smallest data size was 1MB and the largest was 3996MB. Figures 9A and 9B show that using two nodes and 32 parZfec processes reduces the overall processing time by roughly 16X (a 1500% improvement) compared to using two nodes and 2 parZfec processes. Figure 10 shows the bandwidth scaling using two compute nodes. A near linear-scaling result is observed as more parZfec processes are used in data encoding.

It is also shown that we can gain more accumulated encoding bandwidth on a multi-node cluster. Figure 11 summarizes the encoding bandwidth from the 500-data-set and the 4000-data-set. When I doubled the number of compute nodes used for parallel erasure coding, I gained 82% to 94% more encoding bandwidth (see the calculation after Table 2). This result provides evidence that we can obtain almost linear scaling of encoding bandwidth on multi-node cluster-based computing systems. Hence, parZfec can handle very large data sets efficiently and effectively.

Total MPI processes launched    Total parZfec processes    Node-1 parZfec    Node-2 parZfec
5                               2                          1                 1
7                               4                          2                 2
11                              8                          4                 4
19                              16                         8                 8
35                              32                         16                16

(I used the hostlist to arrange the MPI process distribution across the two-node cluster.)

Table 2: 4000-data-set testing MPI process information
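As a check on the 82% to 94% figure quoted above, the two-node gain follows directly from the one-node (Figure 8) and two-node (Figure 10) bandwidths at the same per-node process count, for example

\[ \frac{150.78}{82.56} \approx 1.83 \quad (1\ \text{parZfec/node}), \qquad \frac{1146.30}{589.88} \approx 1.94 \quad (16\ \text{parZfec/node}) \]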

Figure 9A: 4000-data-set processing time comparison using (1, 2, 4, 8, 16) parZfec processes per multicore node, i.e. (2, 4, 8, 16, 32) total parZfec processes on two nodes. Processing times (seconds): 11604.6, 6035.6, 3278.9, 1578.4, 723.7.


Figure 9B: 4000-data-set normalized processing time comparison using (1, 2, 4, 8, 16) parZfec processes per multicore node, with (2, 4, 8, 16, 32) total parZfec processes on two nodes; the base is 16 parZfec/node. Normalized times: 16.04, 8.34, 4.53, 2.18, 1.00.

Figure 10: 4000-data-set accumulated encoding bandwidth comparison using (1, 2, 4, 8, 16) parZfec processes per multicore node. Bandwidths (MB/sec): 150.78, 317.34, 600.78, 1008.92, 1146.30.


Figure 11: Total accumulated bandwidth comparison using one compute node and two compute nodes, for (1, 2, 4, 8, 16) parZfec processes per node (MB/sec):
One node: 82.56, 165.05, 310.68, 546.29, 589.88
Two nodes: 150.78, 317.34, 600.78, 1008.92, 1146.30

6 Conclusion and Future Work

In this project, I studied parallel programming on multicore computing systems and investigated potential parallel processing capabilities using data encoding procedures. I implemented parallel erasure coding software called parZfec. I then conducted various workload tests on a single compute node and on a multi-node cluster, and demonstrated the advantage of using parallel processing on multicore computing systems. The parZfec software showed significant reductions in total encoding time. Furthermore, my results show that we can obtain almost linear-scaling bandwidth on single multicore machines and on multicore cluster machines. I also demonstrated that the data coding problem on storage systems can be solved efficiently and effectively by task parallelism on multicore computing systems.

I plan to continue to enhance the parZfec software with the following features and present more results in the future:
- Implement single very large shared-object encoding with 1-to-N parallel I/O chunking processes
- Test parZfec with parallel file systems and cloud object storage systems
- Use other open-source erasure code libraries, such as Jerasure [11] and Intel's Intelligent Storage Acceleration Library (ISA-L) [12]
- Combine task parallelism and data parallelism (SIMD: single instruction, multiple data) in the data coding feature and enhance the parZfec software
- Conduct power consumption tests on parallel erasure coding processes on multicore computing systems and apply energy efficiency models for parallel programming on multicore computing systems



7 Acknowledgment

I would like to thank the New Mexico Supercomputing Challenge program for its continued support. I would also like to thank the interim review members for their suggestions. Finally, I would like to thank my high school Computer Science teacher, Adam Drew, for his continuous support and sponsorship of my Supercomputing Challenge project.

Reference List

1. Keith D. Cooper, "The Multicore Transformation: Making Effective Use of Multicore Systems: A Software Perspective", ACM Ubiquity, September 2014.
2. "Hardware RAID vs. Software RAID: Which Implementation is Best for my Application?", Adaptec Storage Solutions white paper.
3. Backblaze, "Reed-Solomon Code": https://www.backblaze.com/blog/reed-solomon/
4. Reed-Solomon error correction: https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
5. Nasuni, "The State of Cloud Storage", 2015: http://www6.nasuni.com/rs/nasuni/images/Nasuni-2015-State-of-Cloud-Storage-Report.pdf
6. "MPI Programming Tutorial", Lawrence Livermore National Laboratory: https://computing.llnl.gov/tutorials/mpi/
7. William Gropp and Ewing Lusk, "An Introduction to MPI – Parallel Programming with the Message Passing Interface", Argonne National Laboratory: http://www.mcs.anl.gov/research/projects/mpi/tutorial/mpiintro/ppframe.htm
8. Thomas Rauber and Gudula Rünger, "Parallel Programming for Multicore and Cluster Systems".
9. Yuval Cassuto, "Coding Techniques for Data-Storage Systems", PhD thesis, California Institute of Technology, 2008.
10. zfec – a fast erasure code usable from the command line, C, Python, or Haskell: https://pypi.python.org/pypi/zfec
11. James Plank, "Erasure Codes for Storage Systems: A Brief Primer", USENIX ;login:, 2013.
12. Intel Intelligent Storage Acceleration Library (Intel ISA-L): https://software.intel.com/en-us/storage/ISA-L


Appendix A: parZfec MPI Source Code

Part-1: C Header File

/* parZfec.h */
#define MASTER 0
#define PRODUCER 1
#define EXPLORER 2
#define PDM 3
#define MAX_NUM_ENTRY 10000
#define MIN_NUM_ENTRY 10000
#define MAX_NUM_PDM_ENTRY 100
#define MIN_NUM_PDM_ENTRY 3
#define NUM_GATEWAY 9
#define NUM_CONNECTOR 9
#define FILESIZE_FROM 9437184    // 9 MiB
#define FILESIZE_TO 188743680    // 180 MiB
#define DEFAULT_FILESIZE 37748736 // 36 MiB
#define DEFAULT_NUM__ENTRY 1000
#define OneMiB 1048576
#define OneMB 1000000
#define DEFAULT_DATA_SIZE 90

// entry status
#define UNASSIGNED 0
#define ASSIGNED 1
#define UNDERUPLOADED 0
#define UPLOADED 1
#define FAILEDUPLOADED 2

#define TRUE 1
#define FALSE 0
#define YES 1
#define NO 0
#define ON 1
#define OFF 0
#define UP 1
#define DOWN -1

// message tags
#define INITTAG 0
#define WORKTAG 1
#define WORKDONETAG 2
#define ENDTAG 3
#define INITWORK -1
#define MAX_NUM_PROCESS 256

typedef struct workloadInfo_s {
    long objSize;
    char objName[256];
    char destDir[256];
    char zfecCmd[256];
    int numShared;
    int numNeeded;
    int K;
    int M;
    int size;
    int workloadID;
    int assignedStatus;  // unassigned or assigned
    int uploadStatus;
    int myRank;
    clock_t startTime;   // from time()
    clock_t endTime;     // from time()
    clock_t deltaTime;   // delta time
    double cpu_time_used;       // usec
    double processedBandwidth;  // MB/sec
} workloadInfo;

struct processedWorkloadInfo {
    int myRank;
    int workloadID;
    time_t startUploadTime; // from time()
    time_t endUploadTime;   // from time()
    time_t uploadTime;      // delta time
    clock_t startTime;
    clock_t finishTime;
    double cpu_time_used;   // usec
    double uploadBandwidth; // MB/sec
};

struct fileInfo {
    char filename[256];
    char destDir[256];
    char zfecCmd[256];
    int workloadID;
    ino_t inode;
    int size;
    int rank;
    int numShared;
    int numNeeded;
    int K;
    int M;
    int totalEncodingSize;
    clock_t startTime;
    clock_t finishTime;
    clock_t deltaTime;
    double processTime;        // usec
    double processedBandwidth; // MB/sec
};

struct statisticInfo {
    int rank;
    clock_t startTime;
    int startTimeFlag;
    clock_t finishTime;
    clock_t totalProcessTime;
    clock_t maxFinishTime;
    clock_t idleTime;
    int numData;
    long totalSize;
    long totalEncodingSize;
    double averageDataSize;
    double totalWallTime;
    double totalBandwidth;
    double averageBandwidth;
    double efficiency;
    double idlePercentage;
    double utilization;
    int processStartTimelist[MAX_NUM_ENTRY];
    int processFinishTimelist[MAX_NUM_ENTRY];
};


Part-2: C Source Code

// // // parZFEC - a parallel and concurrent data encoding method using // // // #define _XOPEN_SOURCE 1 /* Required under GLIBC for nftw() */ #define _XOPEN_SOURCE_EXTENDED 1 /* Same */ #include <sys/types.h> #include <sys/stat.h> #include <stdio.h> #include <stdlib.h> #include <errno.h> #include <getopt.h> #include <ftw.h> /* gets <sys/types.h> and <sys/stat.h> for us */ #include <limits.h> /* for PATH_MAX */ #include <unistd.h> /* for getdtablesize(), getcwd() declarations */ #include <curl/curl.h> #include <fcntl.h> #include <string.h> #include <mpi.h> #include <curl/curl.h> #include <netdb.h> #include <time.h> #include <sys/times.h> #include <dirent.h> extern int h_errno; #include "parZfec.h" char *output_fname = NULL; char *config_fname = NULL; char *sourceFile = NULL; // example /memfs/36MB001 int numEntry = 0; int numDataUsed = 0; int numPDM = 1; int numGateway = 1; int allENTRYHANDLED = FALSE;

Page 22: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

int overWriteFlag = NO; int fileSize = 37748736; int numShared = 8; int numNeeded = 5; int destDirFlag = NO; int prefixFlag = NO; int suffixFlag = NO; int helpFlag = NO; int numProcs; char *overWrite = (char *) ""; char *destDir = (char *) ""; char *srcDir = (char *) ""; char *prefix = (char *) ""; char *suffix = (char *) ""; workloadInfo workloadList[MAX_NUM_ENTRY]; struct dirent *fileList[MAX_NUM_ENTRY]; struct fileInfo fileProcessedList[MAX_NUM_ENTRY]; char *sourceFile; char zfecCmd[256]; clock_t startTimeProcess; // from times() clock_t finishTimeProcess; // from times() double cpu_time_used; // usec struct tms stimeProcess, etimeProcess; //from times() time_t startUploadTimeProcess;// from time() time_t endUploadTimeProcess; // from time() time_t uploadTimeProcess; // delta time struct tm *loctimeProcessStart; struct tm *loctimeProcessFinish; int numdataassigned = 0; int master(int, int, int); int producer(int, int); int explorer(int, int); int parZfecAgent(int, int, int *); int processInputArgument(int , char **); int workloadGenerator(int , struct dirent **fileList); int initData(int ,int , char *,char *,struct dirent **,int ,int , int * ); int zfecDecoder(char *,int ,int, char **); int zfecEncoder(char *,char *);

Page 23: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

int finalReport(int , int); /* message buffer data type */ const int numItems=17; int blocklengths[17] = {1,256,256,256,1,1,1,1,1,1,1,1,1,1,1,1,1}; MPI_Datatype types[17] = {MPI_LONG, MPI_CHAR, MPI_CHAR, MPI_CHAR, MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_LONG_LONG, MPI_LONG_LONG, MPI_LONG_LONG, MPI_DOUBLE, MPI_DOUBLE }; MPI_Datatype mpi_zfec_type; MPI_Aint offsets[17]; struct statisticInfo statisticList[MAX_NUM_PROCESS]; // usage // mpirun -np numProcs --hostfile hostlist parZFEC -p prefix -s suffix -n numFiles -m numShared -k numNeeded -f -s srcDir -d destDir // int main(argc, argv) int argc; char *argv[]; { int myRank; int numDataAssigned= 0; int rc = 0; char hostname[256];

Page 24: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

// Initiaize MPI MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD,&numProcs); // number of process used in this job MPI_Comm_rank(MPI_COMM_WORLD, &myRank); printf("num process = %d\n", numProcs); printf("My rank = %d\n", myRank); /* error handler */ MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN); memset(hostname, ' ', sizeof(char)*256); gethostname(hostname, sizeof(char)*256); printf("hostname = %s\n", hostname); startUploadTimeProcess = time(NULL); // currentl local time startTimeProcess = times(&stimeProcess); loctimeProcessStart = localtime (&startUploadTimeProcess); //define message bufder data type offsets[0] = offsetof(workloadInfo, objSize); offsets[1] = offsetof(workloadInfo, objName); offsets[2] = offsetof(workloadInfo, destDir); offsets[3] = offsetof(workloadInfo, zfecCmd); offsets[4] = offsetof(workloadInfo, numShared); offsets[5] = offsetof(workloadInfo, numNeeded); offsets[6] = offsetof(workloadInfo, K); offsets[7] = offsetof(workloadInfo, M); offsets[8] = offsetof(workloadInfo, workloadID); offsets[9] = offsetof(workloadInfo, assignedStatus); offsets[10] = offsetof(workloadInfo, uploadStatus); offsets[11] = offsetof(workloadInfo, myRank); offsets[12] = offsetof(workloadInfo, startTime); offsets[13] = offsetof(workloadInfo, endTime); offsets[14] = offsetof(workloadInfo, deltaTime); offsets[15] = offsetof(workloadInfo, cpu_time_used); offsets[16] = offsetof(workloadInfo, processedBandwidth); MPI_Type_create_struct(numItems, blocklengths, offsets, types, &mpi_zfec_type); MPI_Type_commit(&mpi_zfec_type); switch (myRank) {

Page 25: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

case MASTER: //rank 0 // process argument rc = processInputArgument(argc , argv); if(helpFlag == YES) { printf("Please try it later\n"); return(0); } // initialize Data printf("Master: myRank=%d srcDir[%s],destDir[%s],M=%d,k=%d\\n", myRank,srcDir,destDir,numShared,numNeeded); rc = initData(myRank, numProcs, srcDir, destDir, fileList, numShared, numNeeded, &numEntry); printf("Master: myRank=%d numProcs=%d, numEntry=%d\n", myRank, numProcs, numEntry); rc = master(myRank, numProcs, numEntry); printf("Main: call finalReport, numProc=%d, numFile=%d\n", numProcs, numEntry); rc = finalReport(numProcs, numEntry); break; case PRODUCER: // rank 1 printf("Poducer: myRank=%d\n", myRank); rc = producer(myRank, numProcs); break; case EXPLORER: // rank 2 printf("Explorer: myRank=%d\n", myRank); rc = explorer(myRank, numProcs); break; default: // rank 3 and ramining printf("zfec agent START : myRank=%d, enter parZfecAgent \n", myRank); rc = parZfecAgent(myRank, numProcs, &numDataAssigned); printf("zfec agent Finished: myRank=%d, enter parZfecAgent, numDataAssigned=[%d] [%d]\n", myRank, numDataAssigned, numdataassigned); break; } // generate report // finishTimeProcess = times(&etimeProcess); cpu_time_used = ((double) (finishTimeProcess - startTimeProcess)) / CLOCKS_PER_SEC;

Page 26: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

MPI_Type_free(&mpi_zfec_type); MPI_Finalize(); /* cleanup MPI */ return(rc); } int processInputArgument(int argc , char *argv[]) { int opt = 0; int rc = 0; int M = 8; int K = 2; int forceFlag = 0; char *prefix = "" ; char *suffix = "" ; int checkARGV = 0; int prefixFlag = 0; int suffixFlag = 0; char pstr[16]; char sstr[16]; char fstr[16]; char str[64]; /* -n number of data entry -w fixed data size from input data -o redirect output to a file -d dest Directory Path -s source Directory Path -m total encoding shared data chunk will be created : default is 8 -k the number of shared data chunk required to reconstruct the original data: default is 3 */ overWrite = NO; numShared = 8; numNeeded = 5; destDirFlag = NO; memset(str, ' ', 64*sizeof(char)); memset(pstr, ' ', 16*sizeof(char)); memset(sstr, ' ', 16*sizeof(char)); memset(fstr, ' ', 16*sizeof(char)); memset(zfecCmd, ' ', 256*sizeof(char));

Page 27: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

while ((opt = getopt(argc, argv, "h:n:o:d:m:k:p:s:f")) != EOF) { switch(opt) { case 'n': numEntry = (int ) atoi(optarg); printf("\n number of entry =%d\n", numEntry); break; case 'o': // srcDir = optarg; printf("\nsourceDir = %s\n", srcDir); break; case 'd': // destination of encoded share3d file objects destDir = optarg; destDirFlag = YES; printf("\ndestDir = %s\n", destDir); break; case 'm': numShared = atoi(optarg); printf("\nNumber of shared chunk created =%d\n", numShared); break; case 'k': numNeeded = atoi(optarg); printf("\nNumber of data chunk required to recover the original file =%d\n", numNeeded); break; case 'p': prefix = optarg; printf("\nPrefix=%s\n", prefix); prefixFlag = 1; sprintf(pstr, " -p %s ", prefix); strcat(str, pstr); break; case 's': suffix = optarg; printf("\nSuffix=%s\n", suffix); suffixFlag = 1; sprintf(sstr, " -s %s ", suffix); strcat(str, sstr); break; case 'f': forceFlag = 1; sprintf(fstr, " -f "); strcat(str, fstr); break; case 'h': printf("USAGE\n");

Page 28: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

printf("mpirun -np numProcs --hostfile hostlist parZFEC -p prefix -s suffix -n numFiles -m numShared -k numNeeded -f -s srcDir -d destDir \n"); printf("szfec -h : how to run this szfec\n"); helpFlag = YES; break; default: rc = -1; printf("\nunknow argv =%s\n", optarg); break; } } if( (prefixFlag == 1) || (suffixFlag == 1) || (forceFlag == 1)) { sprintf(zfecCmd, "zfec --output-dir %s %s -m %d -k %d ", destDir, str, numShared, numNeeded); } else { sprintf(zfecCmd, "zfec --output-dir %s -m %d -k %d ", destDir, numShared, numNeeded); } printf("processsInput: zfecCmd=%s\n", zfecCmd); return (rc); } /* int myRank my process Rank char *srcDir source directory char *destDir destination directory struct dirent **fileList file list scanned from soruce directory int M number of shared chunks int K number of code chunk */ int initData(int myRank,int numProc, char *srcDir,char *destDir,struct dirent **fileList,int numShared,int numNeeded, int *numData) { int i; int rc =0; DIR *dirDest; DIR *dirSrc; struct dirent *dp; struct stat dirStat; struct stat fileStat;

Page 29: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

int dirFlag = 0; char *file_name; char filepath[256]; char real_file_name[256]; int numFile =0; int fileInode; char *localSrcDir = srcDir; char *localDestDir = destDir; int K; int M; K = numNeeded; M = numShared = numNeeded; printf("\ninitData: srcDir[%s], destDir[%s], M=%d, K=%d\n", srcDir, destDir, M,K); // explore source directory and create fileProcessedList dirSrc = opendir(srcDir); // use readir to get the data set i = 0; while ((dp=readdir(dirSrc)) != NULL) { printf("initData Entry = %d\n", i); file_name = dp->d_name; fileInode = dp->d_ino; printf("file inode %ld\n", fileInode); printf("filename %s\n", file_name); memset(filepath, '\0', 256); strcat(filepath, srcDir); strcat(filepath, file_name); printf("real filename=[%s]\n", filepath); printf("real srcDir=[%s]\n", srcDir); if (strcmp(file_name, ".") == 0) { printf("filename=.\n"); } else if (strcmp(file_name, "..") == 0) { printf("filename=..\n"); } else { fileList[i] = dp;

Page 30: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

rc = stat(filepath, &fileStat); memset(fileProcessedList[i].filename,'\0', 256 * sizeof(char)); memset(fileProcessedList[i].destDir, '\0', 256 * sizeof(char)); memset(fileProcessedList[i].zfecCmd, '\0', 256 * sizeof(char)); strcpy(fileProcessedList[i].filename, filepath); strcpy(fileProcessedList[i].destDir, destDir); strcpy(fileProcessedList[i].zfecCmd, zfecCmd); printf("\nfileProcessedList[%d].destDir[%s],destDir[%s]\n",i, fileProcessedList[i].destDir,destDir); printf("\nfileProcessedList[%d].zfecCmd=[%s]\n", i, fileProcessedList[i].zfecCmd); printf("\nfileProcessedList[%d].size =[%d][%d]\n", i, fileProcessedList[i].size, fileStat.st_size); fileProcessedList[i].inode = fileInode; fileProcessedList[i].size = fileStat.st_size; fileProcessedList[i].rank = 0; fileProcessedList[i].numShared = numShared; fileProcessedList[i].numNeeded = numNeeded; fileProcessedList[i].K = numNeeded; fileProcessedList[i].M = numShared - numNeeded; fileProcessedList[i].totalEncodingSize = 0; fileProcessedList[i].processedBandwidth = 0.0; fileProcessedList[i].startTime = 0; fileProcessedList[i].finishTime = 0; fileProcessedList[i].processTime = 0.0; numFile++; i++; } } closedir(dirSrc); numEntry = numFile; numData = &numFile; printf("workloadGenerator: number of entry = [%d][%d][%d]\n", numFile, numEntry, *numData); for(i=0; i < numEntry; i++) { memset(workloadList[i].objName,'\0', 256 * sizeof(char)); memset(workloadList[i].destDir,'\0', 256 * sizeof(char)); memset(workloadList[i].zfecCmd,'\0', 256 * sizeof(char)); strcpy(workloadList[i].objName,fileProcessedList[i].filename);

Page 31: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

strcpy(workloadList[i].destDir,fileProcessedList[i].destDir); strcpy(workloadList[i].zfecCmd, fileProcessedList[i].zfecCmd); workloadList[i].objSize = fileProcessedList[i].size; workloadList[i].numShared = numShared; workloadList[i].numNeeded = numNeeded ; workloadList[i].K = numNeeded; workloadList[i].M = numShared - numNeeded; workloadList[i].workloadID = i; workloadList[i].assignedStatus = UNASSIGNED; workloadList[i].uploadStatus = UNDERUPLOADED; workloadList[i].myRank = 0; workloadList[i].startTime = 0; workloadList[i].endTime = 0; workloadList[i].deltaTime = 0; workloadList[i].processedBandwidth = 0.0; } for(i=0; i < numEntry; i++) { printf("fileProcessedList[%d].filename = %s\n", i, fileProcessedList[i].filename); printf("fileProcessedList[%d].destDir = %s\n", i, fileProcessedList[i].destDir); printf("fileProcessedList[%d].inode = %ld\n",i, fileProcessedList[i].inode); printf("fileProcessedList[%d].size = %d\n", i, fileProcessedList[i].size); printf("fileProcessedList[%d].rank = %d\n", i, fileProcessedList[i].rank); printf("fileProcessedList[%d].K = %d\n", i, fileProcessedList[i].K); printf("fileProcessedList[%d].M = %d\n", i, fileProcessedList[i].M); printf("workloadList[%d].objSize = %d\n", i, workloadList[i].objSize); printf("workloadList[%d].objName = %s\n", i, workloadList[i].objName); } for(i=3; i < numProc; i++) { statisticList[i].rank = 0; statisticList[i].startTime = 999999999 ; statisticList[i].startTimeFlag = 0 ; statisticList[i].finishTime = 0; statisticList[i].maxFinishTime = 0; statisticList[i].totalProcessTime = 0; statisticList[i].idleTime = 0; statisticList[i].numData = 0; statisticList[i].totalSize = 0; statisticList[i].totalWallTime = 0.0; statisticList[i].totalBandwidth = 0.0; statisticList[i].averageBandwidth = 0.0; statisticList[i].averageDataSize = 0.0; statisticList[i].efficiency = 0.0; statisticList[i].utilization = 0.0;

Page 32: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

} return(rc); } int finalReport(int numProc, int numFile) { int i,j; int myRank; clock_t deltaTime; clock_t idleTime; clock_t totalIdleTime; clock_t maxFinishTime = -1; int startTimeFlag = 0; long startProcessingTimeGlobal; long finishProcessingTimeGlobal; long totalProcessingPeriod; double totalDataSizeMB = 0.0; long totalDataSizeBYTE = 0; double totalProcessingTime; double overallProcessedBandwidth; double overallUtilization; printf("Final Report: numProc=%d, numFile=%d\n", numProc, numFile); maxFinishTime = 1.0; for(i=0; i < numFile; i++) { totalDataSizeBYTE = totalDataSizeBYTE + fileProcessedList[i].size; if(fileProcessedList[i].startTime == fileProcessedList[i].finishTime) { fileProcessedList[i].deltaTime = 1; printf("finalReport: i=%d delta is too smalle, set it to 1\n",i); } else fileProcessedList[i].deltaTime = fileProcessedList[i].finishTime - fileProcessedList[i].startTime; fileProcessedList[i].processTime = (double ) fileProcessedList[i].deltaTime/100.0; if(fileProcessedList[i].processTime <= 0.0) {

Page 33: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

printf("finalReport: i=%d, startTime=%d, finishTime=%d\n", fileProcessedList[i].startTime, fileProcessedList[i].finishTime); } else fileProcessedList[i].processedBandwidth = (double ) fileProcessedList[i].size / 1000000.0/fileProcessedList[i].processTime; } for(i=0; i < numFile; i++) { printf("%d %s %s %ld %ld %d %d %d %ld %ld %ld %f %f \n", i, fileProcessedList[i].filename, fileProcessedList[i].destDir, fileProcessedList[i].inode, fileProcessedList[i].size, fileProcessedList[i].rank, fileProcessedList[i].K, fileProcessedList[i].M, fileProcessedList[i].startTime, fileProcessedList[i].finishTime, fileProcessedList[i].deltaTime, fileProcessedList[i].processTime, fileProcessedList[i].processedBandwidth); } startProcessingTimeGlobal = fileProcessedList[0].startTime; for(i=0; i < numFile; i++) { myRank = fileProcessedList[i].rank; statisticList[myRank].rank = myRank; statisticList[myRank].totalProcessTime = statisticList[myRank].totalProcessTime + fileProcessedList[i].deltaTime; statisticList[myRank].totalBandwidth = statisticList[myRank].totalBandwidth + fileProcessedList[i].processedBandwidth; // set the first processed start time for this Rank if(statisticList[myRank].startTimeFlag == NO){ statisticList[myRank].startTime = fileProcessedList[i].startTime; statisticList[myRank].startTimeFlag = YES; } statisticList[myRank].finishTime = fileProcessedList[i].finishTime; statisticList[myRank].processStartTimelist[statisticList[myRank].numData] = fileProcessedList[i].startTime;

Page 34: A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing Systems · 2016-04-06 · A Concurrent and Energy Efficient Approach of Data Coding on Multicore Computing

statisticList[myRank].processFinishTimelist[statisticList[myRank].numData] = fileProcessedList[i].finishTime; statisticList[myRank].numData = statisticList[myRank].numData + 1; statisticList[myRank].totalSize = statisticList[myRank].totalSize + fileProcessedList[i].size; statisticList[myRank].totalWallTime = statisticList[myRank].totalWallTime + fileProcessedList[i].processTime; statisticList[myRank].totalBandwidth = statisticList[myRank].totalBandwidth + fileProcessedList[i].processedBandwidth; if(fileProcessedList[i].finishTime > maxFinishTime) maxFinishTime = fileProcessedList[i].finishTime; } finishProcessingTimeGlobal = maxFinishTime; totalProcessingPeriod = finishProcessingTimeGlobal- startProcessingTimeGlobal; totalProcessingTime = (double ) totalProcessingPeriod / 100.0; totalDataSizeMB = (double ) totalDataSizeBYTE / 1048576.0; overallProcessedBandwidth = (double ) totalDataSizeMB / totalProcessingTime; for(i=3; i < numProc; i++) { if(statisticList[i].numData > 0){ statisticList[i].averageBandwidth = statisticList[i].totalBandwidth / statisticList[i].numData; statisticList[i].averageDataSize = (double) statisticList[i].totalSize / 1048576.0 / statisticList[i].numData; } else { printf("finalReport:statisticList[%d].numData IS ZERO\n", i); } idleTime = finishProcessingTimeGlobal - startProcessingTimeGlobal; statisticList[i].idleTime = idleTime; if(totalProcessingPeriod > 0) { statisticList[i].efficiency = (double ) (statisticList[i].totalProcessTime / totalProcessingPeriod) ; } else { printf("finalReport:totalProcessingPeriod IS ZERO\n", statisticList[i].numData ); } statisticList[i].idlePercentage = 1.0 - statisticList[i].efficiency; } for(i=0; i < numProc; i++) {

    totalIdleTime = totalIdleTime + statisticList[i].idleTime
                  + (finishProcessingTimeGlobal - statisticList[i].finishTime);
}
overallUtilization = 1.0 - totalIdleTime / ((double) totalProcessingPeriod * numProc);

/*
for (i = 0; i < numProc; i++) {
    for (j = 0; j < statisticList[i].numData; j++) {
        statisticList[i].processStartTimelist[j] =
            statisticList[i].processStartTimelist[j] - startProcessingTimeGlobal;
        statisticList[i].processFinishTimelist[j] =
            statisticList[i].processFinishTimelist[j] - startProcessingTimeGlobal;
    }
}
*/

printf("startProcessingTimeGlobal  = %ld ticks\n", startProcessingTimeGlobal);
printf("finishProcessingTimeGlobal = %ld ticks\n", finishProcessingTimeGlobal);
printf("totalProcessingPeriod      = %ld ticks\n", totalProcessingPeriod);
printf("totalProcessingTime        = %f sec\n", totalProcessingTime);
printf("totalIdleTime              = %f ticks\n", totalIdleTime);
printf("totalDataSizeBYTE          = %f B\n", totalDataSizeBYTE);
printf("totalDataSizeMB            = %f MB\n", totalDataSizeMB);
printf("overallProcessedBandwidth  = %f MB/sec\n", overallProcessedBandwidth);
printf("overallUtilization         = %f\n", overallUtilization);

for (i = 3; i < numProc; i++) {
    printf("%d %ld %ld %ld %ld %ld %d %ld %f %f %f\n\n",
           statisticList[i].rank, statisticList[i].startTime,
           statisticList[i].finishTime, maxFinishTime,
           statisticList[i].totalProcessTime, statisticList[i].idleTime,
           statisticList[i].numData, statisticList[i].totalSize,
           statisticList[i].totalWallTime, statisticList[i].averageBandwidth,
           statisticList[i].averageDataSize);
}
}
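To make the utilization statistic computed above concrete, here is a small worked example with illustrative numbers rather than measured ones: suppose numProc = 5, totalProcessingPeriod = 1000 ticks, and the per-rank idle times sum to totalIdleTime = 1500 ticks. Then

    overallUtilization = 1.0 - 1500 / (1000 * 5) = 1.0 - 0.30 = 0.70

meaning the processes spent 70% of the available wall-clock time doing useful encoding work. The per-rank efficiency is computed the same way, as that rank's totalProcessTime divided by totalProcessingPeriod.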

int master(int myRank, int numProc, int numData)
{
    int ntasks, rank;
    int exceptRank;
    int receivedRank;
    int workAssigned;
    int workDone;
    static int work = -1;
    static int workloadID;
    static int wID;
    workloadInfo workloadData;
    workloadInfo workloadReceived;
    workloadInfo workloadSend;
    int rc = 0;
    MPI_Status status;
    int error;
    static int curEntry = -1;
    int i;
    struct dirent **namelist;

    // Communication protocol between: Master, Producer, Explorer, and parZfecAgent
    //
    // [1] Master ----> work ---> parZfecAgent moves item(work) to destination
    //     send INITWORK to all processes
    //     while(1) {
    //         receive any incoming message
    //         switch (status.MPI_SOURCE) {
    //         PRODUCER:
    //             send reply
    //         EXPLORER:
    //             send reply
    //         PDM:
    //             pdmRank = status.MPI_SOURCE
    //             get next work item
    //             send next work item to pdmRank
    //         }
    //
    // [2] parZfecAgent ----> work
    //
    //     if(tag == WORKTAG){
    //         if(work == INITWORK)
    //             send a request to get the next job assignment from Master
    //         else
    //             call curl-upload item(work)

    //             record upload result (start time, end time, size of data uploaded)
    //             send a request to get the next job assignment from Master
    //     else if tag == ENDTAG
    //         return to the main program and finish this MPI process
    //
    // [3] producer
    //
    // [4] explorer

    // STEP 1: send an initial message to every other MPI process
    //         (the producer and explorer consume theirs and then exit)
    for (rank = 1; rank < numProc; ++rank) {
        work = INITWORK;
        workloadSend.workloadID = INITWORK;
        error = MPI_Send(&workloadSend,    /* message buffer */
                         1,                /* one data item */
                         mpi_zfec_type,    /* user-defined workload datatype */
                         rank,             /* destination process rank */
                         INITTAG,          /* user-chosen message tag */
                         MPI_COMM_WORLD);  /* always use this */
        if (error != MPI_SUCCESS)
            printf("MPI_Send: Master Process %d Failed on sending an initial message to process %d, rc=%d\n",
                   myRank, rank, error);
        else
            printf("MPI_Send: Master sending an initial message to process %d\n", rank);
    }

    /* STEP 2:
     * Receive a result from any process and dispatch a new work request
     * until all workloads have been assigned and processed.
     */
    while (allENTRYHANDLED == FALSE) {
        // assign the next workload entry ID to this PDM process
        work = curEntry++;
        printf("****************work=%d\n", work);

        if (curEntry > numData) {
            printf("Master: All %d entries are handled, final work entry %d is reached\n",
                   numData, work);
            allENTRYHANDLED = TRUE;
            break;
        }
        else {
            printf("Master: Next dataEntry %d\n", work);
            error = MPI_Recv(&workloadReceived,  /* message buffer */
                             1,                  /* one data item */
                             mpi_zfec_type,      /* user-defined workload datatype */
                             MPI_ANY_SOURCE,     /* receive from any sender */
                             MPI_ANY_TAG,        /* receive any type of message */
                             MPI_COMM_WORLD,     /* always use this */
                             &status);           /* info about received message */
            if (error != MPI_SUCCESS)
                printf("MPI_Recv: Master Process %d Failed on receiving a message, rc=%d\n",
                       myRank, error);
            else {
                receivedRank = status.MPI_SOURCE;
                printf("MPI_Recv: Master Received a message from process %d, TAG=%d\n",
                       receivedRank, status.MPI_TAG);
                workloadID = workloadReceived.workloadID;
            }

            switch (receivedRank) {
            case PRODUCER:
                printf("Should not receive a message from producer\n");
                break;
            case EXPLORER:
                printf("Should not receive a message from explorer\n");
                break;
            default:
                // from one of the PDM (parZfecAgent) processes
                if (status.MPI_TAG == WORKDONETAG || status.MPI_TAG == WORKTAG) {
                    if (status.MPI_TAG == WORKDONETAG) {
                        wID = workloadReceived.workloadID;
                        printf("MPI_Recv: Master Received a WORKDONETAG message from process %d, on entry %d\n",
                               status.MPI_SOURCE, wID);
                        fileProcessedList[wID].startTime  = workloadReceived.startTime;
                        fileProcessedList[wID].finishTime = workloadReceived.endTime;
                        fileProcessedList[wID].rank       = status.MPI_SOURCE;
                        printf("MASTER received one processed file from myRank=[%d], workloadID=%d, processedBandwidth=%f\n",

                               status.MPI_SOURCE, wID, fileProcessedList[wID].processedBandwidth);
                    }
                    else if (status.MPI_TAG == WORKTAG)
                        printf("MPI_Recv: Master Received a WORKTAG message from process %d, on entry %d\n",
                               status.MPI_SOURCE, work);

                    // prepare the next workload
                    work = curEntry;
                    workloadList[work].uploadStatus   = UPLOADED;
                    workloadList[work].assignedStatus = ASSIGNED;
                    workloadList[work].myRank         = status.MPI_SOURCE;
                    fileProcessedList[work].rank      = status.MPI_SOURCE;
                    printf("XXXXX UPDATE fileProcessedList[%d].rank=%d\n",
                           work, fileProcessedList[work].rank);
                    printf("Master: Assign work entry %d to parZfecAgent %d\n",
                           work, status.MPI_SOURCE);

                    workloadSend.objSize = workloadList[work].objSize;
                    strcpy(workloadSend.objName, workloadList[work].objName);
                    strcpy(workloadSend.destDir, workloadList[work].destDir);
                    strcpy(workloadSend.zfecCmd, workloadList[work].zfecCmd);
                    workloadSend.numShared  = workloadList[work].numShared;
                    workloadSend.numNeeded  = workloadList[work].numNeeded;
                    workloadSend.K          = workloadList[work].K;
                    workloadSend.M          = workloadList[work].M;
                    workloadSend.workloadID = work;
                    workloadSend.myRank     = workloadList[work].myRank;

                    error = MPI_Send(&workloadSend, 1, mpi_zfec_type,
                                     status.MPI_SOURCE, WORKTAG, MPI_COMM_WORLD);
                    if (error != MPI_SUCCESS)
                        printf("MPI_Send: Master Process %d Failed on sending a WORKTAG message to process %d, rc=%d\n",
                               myRank, status.MPI_SOURCE, error);
                    else
                        printf("MPI_Send: Master send a WORKTAG message to process %d, dataEntry %d\n",
                               status.MPI_SOURCE, work);
                }

                else {
                    printf("MPI_Recv: Master Received an unexpected message from process %d, tag %d\n",
                           status.MPI_SOURCE, status.MPI_TAG);
                }
                break;
            } // switch
        }
    } // while loop

    printf("Finalize Master Proc ......\n");

    /*
     * Receive results for outstanding work requests.
     */
    printf("Receive results for outstanding work requests\n");
    int exitCount = 0;
    for (rank = 3; rank < numProc; ++rank) {
        error = MPI_Recv(&workloadReceived,
                         1,                /* one data item */
                         mpi_zfec_type,
                         MPI_ANY_SOURCE,
                         MPI_ANY_TAG,
                         MPI_COMM_WORLD,
                         &status);
        printf("Master: receive parZfecAgent %d final message\n", status.MPI_SOURCE);
        if (error != MPI_SUCCESS)
            printf("MPI_Recv: Master Process %d Failed on receiving a final message, rc=%d\n",
                   myRank, error);
        else {
            printf("MPI_Recv: Master Received a final message from process %d\n", status.MPI_SOURCE);
            printf("Master: send parZfecAgent %d final message\n", rank);
            if (status.MPI_TAG == WORKDONETAG) {
                wID = workloadReceived.workloadID;
                printf("MPI_Recv: Master Received a WORKDONETAG message from process %d, on entry %d\n",
                       status.MPI_SOURCE, wID);

                fileProcessedList[wID].startTime  = workloadReceived.startTime;
                fileProcessedList[wID].finishTime = workloadReceived.endTime;
                fileProcessedList[wID].rank       = status.MPI_SOURCE;
            }

            // send the ENDTAG to every parZfecAgent process and finalize the Master process
            error = MPI_Send(&workloadSend,
                             1,                /* one data item */
                             mpi_zfec_type,
                             rank,
                             ENDTAG,
                             MPI_COMM_WORLD);
            if (error != MPI_SUCCESS)
                printf("MPI_Send: Master Process %d Failed on sending an ENDTAG message to process %d, rc=%d\n",
                       myRank, rank, error);
            else
                printf("MPI_Send: Master sending an ENDTAG message to process %d\n", rank);
            exitCount++;
        }
    }
    printf("Master received %d closeout messages\n", exitCount);
    printf("Master exit\n");
    return (rc);
}

int producer(int myRank, int numProcs)
{
    int rc = 0;
    MPI_Status status;
    int error;
    int work;
    workloadInfo workloadReceived;
    workloadInfo workloadSend;

    error = MPI_Recv(&workloadReceived,  /* message buffer */
                     1,                  /* one data item */
                     mpi_zfec_type,      /* user-defined workload datatype */
                     MPI_ANY_SOURCE,     /* receive from any sender */
                     MPI_ANY_TAG,        /* receive any type of message */
                     MPI_COMM_WORLD,     /* always use this */
                     &status);           /* info about received message */
    if (error != MPI_SUCCESS)
        printf("MPI_Recv: Producer Process %d Failed on receiving a message from unknown process, rc=%d\n",
               myRank, error);
    else
        printf("Producer: Producer %d Received a message from process %d\n",
               myRank, status.MPI_SOURCE);

    // For now: the producer does nothing beyond consuming its initial message
    printf("Producer Process %d is finished\n", myRank);
    return (rc);
}

int explorer(int myRank, int numProcs)
{
    int rc = 0;
    MPI_Status status;
    int error;
    int work;
    workloadInfo workloadReceived;
    workloadInfo workloadSend;

    error = MPI_Recv(&workloadReceived,  /* message buffer */
                     1,                  /* one data item */
                     mpi_zfec_type,      /* user-defined workload datatype */
                     MPI_ANY_SOURCE,     /* receive from any sender */
                     MPI_ANY_TAG,        /* receive any type of message */
                     MPI_COMM_WORLD,     /* always use this */
                     &status);           /* info about received message */
    if (error != MPI_SUCCESS)
        printf("MPI_Recv: EXPLORER Process %d Failed on receiving a message from unknown process, rc=%d\n",
               myRank, error);
    else

        printf("Explorer: EXPLORER Process %d Received a message from process %d\n",
               myRank, status.MPI_SOURCE);

    // For now: the explorer does nothing beyond consuming its initial message
    printf("Explorer Process %d is finished\n", myRank);
    return (rc);
}

/*
 * parZfecAgent requests workload from the Master and waits to receive
 * workload assignments from the Master.
 */
int parZfecAgent(int myRank, int numProcs, int *numDataAssigned)
{
    int rc = 0;
    MPI_Status status;
    int error;
    int work;
    int workloadID;
    long objectSize;
    int delay;
    clock_t start, end;
    char cmd[1000];
    double uploadSpeed;
    double uploadTime;
    double cpu_time_used;
    static int numDataProcessed = 0;
    workloadInfo workloadReceived;
    workloadInfo workloadSend;
    static int destDirCreated = NO;
    char *destDirLocal;
    char *cmdstr;
    char *filename;
    struct stat dirStat;
    struct tms stime;
    struct tms etime;

    // STEP 1:
    // receive an initial message from the MASTER
    error = MPI_Recv(&workloadReceived,

                     1,              /* one data item */
                     mpi_zfec_type,
                     0,              /* receive only from the Master (rank 0) */
                     MPI_ANY_TAG,
                     MPI_COMM_WORLD,
                     &status);
    /*
     * Check the tag of the received message.
     */
    if (error != MPI_SUCCESS)
        printf("MPI_Recv: parZfecAgent Process %d Failed on receiving a message from unknown process, rc=%d\n",
               myRank, error);
    else {
        printf("parZfecAgent: Process %d Received a message from process %d\n",
               myRank, status.MPI_SOURCE);

        // STEP 2: reply to the message
        if (status.MPI_TAG == ENDTAG) {
            // end this parZfecAgent process and exit
            printf("parZfecAgent %d Received ENDTAG and returns to the main stream\n", myRank);
            return 0;
        }
        if (status.MPI_TAG == INITTAG) {
            // parZfecAgent received the first message from the Master;
            // prepare for routine processing
            printf("parZfecAgent %d received the first message from Master\n", myRank);
            printf("parZfecAgent %d send a WORKTAG message back to Master\n", myRank);
            work = -1;
            workloadSend.workloadID = -1;
            error = MPI_Send(&workloadSend,
                             1,              /* one data item */
                             mpi_zfec_type,
                             MASTER,
                             WORKTAG,
                             MPI_COMM_WORLD);
            if (error != MPI_SUCCESS)
                printf("MPI_Send: PDM Process %d Failed on sending a WORKTAG message to MASTER process, rc=%d\n",
                       myRank, error);
            else

                printf("MPI_Send: PDM Process %d sent a WORKTAG message to MASTER process\n", myRank);
        }
    }

    // STEP 3: Regular processing
    for (;;) {
        error = MPI_Recv(&workloadReceived,
                         1,              /* one data item */
                         mpi_zfec_type,
                         0,              /* receive only from the Master (rank 0) */
                         MPI_ANY_TAG,
                         MPI_COMM_WORLD,
                         &status);
        /*
         * Check the tag of the received message.
         */
        if (error != MPI_SUCCESS)
            printf("MPI_Recv: parZfecAgent Process %d Failed on receiving a message from unknown process, rc=%d\n",
                   myRank, error);
        else
            printf("parZfecAgent: parZfecAgent Process %d Received a message from process %d\n",
                   myRank, status.MPI_SOURCE);

        if (status.MPI_TAG == ENDTAG) {
            // report the number of processed workloads back to the caller
            *numDataAssigned = numDataProcessed;
            printf("YYYY parZfecAgent Process %d received an ENDTAG message from process %d\n",
                   myRank, status.MPI_SOURCE);
            printf("parZfecAgent Process %d is finished and returned to main, total processed data[%d]\n",
                   myRank, numDataProcessed);
            return 0;
        }

        if (status.MPI_TAG == WORKTAG) {
            numDataProcessed++;
            printf("PDM Process %d received a WORKTAG message from process %d on entry %d\n",
                   myRank, status.MPI_SOURCE, workloadReceived.workloadID);

            // workload information
            workloadID   = workloadReceived.workloadID;
            filename     = workloadReceived.objName;
            destDirLocal = workloadReceived.destDir;
            cmdstr       = workloadReceived.zfecCmd;
            objectSize   = workloadReceived.objSize;
            numShared    = workloadReceived.numShared;
            numNeeded    = workloadReceived.numNeeded;
            printf("parZfecAgent %d: receive WORKTAG workloadID=%d, filename=%s, destDir=%s, size=%ld, numShared=%d, numNeeded=%d\n",
                   myRank, workloadID, filename, destDirLocal, objectSize, numShared, numNeeded);
            printf("parZfecAgent: cmd=%s\n", cmdstr);

            //
            // call zfec
            //
            // check the destDir creation status:
            // it only needs to be created once
            //
            if (destDirCreated == NO) {
                rc = stat(destDirLocal, &dirStat);
                if (rc == 0 && S_ISDIR(dirStat.st_mode)) {
                    printf("destination directory already exists destDir[%s]\n", destDirLocal);
                    destDirCreated = YES;
                }
                else {
                    // the destination directory does not exist; create it
                    printf("destination directory does not exist, creating destDir[%s]\n", destDirLocal);
                    rc = mkdir(destDirLocal, S_IRWXU | S_IRWXG | S_IROTH | S_IXOTH);
                    if (rc != 0) {
                        printf("ERROR: parFEC initData - Cannot create destination directory [%s]\n",
                               destDirLocal);
                    }
                    else
                        destDirCreated = YES;
                }
            }

            start = times(&stime);
            rc = zfecEncoder(cmdstr, filename);
            end = times(&etime);

            workloadSend.workloadID     = workloadID;
            workloadSend.assignedStatus = ASSIGNED;
            workloadSend.uploadStatus   = UPLOADED;

            workloadSend.myRank    = status.MPI_SOURCE;
            workloadSend.startTime = start;
            workloadSend.endTime   = end;
            error = MPI_Send(&workloadSend,
                             1,              /* one data item */
                             mpi_zfec_type,
                             MASTER,
                             WORKDONETAG,
                             MPI_COMM_WORLD);
            if (error != MPI_SUCCESS)
                printf("MPI_Send: PDM Process %d Failed on sending a WORKDONETAG message to MASTER process, rc=%d\n",
                       myRank, error);
            else
                printf("MPI_Send: PDM Process %d sent a WORKDONETAG message to MASTER process, current processed entry %d\n",
                       myRank, workloadID);
        }
    } // for loop

    return 0;
}

int zfecEncoder(char *cmd, char *filename)
{
    char cmdstr[1024];

    printf("zfecEncoder: cmd = %s\n", cmd);
    sprintf(cmdstr, "%s %s ", cmd, filename);
    printf("zfecEncoder: cmdstr=[%s]\n", cmdstr);
    system(cmdstr);
    return (0);
}
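The command string passed to zfecEncoder() comes from the Master in the zfecCmd field and is simply prepended to the file name before the system() call. As an illustration only (the exact string is assembled elsewhere in the program; the flags below assume the standard zfec command-line tool, where -k is the number of shares needed to decode and -m is the total number of shares produced), if zfecCmd held "zfec -k 3 -m 10" and the file were data.bin, zfecEncoder would execute:

    zfec -k 3 -m 10 data.bin

which writes ten share files, any three of which suffice to reconstruct data.bin. One design note: sprintf() into the fixed 1024-byte cmdstr buffer can overflow on very long command or file names; snprintf(cmdstr, sizeof(cmdstr), "%s %s", cmd, filename) would be the safer call.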

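Every MPI_Send and MPI_Recv in the listings above transfers one element of mpi_zfec_type, the user-defined MPI datatype that describes the workloadInfo structure. As a reminder of the general pattern only (the struct below is a reduced, hypothetical two-field stand-in, not the program's actual layout), such a derived datatype is built with MPI_Type_create_struct and committed before first use:

    #include <stddef.h>   /* offsetof */
    #include <mpi.h>

    typedef struct {      /* reduced, illustrative stand-in for workloadInfo */
        int  workloadID;
        long objSize;
    } miniWorkloadInfo;

    static MPI_Datatype buildMiniZfecType(void)
    {
        MPI_Datatype newType;
        int          blockLens[2]  = { 1, 1 };
        MPI_Aint     offsets[2]    = { offsetof(miniWorkloadInfo, workloadID),
                                       offsetof(miniWorkloadInfo, objSize) };
        MPI_Datatype fieldTypes[2] = { MPI_INT, MPI_LONG };

        /* describe each field's count, byte offset, and base type to MPI */
        MPI_Type_create_struct(2, blockLens, offsets, fieldTypes, &newType);
        MPI_Type_commit(&newType);   /* must commit before use in Send/Recv */
        return newType;
    }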
Part-3: Build Script

mpicc $1.c -Wimplicit-function-declaration -lmpi -o $1
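
For reference, the script above might be used as follows; the script name build.sh and the source file name parZfec.c are illustrative assumptions, and the process count follows the rank layout used in the code (rank 0 = Master, rank 1 = Producer, rank 2 = Explorer, ranks 3 and up = parZfecAgent workers):

    sh build.sh parZfec       # compiles parZfec.c into the executable parZfec
    mpirun -np 8 ./parZfec    # 1 Master + 1 Producer + 1 Explorer + 5 agents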