GROUP – B3

Cloud workload analysis and simulation

Investigating the behavior of a cloud, focusing on its workload patterns

Jan – May 2014


Table of Contents

Highlights
    The approach
    Dataset preprocessing and analysis
    Clustering analysis
    Time series analysis
    Workload prediction
    Looking ahead
1. Objective
2. The Approach
3. Dataset preprocessing and analysis
    3.1 Preprocessing
    3.2 Analysis
4. Calculating resource usage statistics
5. Classification of users and identifying target users
6. Time series analysis
7. Workload Prediction
8. Tools and Algorithms Used
9. Issues faced and possible solutions
10. Looking ahead
Group Members
References


Highlights:

The approach
- Studied the Google trace data schema
- Studied related technical papers and summarized useful observations
- Devised an approach to analyze the cloud workload using observations from the technical papers and the Google trace data's schema

Dataset preprocessing and analysis
- Preprocessed the data to prepare it for analysis
- Visualized important statistics for the feasibility decision and computed relevant attributes
- Analyzed and visualized the main attributes and made observations

Clustering analysis
- Applied various clustering algorithms, compared the results, and chose the best clustering for user and task analysis
- Classified users primarily based on estimation ratios, and tasks based on CPU and memory usage

Time series analysis
- Identified target users and their tasks from the clustering results
- Ran the dynamic time warping (DTW) algorithm on tasks to identify patterns in their resource usage

Workload prediction
- Identified users with specific resource usage patterns
- Allocated resources for these users based on the identified usage pattern, with a threshold value

Looking ahead
- Improvements to our approach


1. Objective:

- Analyze and report on the cloud workload data based on the Google cloud trace
- Use graphical tools to visualize the data (you may need to write programs to process the data in order to feed it into visual tools)
- Study and summarize the papers regarding other people's experience in Google cloud trace analysis
- Determine the workload characteristics from your analysis of the Google cloud trace
- Try to reallocate unused resources of a user to other users who require them


2. The Approach:

Based on our study of the Google cloud trace data and the observations gathered from the technical papers, we devised the following approach to the problem:

- Analyze and visualize the data to identify the important attributes that determine a user's workload pattern, and ignore the rest of the attributes
- Calculate resource usage statistics of users to assess the feasibility of resource re-allocation
- Classify users based on their resource usage quality [1] (amount of unused resource / resource requested) using clustering analysis
- Identify target users for resource re-allocation based on the clustering analysis
- Study the workload patterns of the target users' tasks and classify the tasks based on their lengths
- Perform time series analysis on the long tasks
- Identify a recurring pattern for a user (if there is one) and associate that pattern with that user, or form clusters of tasks across all users that have similar workloads, based on the time series analysis
- Predict the usage pattern of a user if the current task's pattern matches the pattern associated with that user, or matches that of one of the clusters formed in the previous step


3. Dataset preprocessing and analysis

3.1 Preprocessing

Inconsistent and vague data was cleaned up before analysis. The task-usage table has many records for the same Job ID–task index pair, because the same task might be re-submitted or re-scheduled after a task failure. Pre-processing was therefore done to avoid reading many values for the same Job ID–task index pair.

Pre-processing: all records were grouped by Job ID–task index, and only the last occurring record among the repeated task records was kept as a single record.

Timestamps were converted into days and hours for per-day analysis.
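The group implemented this step in Java (CSV reader/writer, hash maps); purely as an illustrative sketch, the same grouping could be written with pandas, where the column names job_id, task_index, and timestamp are assumptions, not the trace's actual field names:

```python
import pandas as pd

# Toy task-usage records: two rows for the same (job_id, task_index)
# pair, as happens when a task is re-submitted after a failure.
usage = pd.DataFrame({
    "job_id":     [1, 1, 2],
    "task_index": [0, 0, 0],
    "timestamp":  [100, 200, 150],  # toy values in seconds
    "cpu_usage":  [0.02, 0.03, 0.05],
})

# Keep only the last occurring record per Job ID-task index pair.
deduped = (usage.sort_values("timestamp")
                .groupby(["job_id", "task_index"], as_index=False)
                .last())

# Convert timestamps into days and hours for per-day analysis.
deduped["day"] = deduped["timestamp"] // 86400
deduped["hour"] = (deduped["timestamp"] % 86400) // 3600
```

The `.last()` after sorting by timestamp mirrors the rule of keeping only the final record for each repeated task.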

3.2 Analysis:

The data in the cloud trace tables was visualized. Attributes that were found to be constant, or within a small range of values for most of the records, were not considered for analysis. The attributes that play a major part in shaping the user profile and the task profile were treated as important. The main attributes from each table were analyzed and visualized, and certain observations were made.

Figure 1: CPU requested per user (blue) vs CPU used per user (red)

Observation: most users over-estimate the resources they need and use less than 5% of the requested resources.

A few users under-estimate the resources and use more than three times the amount requested.


Figure 2: Memory requested per user (blue) vs memory used per user (red)



4. Calculating resource usage statistics:

As we are concerned with the re-allocation of unused resources, we should look at those users who over-estimate the resources, as observed in the previous section.

To identify those users who over-estimate their resources, a new attribute is calculated:

Estimation ratio [1] = (requested resource – used resource) / requested resource

The estimation ratio varies from 0 to 1 for over-estimating users, and is negative when more than the requested resource is used:

0 – the user has used up all of the requested resource

1 – the user has not used any of the requested resource
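As a minimal sketch (not the group's actual Java/MATLAB code), the ratio can be computed as:

```python
def estimation_ratio(requested: float, used: float) -> float:
    """(requested - used) / requested: 1 means nothing used,
    0 means everything used, negative means more than requested used."""
    if requested <= 0:
        raise ValueError("requested resource must be positive")
    return (requested - used) / requested
```

For example, estimation_ratio(1.0, 0.05) gives 0.95 — a heavy over-estimator of the kind targeted for re-allocation.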

Also from the visualizations and observations made, the following are identified as important attributes:

User: Submission rate, CPU estimation ratio, Memory estimation ratio

Task: Task length, CPU usage, Memory usage

Figure 3: CPU estimation ratio per user

Users with a negative (red) CPU estimation ratio have used more resources than requested.

Users with a CPU estimation ratio between 0.9 and 1 have used at most 10% of the requested resource.


Figure 4: Memory estimation ratio per user

Users with a negative (orange) memory estimation ratio have used more resources than requested.

Users with a memory estimation ratio between 0.9 and 1 have used at most 10% of the requested resource.


5. Classification of users and identifying target users:

The dimensions for classification are

User: Submission rate, CPU estimation ratio, Memory estimation ratio

We use the following clustering algorithms to identify the optimal number of clusters for users and tasks:

- K-means
- Expectation–Maximization (EM)
- Cascade Simple K-means
- X-means [2]

We categorize the users and tasks using these clustering algorithms with the above dimensions for users.

We compare and choose the best clustering for users and tasks.

[Figures: K-means (4 clusters) and EM clustering results]


[Figures: X-means and Cascade Simple K-means clustering results]

K-means clustering with 4 clusters was selected, as it offers good clustering of users based on the CPU and memory estimation ratios.

From the clustering results we observed:

97% of the users have estimation ratios ranging from 0.7 to 1.0; that is, 97% of the users use at most 30% of the resources they request. We targeted user Cluster 0 and Cluster 3 (more than 90% unused).

We targeted tasks that were long enough to perform efficient resource re-allocation, and performed clustering on the task lengths of these users to filter out short tasks.
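The group ran these algorithms in WEKA 3.7; purely as an illustrative sketch, a k = 4 clustering over the three user dimensions could look like this in scikit-learn (the feature values are made up, not the trace's):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical user feature matrix: one row per user with
# [submission rate, CPU estimation ratio, memory estimation ratio].
rng = np.random.default_rng(0)
users = np.column_stack([
    rng.uniform(0, 50, 200),      # submission rate (tasks/day)
    rng.uniform(-0.5, 1.0, 200),  # CPU estimation ratio
    rng.uniform(-0.5, 1.0, 200),  # memory estimation ratio
])

# k = 4 mirrors the report's chosen K-means clustering.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(users)
labels = km.labels_

# Target clusters would be those whose centroid estimation ratios
# exceed 0.9 (more than 90% of the requested resource unused).
target = [c for c in range(4)
          if km.cluster_centers_[c, 1] > 0.9 and km.cluster_centers_[c, 2] > 0.9]
```

In practice the features would be standardized first, since the submission rate is on a much larger scale than the estimation ratios and would otherwise dominate the distance.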


6. Time series analysis

To identify users' tasks with similar workloads, we ran the DTW [3] algorithm on each task of the Cluster 0 and Cluster 3 users, computing the DTW distance between every target user's task and a reference sine curve (see the Issues faced section).

Tasks with the same DTW value were clustered together; these tasks were identified as having similar workload curves.

[Figure: two tasks with the same DTW distance, showing similar workloads]

For the clusters thus formed, a reference workload curve was randomly selected from one task's workload within the cluster (due to time constraints).
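The report used MATLAB's built-in DTW; as a self-contained sketch, a textbook dynamic-programming DTW distance (not the group's implementation) looks like this:

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance
    between two 1-D sequences, with |x - y| as the local cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = DTW cost of aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]
```

Each task's usage series would be compared against the reference sine curve, and tasks with equal distances grouped together; in practice one would bucket distances with a tolerance rather than require exact equality.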


7. Workload Prediction

When a user from the targeted list of users issues a task, the task's workload is studied for a pre-determined amount of time. This time period was determined by trial and error, as the minimum time at which all reference curves are distinguishable.

During this time period, the task's workload is compared with the reference curves of all the task clusters formed in the previous step.

If the current task's workload curve has zero distance to one of the reference curves, i.e. it is similar to that reference curve, the current task is expected to behave like the reference curve, and its workload is predicted accordingly.

Resource allocation and de-allocation cannot be done dynamically because of:

- huge overhead
- delay in allocating resources

So resource allocation must happen once every pre-determined interval of time and cannot happen continuously.

Hence, for stealing the resource, the allocation and re-allocation follow a step function. Based on the predicted curve's slope, a step-up or step-down is performed. A threshold value is also set to accommodate unexpected spikes in the workload.
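As a rough sketch of this step-function allocation (the interval, threshold, and step rule here are illustrative assumptions, not the report's exact values):

```python
def step_allocation(predicted, interval=10, threshold=0.2):
    """Allocate resources as a step function over a predicted usage
    curve: once per interval, allocate the predicted peak for that
    interval plus a safety threshold for unexpected spikes."""
    allocation = []
    for start in range(0, len(predicted), interval):
        window = predicted[start:start + interval]
        step = max(window) * (1 + threshold)  # step-up/down per interval
        allocation.extend([step] * len(window))
    return allocation

usage = [0.02, 0.03, 0.05, 0.04, 0.02] * 4  # toy predicted CPU curve
alloc = step_allocation(usage, interval=5)
```

Everything between the allocated step and the originally requested amount is the "stolen" resource that can be re-allocated to other users.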

Successful prediction:

[Figure: efficient resource allocation curve – CPU usage over time (s), showing the Used, Allocated, and Req curves]

Average unused resource: 94%. Average resource stolen: 65% (Req – Allocated).


Failed prediction:

[Figure: CPU usage over time (s), showing the Allocated and Usage curves, with usage spikes exceeding the allocation]

Reason: the chart above shows a case where our algorithm failed to predict correctly, because of the random selection of the reference curve for the task clusters. Although the randomly selected reference curve generated a decent resource allocation curve, there are points at which the current task spikes and exceeds the allocated resource. The solution to this issue is discussed in the Issues faced section.


8. Tools and Algorithms Used:

- Java: for extracting the required data out of the datasets (CSV reader/writer, hash maps).
- DTW on MATLAB: implemented DTW using MATLAB's built-in function.
- WEKA 3.7: to run the clustering algorithms – K-means, EM, Cascade Simple K-means, X-means.
- Tableau 8.1: to visualize the datasets and results.
- Naïve Bayes on MATLAB, for choosing the right cluster: could not be used because the data was continuous and the algorithm needed discrete data.
- Correlation on MATLAB: dropped, since DTW was a better option for comparing two curves.


9. Issues faced and possible solutions:

MATLAB crashing: while executing DTW on each task of each user (nearly 9,000 tasks in total), MATLAB crashed; this was rectified by running the data in batches. The DTW routine in MATLAB takes two vectors as input and expands them into a full cost matrix, which increases the time complexity to a great extent.

MATLAB numeric limitation: we had problems getting MATLAB to take the user as a string data type alongside the other parameters, so we had to run Java programs to map the users of tasks to the corresponding DTW values.

Naïve Bayes algorithm: learning from the existing data and predicting the given test curve using Naïve Bayes would have given better predictions, but since the data was continuous, we could not implement this algorithm.
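For completeness: the discrete-data limitation applies to the multinomial/categorical variants of Naïve Bayes; a Gaussian Naïve Bayes handles continuous features directly. A hypothetical sketch of cluster assignment with scikit-learn (made-up features and labels, not the report's data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
# Made-up continuous task features (e.g. early-window CPU statistics)
# with known cluster labels from a previous clustering step.
X = np.vstack([rng.normal(0.1, 0.02, (50, 2)),
               rng.normal(0.5, 0.05, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Gaussian NB models each continuous feature per class directly.
clf = GaussianNB().fit(X, y)
pred = clf.predict([[0.11, 0.09], [0.48, 0.52]])
```

This is only a side note on the tooling limitation the group hit, not part of their pipeline.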

We initially considered each user's tasks separately and ran the DTW algorithm on them to identify whether a user has a recurring workload pattern. As very few users had such a pattern, we ended up ignoring a lot of data, so we considered the tasks of all users for DTW instead of per-user tasks.


10. Looking ahead

10.1 Improvements and optimizations

- Choosing a good reference curve for running DTW was difficult. Having a straight line as the reference curve gave us mediocre results, as curves with peaks at different instants of time were grouped as similar. So we compared results using the line x = y and a sine curve as reference curves, and got good results with the sine curve.

- Choosing a representative curve for a task cluster was done on a random basis due to time constraints. This can be bettered by using a curve-fitting algorithm to get an overall reference curve for a cluster.

- Incidents during prediction, such as the workload of a task changing to look like that of some other cluster, are not handled at present. This can be handled by continuously comparing the current task's workload with every cluster's reference curve; when the current task looks like it is shifting to some other cluster, the step curve can be remapped to the new cluster's reference curve dynamically. This constant monitoring and dynamic remapping improves the prediction accuracy.


GROUP MEMBERS

PRABHAKAR GANESAMURHTY

PRIYANKA MEHTA
ARUNRAJA SRINIVASAN
ABINAYA SHANMUGARAJ

[email protected] [email protected] [email protected] [email protected]

References:

1. Solís Moreno, I., Garraghan, P., Townend, P. M., and Xu, J. (2013). "An Approach for Characterizing Workloads in Google Cloud to Derive Realistic Resource Utilization Models." In: 2013 IEEE 7th International Symposium on Service Oriented System Engineering (SOSE), IEEE, pp. 49–60. ISBN 978-1-4673-5659-6.

2. Pelleg, D. and Moore, A. (2000). "X-means: Extending K-means with Efficient Estimation of the Number of Clusters."

3. Wang, X., et al. (2010). "Experimental comparison of representation methods and distance measures for time series data." Data Mining and Knowledge Discovery: 1–35.
