22
R. Hashemian 1 , D. Krishnamurthy 1 , M. Arlitt 2 , N. Carlsson 3 The 4th ACM/SPEC International Conference on Performance Engineering ICPE 2013 1. University of Calgary 2. HP Labs 3. Linköping University by: Raoufehsadat Hashemian

R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

Embed Size (px)

Citation preview

Page 1: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

R. Hashemian1, D. Krishnamurthy1, M. Arlitt2, N. Carlsson3

The 4th ACM/SPEC International Conference on

Performance Engineering ICPE 2013

1. University of Calgary

2. HP Labs

3. Linköping University

by: Raoufehsadat Hashemian

Page 2: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• Introduction

• Scalability Evaluation

• Scalability Enhancement Approach

• Validation

• Conclusion

Improving the Scalability of a Multi-core Web Server ICPE13 2

OUTLINE

Page 3: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• Enterprise applications

• Performance: Improving QoS

• e.g. Lower response times

• Cost: Less money spent on hardware

• e.g. Improving effective utilization

• Goal: Higher utilization and acceptable response time

• How to achieve this “Goal” for Web servers running on Multi-

core hardware?

INTRODUCTION

PROBLEM DESCRIPTION

Improving the Scalability of a Multi-core Web Server ICPE13 3

0

10

20

30

40

0 50 100 R

esp

on

se

Tim

e

CPU Utilization (%)

Page 4: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• Web servers before multi-core

• Mature topic, wide-ranging discussions

• Multi-core architecture

• Most research on batch (non-interactive) workload

• Web servers running on Multi-core

• BUS problem in UMA system (Veal et al.`07)

• Multiple Web server instances: 1 instance per

processor (Scogland et al.`09, Boyd et.al,10 Gaud et. al,11)

INTRODUCTION

BACKGROUND

Improving the Scalability of a Multi-core Web Server ICPE13 5

Page 5: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

SCALABILITY EVALUATION

SCALABILITY MEASUREMENT

• Measure Web server scalability for two workloads

• Evaluate the effectiveness of multiple Web server

approach in the server’s scalability

• Scalability

• Maximum Achievable Throughput (MAT)

Improving the Scalability of a Multi-core Web Server ICPE13 4

Page 6: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• 2 x 4 core Intel Xeon E5620 processors

NUMA Architecture

• OS: Linux, kernel 3, Ubuntu

• Webserver: Lighttpd

• Application Server: php (FastCGI module)

SCALABILITY EVALUATION

EXPERIMENTAL SETUP

Nehalem Microarch.

2.4 GHz Frequency

32K IC - 32K DC L1 Cache

256K L2 Cache

12M (Inclusive) L3 Cache

QPI -5.86 GT/s Inter-conn.

16GB - DDR3-1333 Memory

Processor 0

L3

C

0

C

2

C

4

C

6

L2

L1 L1 L1 L1

L2 L2 L2

Processor 1

L3

C

0

C

2

C

4

C

6

L2

L1 L1 L1 L1

L2 L2 L2

Memory

Bank 0

Memory

Bank 1

Improving the Scalability of a Multi-core Web Server ICPE13 6

Page 7: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• TCP/IP Intensive workload

• High TCP connection rate

• Processing: low user level & high kernel level

• 1 KB static file, up to 155,000 requests/second

• SPECweb Support workload

• Both static requests and php requests

• Wider range of request types

• Processing: high user level & moderate kernel level

SCALABILITY EVALUATION

WORKLOADS

Improving the Scalability of a Multi-core Web Server ICPE13 7

Page 8: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• Change default lighttpd recommendation (1 Lighttpd worker

process per core)

• Disable default Linux scheduling (use affinity)

• Distribute interrupt handling load

SCALABILITY EVALUATION

CONFIGURATION TUNING

• Improved MAT up to 69%

• Balanced utilization levels for the eight cores

• Fully utilized the server

Improving the Scalability of a Multi-core Web Server ICPE13 8

Page 9: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• TCP/IP Intensive workload

• Sub-linear Maximum Achievable Throughput

146,000 req/sec

• SPECweb Support workload

• Almost linear Maximum Achievable Throughput

23,000 req/sec

Scala

bility

S

cala

bility

SCALABILITY EVALUATION

RESULTS

Improving the Scalability of a Multi-core Web Server ICPE13 9

Number of Cores

Page 10: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

10-1

100

101

102

103

0

0.2

0.4

0.6

0.8

1

x = Response time (msec)

P [

X <

= x

]

1 Core 2 Core 4 Core 8 Core

Response time vs. Core Count

• “Low response time” requests

• Static requests

• Performance degrades

• “High response time” requests

• Dynamic requests

• Performance improves

Knowing this behavior, how can we

improve the scalability?

SCALABILITY EVALUATION

RESPONSE DISTRIBUTION ANALYSIS

Improving the Scalability of a Multi-core Web Server ICPE13 10

CDF of Response times

80% CPU Utilization

SPECweb Support Workload

100

102

0.9

0.92

0.94

0.96

0.98

1

Page 11: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

SCALABILITY ENHANCEMENT

MULTIPLE WEBSITE REPLICAS

• Approach: Use 1 Web server instance per processor

• Goal: Reduce inter-processor data migration

Processor 0 Processor 1

Single Replica Process NIC1 Queue NIC 2 Queue

NIC 2

NIC 1

Replica 1 Process NIC 1 Queue Replica 2 Process NIC 2 Queue

Processor 0 Processor 1

NIC 2

NIC 1

Original Configuration

with one replica

Alternative Configuration

with two replicas

Improving the Scalability of a Multi-core Web Server ICPE13 11

Page 12: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

Request rate (req/sec)

Response tim

e (

ms)

TCP/IP Intensive Workload

Scalability Improvement

MAT increment: 12.3%

SPECweb Support Workload

Scalability Degradation

MAT decrement: 10%

SCALABILITY ENHANCEMENT

EVALUATING NEW CONFIGURATION

Improving the Scalability of a Multi-core Web Server ICPE13 12

Request rate (req/sec)

Response tim

e (

ms)

Page 13: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• Hypothesis:

• Cache contention with 2-

replicas due to the larger

working set size of dynamic

requests

• The response time inflation for Dynamic requests dominates the improvement

achieved for Static requests

• Mean and 99.9th percentile response times increase with 2-replicas

CDF of Response times

80% CPU Utilization

22,000 req/sec

SPECweb Support workload

P [X

<=

x]

Response Time (ms)

100

102

104

0.9

0.95

1

SCALABILITY ENHANCEMENT

EVALUATING NEW CONFIGURATION

Improving the Scalability of a Multi-core Web Server ICPE13 13

Page 14: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

Request Rate (req/sec)

Inte

r-co

nn

ect T

raffic

(Byte

s/s

ec)

Request Rate (req/sec)

VALIDATION

INTER-CONNECT TRAFFIC

Inte

r-co

nn

ect T

raffic

(Byte

s/s

ec)

• Inter-connect traffic decreased

significantly

• Improved performance

Improving the Scalability of a Multi-core Web Server ICPE13 14

TCP/IP Intensive Workload SPECweb Support Workload

• No significant decrement

• Improved performance for Static

requests

Page 15: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

L3 C

ache H

IT R

atio

Request Rate (req/sec)

Improving the Scalability of a Multi-core Web Server ICPE13 15

VALIDATION

LAST LEVEL CACHE

SPECweb Support Workload

• Last Level cache (LLC) HIT ratio degrades with 2-replica configuration

Confirms the cache contention hypothesis

Page 16: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

CONCLUSIONS

• Multi-core Web server: scalable after tuning

• 80% utilization with acceptable response time

• Multiple Website Replicas

• The effect on the scalability is workload dependent

• Dynamic requests trigger LLC contention

• Contention may be architecture and application dependent

• Future plan:

• Design and develop an automatic, workload adaptive technique

which decides about best configuration

Improving the Scalability of a Multi-core Web Server ICPE13 16

Page 17: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

This work is financially supported by:

Raoufeh Hashemian

University of Calgary, Canada

[email protected]

Improving the Scalability of a Multi-core Web Server ICPE13 17

Page 18: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

REFERENCES

• Cherkasova et al.`00:Characterizing Temporal Locality and its Impact on Web Server Performance,

International Conference on Computer Communications and Networks’00, Cherkasova; Ciardo; HP Labs

• Elnozahy et al.`03: Energy Conservation Policies for Web Servers, USITS '03, Elnozahy; Kistler; Ramakrishnan; IBM

• Majo et al.`12: Matching Memory Access Patterns and Data Placement for NUMA Systems, GC’12, Majo;

Gross; ETH

• Blagodurov et al.`11: A case for NUMA-aware contention management on multicore systems, USENIX

ATC'11, Blagodurov; Zhuravlev; Dashti; Fedorova; SFU

• Veal et al.`07: Performance scalability of a multi-core web server. ACM/IEEE ANCS’07,Veal; Foong; Intel

• Scogland et al.`09: Asymmetric interactions in symmetric multi-core systems: Analysis, enhancements and

evaluation. ACM/IEEE SC’08, Scogland; Balaji; Feng; Narayanaswamy,

• Boyd et.al`10: An analysis of linux scalability to many cores, USENIX OSDI’10, Boyd-Wickizer; Clements;

Mao; Pesterev; Kaashoek; Morris; Zeldovich, MIT

• Gaud et. al`11: Application-level optimizations on numa multicore architectures: the apache case study, RR-

LIG-011, Gaud; Lachaize; Lepers; Muller; Quema.

Improving the Scalability of a Multi-core Web Server ICPE13 -2

Page 19: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• Network interrupt handling

• 4 RSS queue per NIC port

• Each queue bind to one core

SCALABILITY EVALUATION

CONFIGURATION TUNING

Improving the Scalability of a Multi-core Web Server ICPE13 -3

0.0

0.5

1.0

1.5

2.0

0 50,000 100,000 150,000 200,000

Re

spo

nse

tim

e (

mse

c)

Rate (req/sec)

Before Distributing Int. Load

After Distributing Int. Load

Page 20: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• OS scheduling

• Binding each lighttpd process to 1 core

SCALABILITY EVALUATION

CONFIGURATION TUNING

Improving the Scalability of a Multi-core Web Server ICPE13 -4

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

0 50000 100000 150000 200000

Re

spo

nse

tim

e (

mse

c)

Rate (req/sec)

No Affinity

With affinity

Page 21: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

• Static: Requests with lower response time

• Processed only in Web tier (lighttpd)

• Dynamic: Requests with higher response time

• Processed only in Web and application tiers (lighttpd and php)

Re

sp

on

se

tim

e (

ms)

SCALABILITY EVALUATION

WEB TIER VS. APPLICATION TIER

Improving the Scalability of a Multi-core Web Server ICPE13 -5

File size (Byte)

Page 22: R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

SCALABILITY EVALUATION

EXPERIMENTAL SETUP

Improving the Scalability of a Multi-core Web Server ICPE13 -6