R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International

R. Hashemian1, D. Krishnamurthy1, M. Arlitt2, N. Carlsson3

The 4th ACM/SPEC International Conference on

Performance Engineering ICPE 2013

1. University of Calgary

2. HP Labs

3. Linköping University

by: Raoufehsadat Hashemian

• Introduction

• Scalability Evaluation

• Scalability Enhancement Approach

• Validation

• Conclusion

Improving the Scalability of a Multi-core Web Server ICPE13 2

OUTLINE

• Enterprise applications

• Performance: Improving QoS

• e.g. Lower response times

• Cost: Less money spent on hardware

• e.g. Improving effective utilization

• Goal: Higher utilization and acceptable response time

• How to achieve this “Goal” for Web servers running on Multi-

core hardware?

INTRODUCTION

PROBLEM DESCRIPTION


0

10

20

30

40

0 50 100 R

esp

on

se

Tim

e

CPU Utilization (%)

• Web servers before multi-core

• Mature topic, wide-ranging discussions

• Multi-core architecture

• Most research on batch (non-interactive) workload

• Web servers running on Multi-core

• BUS problem in UMA system (Veal et al.`07)

• Multiple Web server instances: 1 instance per

processor (Scogland et al.`09, Boyd et.al,10 Gaud et. al,11)

INTRODUCTION

BACKGROUND


SCALABILITY EVALUATION

SCALABILITY MEASUREMENT

• Measure Web server scalability for two workloads

• Evaluate the effectiveness of multiple Web server

approach in the server’s scalability

• Scalability

• Maximum Achievable Throughput (MAT)


• 2 x 4 core Intel Xeon E5620 processors

NUMA Architecture

• OS: Linux, kernel 3, Ubuntu

• Webserver: Lighttpd

• Application Server: php (FastCGI module)


EXPERIMENTAL SETUP

Nehalem Microarch.

2.4 GHz Frequency

32K IC - 32K DC L1 Cache

256K L2 Cache

12M (Inclusive) L3 Cache

QPI -5.86 GT/s Inter-conn.

16GB - DDR3-1333 Memory

Processor 0

L3

C

0

C

2

C

4

C

6

L2

L1 L1 L1 L1

L2 L2 L2

Processor 1

L3

C

0

C

2

C

4

C

6

L2

L1 L1 L1 L1

L2 L2 L2

Memory

Bank 0

Memory

Bank 1


• TCP/IP Intensive workload

• High TCP connection rate

• Processing: low user level & high kernel level

• 1 KB static file, up to 155,000 requests/second

• SPECweb Support workload

• Both static requests and php requests

• Wider range of request types

• Processing: high user level & moderate kernel level


WORKLOADS


• Change default lighttpd recommendation (1 Lighttpd worker

process per core)

• Disable default Linux scheduling (use affinity)

• Distribute interrupt handling load


CONFIGURATION TUNING

• Improved MAT up to 69%

• Balanced utilization levels for the eight cores

• Fully utilized the server


• TCP/IP Intensive workload

• Sub-linear Maximum Achievable Throughput

146,000 req/sec

• SPECweb Support workload

• Almost linear Maximum Achievable Throughput

23,000 req/sec

Scala

bility

S

cala

bility


RESULTS


Number of Cores

10-1

100

101

102

103

0

0.2

0.4

0.6

0.8

1

x = Response time (msec)

P [

X <

= x

]

1 Core 2 Core 4 Core 8 Core

Response time vs. Core Count

• “Low response time” requests

• Static requests

• Performance degrades

• “High response time” requests

• Dynamic requests

• Performance improves

Knowing this behavior, how can we

improve the scalability?


RESPONSE DISTRIBUTION ANALYSIS


CDF of Response times

80% CPU Utilization

SPECweb Support Workload

100

102

0.9

0.92

0.94

0.96

0.98

1

SCALABILITY ENHANCEMENT

MULTIPLE WEBSITE REPLICAS

• Approach: Use 1 Web server instance per processor

• Goal: Reduce inter-processor data migration

Processor 0 Processor 1

Single Replica Process NIC1 Queue NIC 2 Queue

NIC 2

NIC 1

Replica 1 Process NIC 1 Queue Replica 2 Process NIC 2 Queue

Processor 0 Processor 1

NIC 2

NIC 1

Original Configuration

with one replica

Alternative Configuration

with two replicas


Request rate (req/sec)

Response tim

e (

ms)

TCP/IP Intensive Workload

Scalability Improvement

MAT increment: 12.3%


Scalability Degradation

MAT decrement: 10%


EVALUATING NEW CONFIGURATION


Request rate (req/sec)

Response tim

e (

ms)

• Hypothesis:

• Cache contention with 2-

replicas due to the larger

working set size of dynamic

requests

• The response time inflation for Dynamic requests dominates the improvement

achieved for Static requests

• Mean and 99.9th percentile response times increase with 2-replicas

CDF of Response times

80% CPU Utilization

22,000 req/sec

SPECweb Support workload

P [X

<=

x]

Response Time (ms)

100

102

104

0.9

0.95

1


EVALUATING NEW CONFIGURATION


Request Rate (req/sec)

Inte

r-co

nn

ect T

raffic

(Byte

s/s

ec)


VALIDATION

INTER-CONNECT TRAFFIC

Inte

r-co

nn

ect T

raffic

(Byte

s/s

ec)

• Inter-connect traffic decreased

significantly

• Improved performance


TCP/IP Intensive Workload SPECweb Support Workload

• No significant decrement

• Improved performance for Static

requests

L3 C

ache H

IT R

atio



VALIDATION

LAST LEVEL CACHE


• Last Level cache (LLC) HIT ratio degrades with 2-replica configuration

Confirms the cache contention hypothesis

CONCLUSIONS

• Multi-core Web server: scalable after tuning

• 80% utilization with acceptable response time

• Multiple Website Replicas

• The effect on the scalability is workload dependent

• Dynamic requests trigger LLC contention

• Contention may be architecture and application dependent

• Future plan:

• Design and develop an automatic, workload adaptive technique

which decides about best configuration


This work is financially supported by:

Raoufeh Hashemian

University of Calgary, Canada

[email protected]


REFERENCES

• Cherkasova et al.`00:Characterizing Temporal Locality and its Impact on Web Server Performance,

International Conference on Computer Communications and Networks’00, Cherkasova; Ciardo; HP Labs

• Elnozahy et al.`03: Energy Conservation Policies for Web Servers, USITS '03, Elnozahy; Kistler; Ramakrishnan; IBM

• Majo et al.`12: Matching Memory Access Patterns and Data Placement for NUMA Systems, GC’12, Majo;

Gross; ETH

• Blagodurov et al.`11: A case for NUMA-aware contention management on multicore systems, USENIX

ATC'11, Blagodurov; Zhuravlev; Dashti; Fedorova; SFU

• Veal et al.`07: Performance scalability of a multi-core web server. ACM/IEEE ANCS’07,Veal; Foong; Intel

• Scogland et al.`09: Asymmetric interactions in symmetric multi-core systems: Analysis, enhancements and

evaluation. ACM/IEEE SC’08, Scogland; Balaji; Feng; Narayanaswamy,

• Boyd et.al`10: An analysis of linux scalability to many cores, USENIX OSDI’10, Boyd-Wickizer; Clements;

Mao; Pesterev; Kaashoek; Morris; Zeldovich, MIT

• Gaud et. al`11: Application-level optimizations on numa multicore architectures: the apache case study, RR-

LIG-011, Gaud; Lachaize; Lepers; Muller; Quema.

Improving the Scalability of a Multi-core Web Server ICPE13 -2

• Network interrupt handling

• 4 RSS queue per NIC port

• Each queue bind to one core




0.0

0.5

1.0

1.5

2.0

0 50,000 100,000 150,000 200,000

Re

spo

nse

tim

e (

mse

c)

Rate (req/sec)

Before Distributing Int. Load

After Distributing Int. Load

• OS scheduling

• Binding each lighttpd process to 1 core




0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

0 50000 100000 150000 200000

Re

spo

nse

tim

e (

mse

c)

Rate (req/sec)

No Affinity

With affinity

• Static: Requests with lower response time

• Processed only in Web tier (lighttpd)

• Dynamic: Requests with higher response time

• Processed only in Web and application tiers (lighttpd and php)

Re

sp

on

se

tim

e (

ms)


WEB TIER VS. APPLICATION TIER


File size (Byte)


EXPERIMENTAL SETUP


Documents

R. Hashemian , D. Krishnamurthy , M. Arlitt , N. Carlssonnikca89/papers/icpe13.slides.pdfR. Hashemian 1, D. Krishnamurthy , M. Arlitt2, N. Carlsson3 The 4th ACM/SPEC International