Scalable Internet Services
Cluster Lessons and Architecture Design for Scalable Services
BR01, TACC, Porcupine, SEDA and Capriccio
Ben Y. Zhao [email protected]
Outline
• Overview of cluster services
• lessons from giant-scale services (BR01)
• SEDA (staged event-driven architecture)
• Capriccio
Scalable Servers
• Clustered services
• natural platform for large web services
• search engines, DB servers, transactional servers
• Key benefits
• low cost of computing: COTS nodes vs. SMP
• incremental scalability
• load balancing of traffic/requests across servers
• Extension of the single-server model
• reliable/fast communication, but partitioned data
Goals
• Failure transparency
• hot-swapping components without loss of availability
• homogeneous functionality and/or replication
• Load balancing
• partition data/requests for maximum service rate
• need to colocate requests with their associated data
• Scalability
• aggregate performance should scale with the number of servers
Two Different Models
• Read-mostly data
• web servers, DB servers, search engines (query)
• replicate across servers + (RR DNS / redirector); a minimal sketch follows below
[Diagram: clients on the IP network (WAN) reaching replicated servers via round-robin DNS]
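As a concrete illustration of the read-mostly model, here is a minimal round-robin redirector sketch in C. The server addresses and the pick_server() helper are made-up examples, not part of any system discussed in these slides.

#include <stdio.h>

/* Every replica holds the full (read-mostly) data set, so any server can
   answer any request; rotate through them round-robin, RR-DNS style. */
static const char *servers[] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
static unsigned next_server;

static const char *pick_server(void)
{
    const char *s = servers[next_server % 3];
    next_server++;
    return s;
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("request %d -> %s\n", i, pick_server());
    return 0;
}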
Two Different Models …
• Read-write model
• mail servers, e-commerce sites, hosted services
• small(er) replication factor for stronger consistency
[Diagram: clients on the IP network (WAN) reaching partitioned servers through a load redirector]
Key Architecture Challenges
• Providing high availability
• availability across component failures
• Handling flash crowds / peak load
• need support for massive concurrency
• Other challenges
• upgradability: maintaining availability and minimal cost during upgrades to software, hardware, or functionality
• error diagnosis: fast isolation of failures / performance degradation
Nuggets
• Definitions
• uptime = (MTBF – MTTR) / MTBF
• yield = queries completed / queries offered
• harvest = data available / complete data
• MTTR
• at least as important as MTBF
• much easier to tune and quantify
• DQ principle (numeric example below)
• data per query × queries per second ≈ constant
• physical bottlenecks limit overall throughput
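The definitions above are simple ratios. The short C program below just evaluates them on made-up example numbers (not measurements from BR01) to make the units concrete.

#include <stdio.h>

int main(void)
{
    double mtbf = 1000.0, mttr = 10.0;          /* hours between / to repair failures */
    double offered = 1.0e6, completed = 9.9e5;  /* queries in some interval */
    double avail_data = 95.0, total_data = 100.0;

    printf("uptime  = %.3f\n", (mtbf - mttr) / mtbf);    /* (MTBF - MTTR) / MTBF */
    printf("yield   = %.3f\n", completed / offered);     /* completed / offered */
    printf("harvest = %.3f\n", avail_data / total_data); /* available / complete */

    /* DQ principle: data-per-query x queries-per-second stays roughly constant,
       set by the physical bottleneck of the system. */
    double data_per_query = avail_data / total_data;
    double queries_per_second = 5000.0;
    printf("DQ ~ %.0f\n", data_per_query * queries_per_second);
    return 0;
}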
Tapestry Software Architecture
[Diagram: Tapestry software architecture — applications over an application programming interface; a core router with dynamic Tapestry and distance-map components and Patchwork over the network; all built on the SEDA event-driven framework in a Java Virtual Machine]
Impact of Correlated Events
• web / application servers
• independent requests
• maximize individual throughput
• correlated requests: A + B + C ⇒ D (sketch below)
• e.g. online continuous queries, sensor aggregation, p2p control layer, streaming data mining
[Diagram: independent requests arriving from the network vs. correlated events A, B and C combined by a single event handler to produce D]
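A minimal sketch of what "correlated requests" means for a handler: output D can only be produced once inputs A, B and C have all arrived, so state must persist across events. All names here (struct group, on_event) are invented for the example.

#include <stdio.h>

struct group {
    int have_a, have_b, have_c;    /* which correlated inputs have arrived */
};

/* kind: 0 = A, 1 = B, 2 = C */
void on_event(struct group *g, int kind)
{
    if (kind == 0) g->have_a = 1;
    if (kind == 1) g->have_b = 1;
    if (kind == 2) g->have_c = 1;

    if (g->have_a && g->have_b && g->have_c) {
        printf("emit D\n");                     /* A + B + C => D */
        g->have_a = g->have_b = g->have_c = 0;  /* reset for the next round */
    }
}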
Capriccio
• User-level lightweight threads (SOSP '03)
• Argument
• threads are the natural programming model
• current problems are a result of the implementation, not a fundamental flaw
• Approach
• aim for massive scalability
• compiler assistance
• linked stacks, blocking-graph scheduling
The Price of Concurrency
• Why is concurrency hard?
• Race conditions
• Code complexity
• Scalability (no O(n) operations)
• Scheduling & resource sensitivity
• Inevitable overload
• Performance vs. programmability
• No good solution
[Graph: ease of programming vs. performance, with points for threads, events, and an ideal combining both]
The Answer: Better Threads
• Goals
• Simple programming model
• Good tools & infrastructure
• Languages, compilers, debuggers, etc.
• Good performance
• Claims
• Threads are preferable to events
• User-level threads are key
“But Events Are Better!”
• Recent arguments for events
• Lower runtime overhead
• Better live state management
• Inexpensive synchronization
• More flexible control flow
• Better scheduling and locality
• All true, but…
• Lauer & Needham duality argument
• Criticisms of specific threads packages
• No inherent problem with threads!
Criticism: Runtime Overhead
• Criticism: Threads don’t perform well for high concurrency
• Response
• Avoid O(n) operations
• Minimize context switch overhead
• Simple scalability test
• Slightly modified GNU Pth
• Thread-per-task vs. single thread
• Same performance!
[Graph: requests per second vs. concurrent tasks (1 to 1,000,000) for the event-based server and the threaded server, showing nearly identical performance]
Criticism: Synchronization
• Criticism: Thread synchronization is heavyweight
• Response
• Cooperative multitasking works for threads, too! (see the sketch below)
• Also presents the same problems
• Starvation & fairness
• Multiprocessors
• Unexpected blocking (page faults, etc.)
• Both regimes need help
• Compiler / language support for concurrency
• Better OS primitives
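A minimal sketch of why synchronization can be cheap under cooperative user-level threads on a single CPU: the scheduler only switches at yield points, so a mutex needs no atomic instructions. This is illustrative, not Capriccio's actual API; coop_yield() stands in for the package's "run another thread" primitive.

extern void coop_yield(void);   /* assumed: cooperatively hand the CPU to another thread */

struct coop_mutex { int held; };

static void coop_lock(struct coop_mutex *m)
{
    while (m->held)     /* no preemption between the test and the set, so this is race-free */
        coop_yield();   /* let the holder run until it releases */
    m->held = 1;
}

static void coop_unlock(struct coop_mutex *m)
{
    m->held = 0;
}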
Criticism: Scheduling
• Criticism: Thread schedulers are too generic
• Can’t use application-specific information
• Response
• 2D scheduling: task & program location
• Threads schedule based on task only
• Events schedule by location (e.g. SEDA)
• Allows batching
• Allows prediction for SRCT
• Threads can use 2D scheduling, too! (sketch below)
• Runtime system tracks current location
• Call graph allows prediction
[Diagram: the 2D scheduling space of task vs. program location — threads schedule along the task axis, events along the program-location axis]
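To make "2D scheduling" concrete for threads, here is an illustrative sketch (not from the paper) of per-location run queues: the runtime tracks each thread's current program location, and the scheduler picks among locations by priority rather than by arrival order alone.

#define MAX_NODES 64

struct thread;                          /* opaque user-level thread handle */

struct run_queue {
    struct thread *head;                /* threads currently at this program location */
    int priority;                       /* set from runtime measurements */
};

static struct run_queue queues[MAX_NODES];

/* Pick the next thread by program location (the second scheduling dimension),
   not just by task arrival order. */
struct thread *pick_next(void)
{
    int best = -1;
    for (int i = 0; i < MAX_NODES; i++)
        if (queues[i].head && (best < 0 || queues[i].priority > queues[best].priority))
            best = i;
    return best >= 0 ? queues[best].head : 0;
}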
The Proof’s in the Pudding
• User-level threads package
• Subset of pthreads
• Intercept blocking system calls (see the sketch below)
• No O(n) operations
• Support > 100K threads
• 5000 lines of C code
• Simple web server: Knot
• 700 lines of C code
• Similar performance
• Linear increase, then steady
• Drop-off due to poll() overhead
[Graph: bandwidth (Mbits/second) vs. concurrent clients (1 to 16384) for KnotC (favor connections), KnotA (favor accept), and Haboob]
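To give a flavor of "intercept blocking system calls", here is a hedged sketch of wrapping read(): make the descriptor non-blocking and yield the calling thread until it is ready. wait_for_readable() is an assumed scheduler hook, not a real Capriccio function.

#include <unistd.h>
#include <errno.h>
#include <fcntl.h>

extern void wait_for_readable(int fd);  /* assumed: park this thread until fd is ready */

ssize_t my_read(int fd, void *buf, size_t count)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0 || errno != EAGAIN)
            return n;                    /* data, EOF, or a real error */
        wait_for_readable(fd);           /* other user-level threads run meanwhile */
    }
}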
Arguments For Threads
• More natural programming model
• Control flow is more apparent
• Exception handling is easier
• State management is automatic
• Better fit with current tools & hardware
• Better existing infrastructure
Why Threads: Control Flow
• Events obscure control flow
• For programmers and tools
Threads:
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  read_request(&s);
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(s);
  if( !in_cache(s) )
    read_file(s);
}

Events:
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  …; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if( !in_cache(s) ) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
. . .
ExitHandler(struct session *s) {
  …; unpin(s); free_session(s);
}
[Diagram: web server control flow — AcceptConn, ReadRequest, PinCache, ReadFile, WriteResponse, Exit]
Why Threads: Exceptions
• Exceptions complicate control flow
• Harder to understand program flow
• Cause bugs in cleanup code
[Diagram: web server control flow — AcceptConn, ReadRequest, PinCache, ReadFile, WriteResponse, Exit]
Threads:
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  if( !read_request(&s) )
    return;
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(s);
  if( !in_cache(s) )
    read_file(s);
}

Events:
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  …; if( error ) return; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if( !in_cache(s) ) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
. . .
ExitHandler(struct session *s) {
  …; unpin(s); free_session(s);
}
Why Threads: State Management
(The slide repeats the Threads vs. Events code and web-server diagram shown above.)
• Events require manual state management
• Hard to know when to free
• Use GC or risk bugs
Why Threads: Existing Infrastructure
• Lots of infrastructure for threads
• Debuggers
• Languages & compilers
• Consequences
• More amenable to analysis
• Less effort to get working systems
Building Better Threads
• Goals
• Simplify the programming model
• Thread per concurrent activity
• Scalability (100K+ threads)
• Support existing APIs and tools
• Automate application-specific customization
• Mechanisms
• User-level threads
• Plumbing: avoid O(n) operations
• Compile-time analysis
• Run-time analysis
Case for User-Level Threads
• Decouple the programming model from the OS
• Kernel threads
• Abstract the hardware
• Expose device concurrency
• User-level threads
• Provide a clean programming model
• Expose logical concurrency
• Benefits of user-level threads
• Control over the concurrency model!
• Independent innovation
• Enables static analysis
• Enables application-specific tuning
• Similar argument to the design of overlay networks
[Diagram: the application and user-level threads sit above the user/kernel boundary; kernel threads and the OS sit below]
Capriccio Internals
• Cooperative user-level threads
• Fast context switches
• Lightweight synchronization
• Kernel mechanisms
• Asynchronous I/O (Linux)
• Efficiency
• Avoid O(n) operations
• Fast, flexible scheduling
Safety: Linked Stacks
• The problem: fixed stacks
• Overflow vs. wasted space
• LinuxThreads: 2 MB per stack
• Limits the number of threads
• The solution: linked stacks
• Allocate space as needed
• Compiler analysis
• Add runtime checkpoints
• Guarantee enough space until the next check
[Diagram: fixed stacks either waste space or overflow; a linked stack grows in chunks as needed]
Linked Stacks: Algorithm
• Parameters
• MaxPath
• MinChunk
• Steps
• Break cycles
• Trace back
• Checkpoints limit unchecked path length to MaxPath (runtime sketch below)
• Special cases
• Function pointers
• External calls: use a large stack
[Diagram: call graph annotated with per-function stack sizes (5, 4, 2, 6, 3, 3, 2, 3); checkpoints are placed so no unchecked path exceeds MaxPath = 8]
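A rough sketch of the runtime half of linked stacks: at each compiler-inserted checkpoint, check whether the current chunk has enough room for the worst-case unchecked path and, if not, chain on a new chunk. Everything here (stack_chunk, stack_check, MINCHUNK) is illustrative rather than Capriccio's real data structures, and the actual stack-pointer switching is elided.

#include <stdlib.h>

#define MINCHUNK 4096                  /* smallest chunk worth allocating */

struct stack_chunk {
    struct stack_chunk *prev;          /* chunk to return to */
    char *base;                        /* lowest address of this chunk */
    size_t size;
};

static struct stack_chunk *current_chunk;

/* Called at compiler-inserted checkpoints; `needed` is the worst-case stack
   use along any path to the next checkpoint (bounded by MaxPath). */
void stack_check(size_t needed)
{
    char probe;                                        /* approximates the stack pointer */
    size_t remaining = (size_t)(&probe - current_chunk->base);
    if (remaining < needed) {
        size_t sz = needed > MINCHUNK ? needed : MINCHUNK;
        struct stack_chunk *c = malloc(sizeof *c);
        c->base = malloc(sz);
        c->size = sz;
        c->prev = current_chunk;
        current_chunk = c;
        /* a real implementation would now switch the stack pointer into the new
           chunk and arrange to switch back (and free it) on return */
    }
}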
Special Cases
• Function pointers
• categorize function pointers by the number and types of their arguments
• “guess” which functions can be called at each site
• External functions
• users annotate trusted stack bounds on libraries
• or (re)use a small number of large stack chunks
• Result
• use/reuse stack chunks much like virtual memory pages
• can efficiently share stack chunks
• memory-touch benchmark: factor of 3 reduction in paging cost
Scheduling: Blocking Graph
• Lessons from event systems
• Break the app into stages
• Schedule based on stage priorities
• Allows SRCT scheduling, finding bottlenecks, etc.
• Capriccio does this for threads
• Deduce the stage from stack traces at blocking points (sketch below)
• Prioritize based on runtime information
[Diagram: blocking graph of a web server with nodes Accept, Read, Open, Read, Write, Close, Close]
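A small sketch of "deduce the stage with stack traces at blocking points": hash the return addresses on the stack when a thread blocks, so the same code path always maps to the same blocking-graph node. This uses glibc's backtrace() for illustration; it is not Capriccio's implementation.

#include <execinfo.h>
#include <stdint.h>

#define MAX_FRAMES 16

uint64_t blocking_node_id(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);  /* capture the current call stack */
    uint64_t h = 1469598103934665603ULL;    /* FNV-1a over the return addresses */
    for (int i = 0; i < n; i++) {
        h ^= (uint64_t)(uintptr_t)frames[i];
        h *= 1099511628211ULL;
    }
    return h;   /* same code path at the blocking point => same node id */
}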
Resource-Aware Scheduling
• Track resources used along blocking-graph edges
• Memory, file descriptors, CPU
• Predict the future from the past
• Algorithm (sketch below)
• Increase use when underutilized
• Decrease use near saturation
• Advantages
• Operate near the knee without thrashing
• Automatic admission control
[Diagram: blocking graph of a web server with nodes Accept, Read, Open, Read, Write, Close, Close]
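A minimal sketch of the "increase when underutilized, decrease near saturation" rule as an admission-control knob. The thresholds and names (resource_ctl, adjust_admission) are assumptions for illustration, not Capriccio's parameters.

struct resource_ctl {
    double utilization;   /* 0.0 .. 1.0, measured along blocking-graph edges */
    int    max_admitted;  /* threads allowed past this admission point */
};

void adjust_admission(struct resource_ctl *r)
{
    if (r->utilization < 0.75)
        r->max_admitted += 1;          /* underutilized: admit a little more work */
    else if (r->utilization > 0.90 && r->max_admitted > 1)
        r->max_admitted /= 2;          /* near the knee: back off before thrashing */
}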
Pitfalls
• What is the maximum amount of a resource?
• depends on the workload
• e.g. disk thrashing depends on sequential vs. random seeks
• use early signs of thrashing to indicate maximum capacity
• Detecting thrashing (sketch below)
• can only be estimated using a “productivity / overhead” ratio
• productivity itself is a guess (threads created, files opened/closed)
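A sketch of the "productivity / overhead" estimate: count coarse signs of useful work against time spent on overhead, and treat a falling ratio as an early sign of thrashing. The counters are illustrative guesses, as the slide itself notes.

struct run_stats {
    unsigned long threads_created;     /* productivity proxies */
    unsigned long files_opened;
    unsigned long files_closed;
    unsigned long sched_ticks;         /* overhead proxies */
    unsigned long io_wait_ticks;
};

double productivity_ratio(const struct run_stats *s)
{
    double productive = (double)(s->threads_created + s->files_opened + s->files_closed);
    double overhead   = (double)(s->sched_ticks + s->io_wait_ticks) + 1.0; /* avoid /0 */
    return productive / overhead;      /* a falling ratio suggests approaching capacity */
}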
Thread Performance
Time of thread operations (microseconds):
                         Capriccio   Capriccio-notrace   LinuxThreads   NPTL
Thread creation          21.5        21.5                37.5           17.7
Context switch           0.56        0.24                0.71           0.65
Uncontested mutex lock   0.04        0.04                0.14           0.15
• Slightly slower thread creation
• Faster context switches
• Even with stack traces!
• Much faster mutexes
Runtime Overhead
• Tested Apache 2.0.44
• Stack linking
• 78% slowdown for a null call
• 3–4% overall
• Resource statistics
• 2% (on all the time)
• 0.1% (with sampling)
• Stack traces
• 8% overhead
Microbenchmark: Producer / Consumer
Web Server Performance
Example of “Great Systems Paper”
• observe a higher-level issue
• threads vs. events as programming abstractions
• use previous work (duality) to identify the problem
• why are threads not as efficient as events?
• good systems design
• call-graph analysis for linked stacks
• resource-aware scheduling
• good execution
• full, solid implementation
• analysis leading to a full understanding of the detailed issues
• cross-area approach (help from PL research)
Acknowledgements
• Many slides “borrowed” from the respective talks / papers:
• Capriccio (Rob von Behren)
• SEDA (Matt Welsh)
• Brewer01: “Lessons…”