DESCRIPTION
Facebook is one of the top sites on the internet and supports more than 900 million users. Every day it handles billions of messages and hundreds of millions of photos, and generates hundreds of terabytes of data. This data is also becoming more complex and interconnected over time. Every page the site serves requires processing large amounts of data and must be rendered in milliseconds. Business and practical constraints dictate that more users be served with fewer resources. In addition, the product changes often and rapidly. Together, these constraints require an infrastructure that is scalable, fast, efficient, and flexible beyond anything built before. In this talk, we will share key lessons from our experience building an infrastructure that addresses these challenges. In particular, we will discuss key components of the Facebook software architecture; the instrumentation and data collection mechanisms that let us monitor the health of the site; and innovative tools that analyze vast amounts of data to help us pre-empt site issues and identify root causes when things go wrong. We will describe how this infrastructure and these tools allow engineers to move fast and rapidly launch products as Facebook builds for a billion users and beyond.
Enabling Fast Pages and Furious Development While Supporting a Billion Users
Subbu Subramanian, Ph.D. Software Engineer
Pottery Challenge
Pottery Challenge: Day 3
Team 1: Make a PERFECT pot
Team 2: Make 20 pots
950,000,000
700B minutes spent on the site every month
2.5M sites using social plugins
30B pieces of content shared each month
500M daily active users
Latest stats @ http://newsroom.fb.com/content/default.aspx?NewsAreaId=22
Requires tackling some UNIQUE TECHNOLOGY CHALLENGES
Scaling traditional websites
Bob
Bob’s data
Memcache: At first, each school was its own cluster of machines, and the clusters were not connected. Then campuses could connect to other campuses, which threw everything into the blender and crashed the system. To make it scale, we looked to open source and found memcache. When memcache ran out of steam, we put engineers on figuring out where it was falling short. As load grew (shown as a rising graph, starting from MIT), we solved each bottleneck in memcache, but every fix let the next thing break. We were so successful in some areas that we blew up our switches.
Scaling Facebook: Interconnected data
Bob, Brian, Felicia
News Feed
950 million unique home pages
Multifeed
Actor ID - Object ID - Story type
Stories for up to 5000 friends in milliseconds
Examines thousands of stories to find the 45 most interesting, returned in milliseconds
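The slides describe Multifeed stories as (Actor ID, Object ID, Story type) records and say the 45 most interesting are picked out of thousands. This sketch shows only that final selection step. It assumes a hypothetical `score` function, because the actual ranking model is not described here. `heapq.nlargest` keeps a heap of size k, so choosing 45 out of n candidates costs O(n log 45) rather than a full sort.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Story:
    # The three fields the slides list for a Multifeed story record.
    actor_id: int
    object_id: int
    story_type: str

# Hypothetical per-type weights; the real ranking signals are not described in the slides.
TYPE_WEIGHT = {"photo": 1.5, "status": 1.0, "like": 0.5}

def score(story: Story, affinity: dict) -> float:
    # Hypothetical score: viewer's affinity for the actor times a story-type weight.
    return affinity.get(story.actor_id, 0.0) * TYPE_WEIGHT.get(story.story_type, 0.1)

def top_stories(candidates, affinity, k=45):
    # Keep only the k best candidates; the result is sorted best first.
    return heapq.nlargest(k, candidates, key=lambda s: score(s, affinity))
```

Swapping in a different `score` leaves the selection step unchanged, which is the point of separating the two.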
Memcache
TAO (custom cache)
1 Billion operations per second
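Memcache and TAO are only named on this slide. A common way to deploy memcache is as a demand-filled look-aside cache. Reads try the cache first and fall back to the database on a miss. Writes update the database and then delete the cached key.

The sketch below illustrates that generic pattern, with plain dicts standing in for memcache and MySQL. `LookAsideCache` is a hypothetical name. This is not a description of TAO, which the slides call a custom cache.

```python
class LookAsideCache:
    # Generic look-aside pattern; dicts stand in for a memcache server and a database.
    def __init__(self, database: dict):
        self.db = database    # stand-in for the persistent store
        self.cache = {}       # stand-in for memcache
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.db.get(key)      # miss: read from the database
        if value is not None:
            self.cache[key] = value   # demand-fill so the next read is a hit
        return value

    def set(self, key, value):
        self.db[key] = value          # write to the database first
        self.cache.pop(key, None)     # then invalidate instead of updating the cache
```

Deleting rather than overwriting the cached value means a reader can never be left holding a value the database never accepted.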
800M
New Apps
Sign Up February 2004
NewsFeed 2006
Platform launch 2007
Translations 2008
The Stream 2009
Open Graph 2010
Social Plugins 2010
Photos Update 2010
Places 2010
Mobile Event 2010
Groups 2010
Messages 2010
New Profile 2010
Questions 2011
Unified Mobile Sites
while supporting growth …
New Apps 2004/2005
Pottery Challenge
Team 1: Make a PERFECT pot
Team 2: Make 20 pots
Scale
Photo by Eole at http://www.flickr.com/photos/eole/2193801804// used under a Creative Commons license
Move Fast
Moving fast does not mean poor quality. We want a high ship rate.
Invest in removing friction that slows us down
Starting on day ONE
Follow your passion – pick your team
Push any time you want to
…
Empower Engineers
Commits per Month (chart, January 2006 to January 2012)
Be Bold and innovate
Move Fast and build things
Scale Big with min resources
OMG!
How can Infrastructure support these goals?
• Pre-empt issues before they hit production
• Know immediately when things go wrong
• Know what to do when things go wrong
=> LOTS OF INSTRUMENTATION, TOOLS and AUTOMATION
News feed
Perflab (aka Difflab)
• Performance test every commit
• Spot regressions before deploy
• Also tracks slow drift regressions
• Helps us push thousands of revs per week
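Perflab is described as performance-testing every commit and also catching slow drift. Below is a minimal sketch of both checks. It assumes each commit yields a list of samples of one metric, for example CPU time per request.

The function names, thresholds, and the use of medians are all assumptions, not Perflab's actual method.

```python
from statistics import median

def regression_pct(baseline, candidate):
    # Percent change of the candidate's median relative to the baseline's median.
    base = median(baseline)
    return (median(candidate) - base) / base * 100.0

def check_commit(parent_samples, commit_samples, threshold_pct=1.0):
    # Per-commit check (hypothetical threshold): flag a commit measurably slower than its parent.
    return regression_pct(parent_samples, commit_samples) > threshold_pct

def check_drift(history, window=100, threshold_pct=3.0):
    # Drift check: `history` holds one sample list per commit, oldest first.
    # Each small step may pass check_commit, yet the latest build can still be
    # much slower than the build `window` commits earlier.
    if len(history) <= window:
        return False
    return regression_pct(history[-window - 1], history[-1]) > threshold_pct
```

A 0.05% slowdown per commit passes the per-commit check every time, but compounds to about 5% over 100 commits, which is the kind of regression only the drift check sees.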
Gatekeeper
if (gk_check('secret_project', $user)) {
  render_cool_feature();    // user passes the gate: show the new feature
} else {
  render_normal_feature();  // everyone else keeps the current experience
}
Simple code but powerful check
• Many options for precise targeting
• 500M+ gatekeeper checks performed every second
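The snippet above gates a feature on `gk_check('secret_project', $user)`. The sketch below, written in Python rather than the snippet's PHP, shows one hypothetical way such a check could evaluate targeting rules. It supports an employees-only flag, a country list, and a percentage rollout.

The `GateRule` fields and the `GATES` table are invented for illustration. The slides say only that there are many options for precise targeting.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class User:
    user_id: int
    country: str
    is_employee: bool = False

@dataclass
class GateRule:
    # Hypothetical targeting options, for illustration only.
    employees_only: bool = False
    countries: set = field(default_factory=set)   # empty means any country
    rollout_pct: float = 100.0                    # share of eligible users admitted

# Hypothetical gate table: 10% of users in the US and Canada.
GATES = {"secret_project": GateRule(countries={"US", "CA"}, rollout_pct=10.0)}

def bucket(project, user_id):
    # Stable value in [0, 100) derived from (project, user): a user always lands in
    # the same bucket for a project, and different projects sample different users.
    digest = hashlib.sha256(f"{project}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10000 / 100.0

def gk_check(project, user):
    rule = GATES.get(project)
    if rule is None:
        return False                      # unknown gate: fail closed
    if rule.employees_only and not user.is_employee:
        return False
    if rule.countries and user.country not in rule.countries:
        return False
    return bucket(project, user.user_id) < rule.rollout_pct
```

Hashing rather than random sampling keeps the decision stable across requests, so a user does not flicker in and out of a test.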
Rigorous Test Coverage
Assigning ownership to failures
Canary Tier and Delta view
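The slide names a canary tier and a delta view without further detail. Assume that new code first runs on a small canary slice of production, and that its metrics are compared against the rest of production. Under that assumption, a delta view could be computed roughly as follows.

The metric names and the 5% tolerance are hypothetical.

```python
def delta_view(prod, canary, tolerance_pct=5.0):
    # Compare each metric reported by both tiers. Assumes every metric is
    # "lower is better" (error rates, latency), so only increases are flagged.
    rows = []
    for name in sorted(prod.keys() & canary.keys()):
        base, new = prod[name], canary[name]
        if base:
            change = (new - base) / base * 100.0
        else:
            change = 0.0 if new == 0 else float("inf")
        rows.append((name, round(change, 1), change > tolerance_pct))
    return rows
```

Each row is (metric, percent change, flagged). A flagged row is a reason to stop the push before the code reaches the rest of the fleet.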
*Lots* of Instrumentation = Fire Hose of Data
Claspin
• High-density heatmap viewer for large services
• Find needles in a haystack -> drill down quickly
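Claspin is described as a high-density heatmap viewer for finding needles in a haystack. As a rough text-mode illustration, assume one non-negative metric per host. The sketch lays the hosts out on a grid, maps each value to one of ten shade characters, and lists the hottest hosts to drill into first.

The shade scale, grid width, and host names are made up.

```python
SHADES = " .:-=+*#%@"   # ten intensity levels, lightest to darkest

def heatmap(values, width=8):
    # values: host name -> non-negative metric (e.g. CPU %). Hosts are placed row by
    # row in sorted order; each character encodes the value relative to the maximum.
    hosts = sorted(values)
    peak = max(values.values()) or 1.0
    top = len(SHADES) - 1
    cells = [SHADES[min(int(values[h] / peak * len(SHADES)), top)] for h in hosts]
    return ["".join(cells[i:i + width]) for i in range(0, len(cells), width)]

def hottest(values, n=3):
    # The "needles": hosts to drill into first.
    return sorted(values, key=values.get, reverse=True)[:n]
```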
tasks sevmanager logview testconsole
differential wirehog domino groups
hipal hsh hud kobold
ods opsfeed scuba serf
Scuba: a tool for diving into an emerald sea of data
Motivation
Requirements for data exploration
Need
✓ Speed
✓ Real-time data
✓ Ad-hoc filtering and grouping
✓ AVG, SUM, COUNT, histograms & percentiles
Don’t Need
⤫ Replication
⤫ Transactions
⤫ Long retention
⤫ Table joins
⤫ Unique keys
⤫ Full map-reduce
Hive (hadoop)
• “Unlimited” storage/CPU
• Full-featured Query Language
• Numerous tools and Frameworks
But Slow!
MySQL
Works, but …
Scuba: Data Model
Few pre-defined types and operations
“Data Sets”
- No upfront schema declaration
- Stored in memory
- Sorted by timestamp
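To make this data model concrete, here is a minimal in-memory sketch. Rows carry whatever columns the writer supplies, with no upfront schema. Rows are kept ordered by timestamp, so a time-window query is two binary searches.

The `DataSet` class is hypothetical and ignores Scuba's typed columns.

```python
import bisect

class DataSet:
    # Hypothetical in-memory table: each row is a dict of arbitrary columns, and a
    # parallel list of timestamps keeps the rows sorted by time.
    def __init__(self):
        self.times = []
        self.rows = []

    def add(self, ts, **columns):
        i = bisect.bisect_right(self.times, ts)   # usually the end: data arrives in time order
        self.times.insert(i, ts)
        self.rows.insert(i, columns)

    def between(self, start, end):
        # Binary search finds the time window without scanning the whole table.
        lo = bisect.bisect_left(self.times, start)
        hi = bisect.bisect_right(self.times, end)
        return self.rows[lo:hi]

    def columns(self):
        # The "schema" is simply every column seen so far.
        return sorted({c for row in self.rows for c in row})
```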
Scuba Data Types: Integers
VLQ-encoded array of 64-bit integers (stored in a single char* buffer)
O(1) lookup
Usage
Aggregate on these (SUM, AVG, etc.)
Filter (==, <=, >=)
“Which pages on the site have an average wall time in the last hour > 2 seconds”
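VLQ (variable-length quantity) encoding stores small integers in fewer bytes. The slides do not specify the byte layout, so this sketch assumes the common LEB128-style variant also used by Protocol Buffers varints. Each byte carries 7 payload bits, least significant group first, and the high bit marks that more bytes follow.

The sketch handles unsigned values only.

```python
def vlq_encode(values):
    # Each unsigned 64-bit integer becomes 1 to 10 bytes.
    out = bytearray()
    for v in values:
        if not 0 <= v < 2**64:
            raise ValueError("expects unsigned 64-bit integers")
        while True:
            byte = v & 0x7F
            v >>= 7
            if v:
                out.append(byte | 0x80)   # continuation bit: more bytes follow
            else:
                out.append(byte)          # last byte of this value
                break
    return bytes(out)

def vlq_decode(buf):
    values, current, shift = [], 0, 0
    for byte in buf:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values
```

Decoding is sequential, which suits scans that filter or aggregate a whole column, such as averaging wall times.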
Scuba Data Types: Normals
Strings mapped to ints, stored as an array of ints
size: 4 bytes
String Normalization
// char* => int32
'home.php', 'dc1', 'a2', 'en_US' => 32, 14, 3, 289
Usage
Group By value
Filter (==, !=, in set, not in set)
“Top 10 countries that have the slowest pages today”
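String normalization is dictionary encoding. Each distinct string is stored once, and every row holds only a 4-byte id. This sketch assigns ids in first-seen order, so they will not match the 32, 14, 3, 289 in the slide's example. It stores the ids in `array('i')`, whose items are typically 4 bytes.

`Normalizer`, `normalize_column`, `group_count`, and `filter_in` are hypothetical names.

```python
from array import array
from collections import Counter

class Normalizer:
    # Hypothetical string dictionary: each distinct string gets the next int id.
    def __init__(self):
        self.ids = {}        # string -> id
        self.strings = []    # id -> string

    def encode(self, s):
        if s not in self.ids:
            self.ids[s] = len(self.strings)
            self.strings.append(s)
        return self.ids[s]

    def decode(self, i):
        return self.strings[i]

def normalize_column(values, norm):
    # 'i' is a C int, typically 4 bytes, matching the slide's "size: 4 bytes".
    return array("i", (norm.encode(v) for v in values))

def group_count(column, norm):
    # GROUP BY runs on ints; strings are looked up only to label the output.
    return {norm.decode(i): n for i, n in Counter(column).items()}

def filter_in(column, norm, wanted):
    # "in set" filter: translate query strings to ids once, then compare ints per row.
    ids = {norm.ids[s] for s in wanted if s in norm.ids}
    return [row for row, i in enumerate(column) if i in ids]
```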
Scuba Data Types: Denorm
Array of plain ol' char*
To be used ONLY for unique identifiers that would not benefit from normalization
size: 8 + strlen + 1 bytes
Usage
None other than displaying the value
Cannot filter or group by these. No native regex support.
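The size formulas on these slides explain why unique identifiers are stored as denorms. A quick calculation makes the point, under one assumption the slides do not state: a normal's dictionary entry costs about the same as one denorm value.

A repeated string is then paid for once. A column of unique IDs, however, pays for both the 4-byte id per row and a dictionary entry per row.

```python
def denorm_bytes(column):
    # Slide formula per row: 8 + strlen + 1 (presumably a pointer plus a
    # NUL-terminated copy of the string).
    return sum(8 + len(s.encode()) + 1 for s in column)

def normal_bytes(column):
    # Slide formula: 4 bytes per row. Assumption: each distinct string also needs
    # one dictionary entry, costed here like a single denorm value.
    return 4 * len(column) + denorm_bytes(set(column))
```

For 1,000 copies of 'home.php' this gives 4,017 bytes normalized versus 17,000 as denorms. For 1,000 unique 10-character request IDs it gives 23,000 normalized versus 19,000 as denorms.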
Scuba Data Types: Tagsets
A set of normals, stored as a bit vector
size: 8 + 2 + ceil(I / 8) bytes where I is the max index represented
Usage
Filter (has all tags, has some tags, has none of these tags)
Bit Set
'timeline', 'mercury', 'titan_drafts' => 0, 5, 14
“Is there a difference in CPU usage across users in test group A vs. test group B”
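A tagset can be modeled as an integer bit vector, with bit i set when the normalized tag with id i is present. Each of the three filters then reduces to a single AND.

This sketch uses Python's arbitrary-precision ints in place of Scuba's byte array and reproduces the slide's size formula. The function names are hypothetical.

```python
import math

def to_bits(tag_ids):
    # Set bit i for every normalized tag id i in the row.
    bits = 0
    for i in tag_ids:
        bits |= 1 << i
    return bits

def has_all(row, query):
    return (row & query) == query

def has_some(row, query):
    return (row & query) != 0

def has_none(row, query):
    return (row & query) == 0

def tagset_bytes(row):
    # Slide formula: 8 + 2 + ceil(I / 8), with I the max index represented.
    max_index = row.bit_length() - 1
    return 8 + 2 + math.ceil(max_index / 8)
```

With the slide's example ('timeline', 'mercury', 'titan_drafts' => 0, 5, 14), `to_bits([0, 5, 14])` is 16417 and the formula gives 12 bytes.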
Storage
1 aggregator / box
8+ leaves / box
Hundreds of boxes
Data stored across terabytes of RAM!
Leaves
Multiple per machine (#cores / N)
Only queried by the aggregator on the local machine
Persist all write traffic to disk (compressed). Replay all writes on startup
Store all samples efficiently in memory
Leaves are independent; No shared state
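Each leaf persists its write traffic to disk in compressed form and replays it on startup. Below is a minimal sketch of that idea, assuming JSON rows, one length-prefixed record per write, and zlib compression.

The file format and the `Leaf` class are invented. Compressing each record separately is simpler than batching writes, but compresses worse.

```python
import json
import struct
import zlib

class Leaf:
    # Hypothetical leaf: rows live in memory; every write is also appended to a
    # compressed on-disk log that is replayed to rebuild memory on startup.
    def __init__(self, log_path):
        self.log_path = log_path
        self.rows = []
        self._replay()

    def write(self, row):
        payload = zlib.compress(json.dumps(row).encode())
        with open(self.log_path, "ab") as log:
            log.write(struct.pack("<I", len(payload)))   # 4-byte length prefix
            log.write(payload)
        self.rows.append(row)

    def _replay(self):
        try:
            with open(self.log_path, "rb") as log:
                data = log.read()
        except FileNotFoundError:
            return                                        # fresh leaf: nothing to replay
        pos = 0
        while pos + 4 <= len(data):
            (size,) = struct.unpack_from("<I", data, pos)
            pos += 4
            if pos + size > len(data):
                break                                     # torn final write: stop here
            self.rows.append(json.loads(zlib.decompress(data[pos:pos + size])))
            pos += size
```

Because leaves share no state, each one can replay its own log independently after a restart.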
Aggregation
Queries are distributed to the leaves, and results are aggregated up a binary tree
(For now, there is no sorting of results. All aggregation operations must be commutative and associative.)
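The requirement that aggregation be commutative and associative follows from the tree. How partial results get paired depends on how many leaves answered and in what order. Below is a minimal sketch of a pairwise (binary tree) reduction, where `combine` is any function with those two properties.

`tree_aggregate` is a hypothetical helper.

```python
def tree_aggregate(partials, combine):
    # Combine partial results pairwise, level by level, like a binary tree of aggregators.
    level = list(partials)
    if not level:
        raise ValueError("no partial results")
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])     # odd one out moves up a level unchanged
        level = nxt
    return level[0]
```

Sum, count, min, and max all qualify. Subtraction does not: it would give different answers for different tree shapes.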
Operations
4 functions:
visit, summarize, combine, and finalize
Also a Hive-SQL-like query language interface
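The slides name four functions but do not define them. One plausible reading, shown here for AVG and labeled as an assumption:

- `visit` folds each row into a leaf-local state.
- `summarize` turns that state into the partial result a leaf sends upward.
- `combine` merges two partials.
- `finalize` computes the answer once, at the root.

`Avg` and `run_query` are hypothetical.

```python
from functools import reduce

class Avg:
    # Hypothetical AVG aggregator using the four function names from the slides.
    INITIAL = (0, 0)   # (sum, count); also the identity element for combine

    def __init__(self, column):
        self.column = column

    def visit(self, state, row):
        # Called once per row on a leaf; rows lacking the column are skipped.
        if self.column not in row:
            return state
        return (state[0] + row[self.column], state[1] + 1)

    def summarize(self, state):
        # Leaf-level partial result sent to the aggregator.
        return state

    def combine(self, a, b):
        # Commutative and associative, so the shape of the aggregation tree does not matter.
        return (a[0] + b[0], a[1] + b[1])

    def finalize(self, partial):
        total, count = partial
        return total / count if count else None

def run_query(agg, leaves):
    partials = []
    for rows in leaves:
        state = agg.INITIAL
        for row in rows:
            state = agg.visit(state, row)
        partials.append(agg.summarize(state))
    return agg.finalize(reduce(agg.combine, partials, agg.INITIAL))
```

Passing (sum, count) pairs upward, rather than per-leaf averages, is what keeps `combine` correct. The mean of leaf means is wrong whenever leaves hold different numbers of rows.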
Querying Front-end