A Sneak Peek into StumbleUpon’s Infrastructure
Quick SU Intro
Our Traffic
Our Stack: 100% Open-Source
• MySQL (legacy source of truth)
• Memcache (lots)
• HBase (most new apps / features)
• Hadoop (DWH, MapReduce, Hive, ...)
• elasticsearch (“you know, for search”)
• OpenTSDB (distributed monitoring)
• Varnish (HTTP load-balancing)
• Gearman (processing off the fast path)
• ... etc
In prod since ’09
The Infrastructure
[Network diagram: two Arista 7050 core switches (52 x 10GbE SFP+) feeding Arista 7048T top-of-rack switches (48 x 1GbE copper, 4 x 10GbE SFP+ uplinks); racks mix 1U and 2U chassis of “thin” and “thick” nodes; L3 ECMP between switches; MTU=9000]
The Infrastructure
• SuperMicro half-width motherboards
• 2 x Intel L5630 (40W TDP) (16 hardware threads total)
• 48GB RAM
• Commodity disks (consumer-grade SATA 7200rpm)
• 1 x 2TB per “thin node” (4-in-2U) (web/app servers, Gearman, etc.)
• 6 x 2TB per “thick node” (2-in-2U) (Hadoop/HBase, elasticsearch, etc.)
(86 nodes = 1PB)
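The 1PB figure is consistent with raw disk capacity. A quick sanity check, assuming the 86 nodes are all “thick nodes” with 6 x 2TB disks each (per the hardware slide above):

```python
# Sanity check of "(86 nodes = 1PB)", assuming all 86 are thick nodes
# with 6 x 2TB consumer SATA disks each.
nodes = 86
disks_per_node = 6
disk_tb = 2

raw_tb = nodes * disks_per_node * disk_tb
print(raw_tb)  # 1032 TB raw, i.e. just over 1 PB
```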
The Infrastructure
• No virtualization
• No oversubscription
• Rack locality doesn’t matter much (sub-100µs RTT across racks)
• cgroups / Linux containers to keep MapReduce under control
Two production HBase clusters per colo
• Low-latency (user-facing services)
• Batch (analytics, scheduled jobs, ...)
Low-Latency Cluster
• Workload mostly driven by HBase
• Very few scheduled MR jobs
• HBase replication to batch cluster
• Most queries from PHP over Thrift
Challenges:
• Tuning Hadoop for low latency
• Taming the long latency tail
• Quickly recovering from failures
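The slides don’t spell out why the long latency tail matters, but the standard fan-out argument applies: a user-facing request that issues many HBase reads hits the per-read p99 latency far more often than 1% of the time. A minimal illustration (the fan-out counts are hypothetical, not from the talk):

```python
# Probability that at least one of n_calls backend reads lands in the
# per-call tail (e.g. the slowest 1%), assuming independent calls.
def p_slow(n_calls, per_call_tail=0.01):
    return 1 - (1 - per_call_tail) ** n_calls

print(round(p_slow(1), 3))    # 0.01  -- a single read rarely hits the tail
print(round(p_slow(10), 3))   # ~0.096
print(round(p_slow(100), 3))  # ~0.634 -- most fan-out requests see a p99 read
```

This is why taming the tail (not just the median) is listed as a challenge for the user-facing cluster.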
Batch Cluster
• 2x more capacity
• Wildly changing workload (e.g. 40K → 14M QPS)
• Lots of scheduled MR jobs
• Frequent ad-hoc jobs (MR/Hive)
• OpenTSDB’s data: >800M data points added per day, 133B data points total

Challenges:
• Resource isolation
• Tuning for larger scale
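A back-of-envelope on the OpenTSDB figures above puts them in perspective: 800M points/day is a sustained write rate of roughly 9K points/sec, and the 133B total corresponds to about five and a half months of history at that rate.

```python
# Back-of-envelope on the OpenTSDB numbers from the slide.
points_per_day = 800e6
total_points = 133e9

writes_per_sec = points_per_day / 86_400   # seconds per day
days_of_history = total_points / points_per_day

print(round(writes_per_sec))   # ~9259 sustained writes/sec
print(round(days_of_history))  # ~166 days at the current ingest rate
```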
Questions?
Think this is cool? We’re hiring!