Maintaining a constantly updated large data set is a big challenge, not only to database administrators but also to developers, as it is hard to maintain and expand. It adds more stress when the requirement is to serve real-time data to heavy-traffic websites. In this presentation, we first examine the initial characteristics of AOL's Real Time News system, the design strategy, and how MySQL fits into the overall architecture. We then review the issues encountered and the solutions applied when the system characteristics changed due to ever-growing data set size and new query patterns. In addition to common MySQL design, troubleshooting, and performance-tuning techniques, we will also share a heuristic algorithm implemented in the application servers to reduce the response time of complex queries from hours to a few milliseconds.
Apr. 13, 2011 Presented to MySQL Conference
Building and Deploying Large Scale Real Time News System with
MySQL and Distributed Cache
Who am I?
Tao Cheng <[email protected]>, AOL Real Time News (RTN).
Worked on Mail and Browser clients in the '90s, then moved to web backend servers.
Not an expert but am happy to share my experience and brainstorm solutions.
Presentation for [CLIENT]
Page 2
Agenda
AOL Real Time News (RTN): what it is
Requirements
Technical solutions, with a focus on MySQL
Deployment topology
Operational monitoring
Metrics collection
Agenda
Tips for query tuning and optimization
Heuristic query optimization algorithm
Lessons learned
Q & A
Real Time News : background
AOL deployed its large-scale Real Time News (RTN) system in 2007. The system ingests and processes news from 30,000 sources every second, around the clock. Today its data store, MySQL, has accumulated several billion rows and terabytes of data. Even so, news is delivered to end users in close to real-time fashion. This presentation shares how it is done and the lessons learned.
Brief Intro: sample features
Data presentation: return the most recent news in
  flat view: the most recent news about an entity. An entity could be a person, a company, a sports team, etc.
  topic clusters: the most recent news grouped by topic. A topic is a group of news about an event, headline news, etc.
News filtering by
  source type: news, blogs, press releases, regional, etc.
  relevancy level (high, medium, low, etc.) to the entities.
Data delivery: push (to subscribers) and pull.
Search by entities, categories (National, Sports, Finance, etc.), topics, document ID, etc.
Requirements for Phase I (2006)
Commodity hardware: 4 CPUs, 16 GB RAM, 600 GB disk space.
Data ingestion rate: 250K docs/day; average document size: 5 KB.
Data retention period: 7 days to forever.
Estimated data set size: (1.25 GB/day, or ~456 GB/year) + space for indexes, schema changes, and optimization.
Response time: < 30 milliseconds/query.
Throughput: > 400 queries/sec/server.
Uptime: 99.999%.
Solutions: MySQL + Bucky
MySQL
  Serve raw/distinct queries.
  Back fill.
Bucky (AOL's distributed cache & computing framework)
  Write-ahead cache: pre-compute query results and push them into the cache.
  Messaging (optional): push data directly to subscribers.
Updates are pushed to data consumers or browsers via AIM Complex.
Updates go to both the database and the cache.
Architecture Diagram (over-simplified)
[Architecture diagram: Relegence feeds an Ingestor, which writes to the Asset DB and pushes updates through Gateways into the Distributed Cache; WWW front ends pull from the cache, and updates are pushed out via AIM.]
Data Model: SOR vs. Query DB
Separate query from storage to keep tables small and queries fast.
System of Record (SOR): holds all raw data.
  The authoritative data store; designed for data storage.
  Normalized schema: for simple key look-ups; no table joins.
Query DB: de-normalized for query speed.
  Avoid JOINs, reduce the number of trips to the DB, increase throughput.
Read/write a small chunk of data at a time so the database can get requests out quickly and process more.
Use replication to achieve linear scalability for reads.
Design Strategies: partitioning (Why)
Dataset too big to fit on one host.
Performance consideration: divide and conquer.
  Write: more masters (N×) to take writes.
  Read: smaller tables + more (N×M) slaves to handle reads.
Fault tolerance: distribute the risk and reduce the impact of a system failure.
Easier maintenance (size does matter):
  Faster nightly backups, disaster recovery, schema changes, etc.
  Faster optimization: optimization is needed to reclaim disk space after deletions and to rebuild indexes to improve query speed.
Design Strategies: partitioning (How)
Partition on the most-used keys (look at query patterns):
  Document table: on document ID.
  Entity table: on entity ID.
Simple hash on IDs: no partition map, thus no competition for read/write locks on yet another table.
Managing growth: add another partition set.
  New documents are written into both the old and new partition sets for a few weeks; then stop writing into the old partitions.
  Queries go to the new partitions first, and then to the old ones if insufficient results are found.
Works great in our case, but might not for everyone.
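The routing and dual-write growth scheme above can be sketched as follows. This is a minimal sketch: the partition counts, function names, and plain-modulo hash are illustrative assumptions, not AOL's actual implementation.

```python
# Sketch of "simple hash on IDs, no partition map" routing, plus the
# dual-write/read-fallback scheme used while adding a partition set.
# Partition counts and names are hypothetical.

OLD_PARTITIONS = 4   # original partition set
NEW_PARTITIONS = 8   # partition set added to absorb growth

def partition_for(doc_id: int, num_partitions: int) -> int:
    """Route a document to a partition by hashing its ID (plain modulo here)."""
    return doc_id % num_partitions

def write(doc_id: int, migrating: bool = True):
    """During migration, new documents go into both the old and new sets."""
    targets = [("new", partition_for(doc_id, NEW_PARTITIONS))]
    if migrating:
        targets.append(("old", partition_for(doc_id, OLD_PARTITIONS)))
    return targets

def read(doc_id, fetch_new, fetch_old, wanted=10):
    """Query the new partitions first; fall back to the old set if results are insufficient."""
    results = fetch_new(partition_for(doc_id, NEW_PARTITIONS))
    if len(results) < wanted:
        results += fetch_old(partition_for(doc_id, OLD_PARTITIONS))
    return results
```

Because the hash is computed from the ID itself, no lookup table is consulted on the hot path, which is what avoids lock contention on "yet another table".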
Schema design: De-normalization
Make query tables small: put only essential attributes in the de-normalized tables; store long text attributes in separate tables.
De-normalization: how to store and match attributes.
  Single-value attributes (1:1): document ID, short string, datetime, etc. One column, one row.
  Multi-value attributes (1:many): tricky but feasible.
    Multiple rows with a composite index/key: (c1, c2, etc.)
    One row, one column: a CSV string, e.g. "id1, id2, id3"; SQL: val LIKE '%id2%'
    One row but multiple columns, e.g. group1, group2, etc.; SQL: group1=val1 OR group2=val2 ...
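The first multi-value option (multiple rows with a composite index) can be sketched like this. SQLite stands in for MySQL so the snippet is self-contained, and the table and column names are illustrative, not AOL's schema.

```python
# Sketch of a 1:many attribute stored as multiple rows with a composite
# index: one row per (entity, document) pair. SQLite stands in for MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE doc_entity (
    entity_id INTEGER NOT NULL,
    pub_date  INTEGER NOT NULL,
    doc_id    INTEGER NOT NULL)""")
# Composite index: column order matters -- this serves (entity_id) and
# (entity_id, pub_date) lookups, but not pub_date alone.
conn.execute("CREATE INDEX idx_entity_date ON doc_entity (entity_id, pub_date)")
conn.executemany("INSERT INTO doc_entity VALUES (?, ?, ?)",
                 [(7, 100, 1), (7, 200, 2), (8, 150, 3)])

# Most recent docs for entity 7: an indexed range scan, no LIKE '%...%'.
rows = conn.execute("""SELECT doc_id FROM doc_entity
                       WHERE entity_id = 7 ORDER BY pub_date DESC""").fetchall()
```

Unlike the CSV-string option, this keeps the match sargable: the LIKE '%id2%' form cannot use an index and forces a scan.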
Tips for indexing
Simple key: for metadata retrieval.
Composite key: find matching documents.
  Start with the low-cardinality and most-used columns.
  Order matters: (c1, c2, c3) != (c2, c3, c1)
InnoDB: all secondary indexes contain the primary key.
  Make the primary key short to keep index sizes small.
  Queries using a secondary index reference the primary key too.
Integer vs. string: comparison of numeric values is faster => index hash values of long strings instead.
Index length: title:varchar(255) => idx_title(32)
Enforce referential integrity on the application side.
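The "index the hash of a long string" tip can be sketched as follows: store a compact integer hash next to the long string and index that column. CRC32 is one possible choice (an assumption here, the deck does not name one); since hashes can collide, the query re-checks the full string after the indexed lookup.

```python
# Sketch: index an integer hash of a long string instead of the string itself.
# Integer comparison is cheaper and the index entries are far smaller.
import zlib

def string_hash(s: str) -> int:
    """Compact integer hash of a long string (CRC32; collisions possible)."""
    return zlib.crc32(s.encode("utf-8"))

long_title = "Some very long news headline " * 8

# Hypothetical lookup: the indexed title_hash narrows the candidates; the
# full title comparison removes false positives from hash collisions.
sql = ("SELECT doc_id FROM docs "
       "WHERE title_hash = ? AND title = ?")
params = (string_hash(long_title), long_title)
```

The `docs`/`title_hash` names are hypothetical; the same idea underlies the prefix-index tip (idx_title(32)), which trades a shorter index for the same kind of post-filtering.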
MySQL configuration
Storage engine: InnoDB (row-level locking).
Table space: one file per table.
  Easier to maintain (schema changes, optimization, etc.)
Character set: 'UTF-8'.
Disable persistent connections (5.0.x).
skip-character-set-client-handshake
Enable the slow query log to identify bad queries.
System variables for memory buffer sizes:
  innodb_buffer_pool_size: data and indexes.
  sort_buffer_size, max_heap_table_size, tmp_table_size.
  query_cache_size=0: tables are updated constantly.
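Put together, the settings above might look like the following my.cnf fragment. The buffer sizes are placeholders to be tuned per host, not AOL's production values, and variable names follow MySQL 5.1.

```ini
[mysqld]
default-storage-engine = InnoDB
innodb_file_per_table  = 1     # one tablespace file per table
character-set-server   = utf8
skip-character-set-client-handshake
slow_query_log         = 1     # identify bad queries
long_query_time        = 1
innodb_buffer_pool_size = 8G   # data and indexes (placeholder size)
sort_buffer_size        = 2M
max_heap_table_size     = 64M
tmp_table_size          = 64M
query_cache_size        = 0    # tables are updated constantly
```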
Runtime statistics (per server)
Average write rate: daily < 40 tps; max 400 tps during recovery.
  Performs best when the write rate is < 100 tps.
Query rate: 20~80 qps.
Query response time (shorter when indexes and data are in memory):
  75th percentile: ~3 ms when qps < 15; ~2 ms when qps ~= 60.
  95th percentile: 6~8 ms when qps < 15; 3~4 ms when qps ~= 60.
CPU idle %: > 99%.
Deployment Topology Consideration
• Minimum configuration: host/DC redundancy
  • DC1: host 1 (master), host 3 (slave)
  • DC2: host 2 (failover master), host 4 (slave)
• Data locality: significant when network latency is a concern (100 Mbps)
  • 3,000 qps when the DB is on a remote host.
  • 15,000 qps when the DB is on the local host.
• Linking dependent servers across data centers
  • Push cross-DC links up as far as possible (Topology 3): link to dependent servers in the same data center.
Deployment Topology 1: minimum config
[Diagram: WWW and a data consumer in front of master/slave DB pairs spanning Data Center 1 and Data Center 2.]
Topology 2: link across DCs (bad)
[Diagram: data consumers behind GSLB and VIPs link to DB slaves in the other data center, creating cross-DC dependencies.]
Topology 3: link to same DC (better)
[Diagram: data consumers behind GSLB and VIPs link to DB slaves in their own data center; only replication crosses data centers.]
Topology 4: use local UNIX socket
[Diagram: each data consumer is stacked on the same host as its DB and connects over the local UNIX socket; GSLB and VIPs route client traffic between data centers.]
Production Monitoring
Operational monitoring: logcheck, Scout/NOC alerts, etc.
DB monitoring: replication failures, latency, read/write rates, performance metrics.
Metrics Collection
Graphing collected metrics: visualize and collate operational metrics.
  Helps analyze and fine-tune server performance.
  Helps trace production issues and identify points of failure.
What metrics are important?
  Host: CPU, MEM, disk I/O, network I/O, number of processes, swap/paging.
  Server: throughput, response time.
Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.
Tuning and Optimizing Queries
EXPLAIN: mysql> EXPLAIN SELECT ... FROM ...
  Watch out for temp table usage, table scans, etc.
  Use SQL_NO_CACHE to bypass the query cache while testing.
MySQL query profiler: mysql> SET profiling=1;
Linux OS cache: leave enough free memory on the host.
USE INDEX hint to choose an index explicitly.
  Use wisely: most of the time, MySQL chooses the right index for you. But when table size grows, index cardinality might change.
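A typical tuning session combining these tools might look like the following transcript. The table and column names are hypothetical; the commands are standard MySQL.

```sql
-- Hypothetical session: check how a query will run before deploying it.
mysql> EXPLAIN SELECT doc_id FROM doc_entity
    ->   WHERE entity_id = 7 ORDER BY pub_date DESC LIMIT 10;
-- Watch the "type", "key", and "Extra" columns for "Using temporary",
-- "Using filesort", or a full table scan.

mysql> SET profiling = 1;
mysql> SELECT SQL_NO_CACHE doc_id FROM doc_entity WHERE entity_id = 7;
mysql> SHOW PROFILES;            -- wall-clock time per statement
mysql> SHOW PROFILE FOR QUERY 1; -- time spent in each execution stage
```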
Important MySQL statistics
SHOW GLOBAL STATUS ...
  Qcache_free_blocks
  Qcache_free_memory
  Qcache_hits
  Qcache_inserts
  Qcache_lowmem_prunes
  Qcache_not_cached
  Qcache_queries_in_cache
  Select_scan
  Sort_scan
Important MySQL statistics (cont.)
  Table_locks_waited
  Innodb_row_lock_current_waits
  Innodb_row_lock_time
  Innodb_row_lock_time_avg
  Innodb_row_lock_time_max
  Innodb_row_lock_waits
  Select_scan
  Slave_open_temp_tables
Heuristic Query Optimization Algorithm
Primarily for complex cluster queries: find the latest N topics and their related stories.
Strategy: reduce the number of records the database needs to load from disk to perform a query.
  Pick a default query range. If insufficient docs are returned, expand the query range proportionally.
  If none are returned => sparse data => drop the range and retry.
  Save the query range for future reference.
Result: reduces the number of rows to process from millions to hundreds => cuts query time from minutes to less than 10 ms.
[Flow chart, Query Engine: a cluster query first checks the query-range look-up table. If a saved range exists it is used; otherwise the default range is used and NumOfTripToDB is set to 0. The query is bounded with the range and sent to the DB. If sufficient results come back, the docs-to-range ratio is computed and saved back to the look-up table for future use, and the results are returned to clients. If no results come back, or NumOfTripToDB >= 2, the range is dropped and the original query is sent to the DB; otherwise the docs-to-range ratio is prorated to a range that would return a sufficient number of docs, NumOfTripToDB is incremented, and the bounded query is retried.]
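The range heuristic can be sketched as follows. This is a simplified sketch, not the production code: the ratio bookkeeping, default range, and two-trip limit are condensed assumptions based on the flow described above.

```python
# Sketch of the heuristic range algorithm: bound the query by a saved range,
# widen it proportionally when too few docs come back, and fall back to the
# original unbounded query for sparse data or after two trips to the DB.

saved_ranges = {}    # query key -> last range that returned enough docs
DEFAULT_RANGE = 24   # hypothetical default window (e.g. hours)

def cluster_query(key, run_query, wanted):
    """run_query(range_or_None) -> list of docs; None means unbounded."""
    rng = saved_ranges.get(key, DEFAULT_RANGE)
    trips = 0
    while trips < 2:
        docs = run_query(rng)
        trips += 1
        if len(docs) >= wanted:
            saved_ranges[key] = rng   # save the range for future queries
            return docs[:wanted]
        if not docs:
            break                     # sparse data: give up on bounding
        # Prorate: widen the range by the docs-to-range shortfall ratio.
        rng = rng * wanted / len(docs)
    # Two insufficient trips (or zero results): send the original query.
    return run_query(None)[:wanted]
```

The payoff is that the common case scans only the bounded window (hundreds of rows) instead of the full history (millions), at the cost of an occasional extra trip to the DB.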
Lessons Learned
Always load test well ahead of launch (2 weeks) to avoid a fire drill.
Don't rely solely on the cache. The database needs to be able to serve a reasonable amount of queries on its own.
Separate the cache from the applications to avoid cold starts.
Keep transactions/queries simple so they return fast.
Avoid table joins; limit a join to 2 tables if really needed.
Avoid stored procedures: results are not cached, and a DBA is needed when altering the implementation.
Lessons Learned (cont.)
Avoid using OFFSET in the LIMIT clause; use application-based pagination instead.
Avoid SQL_CALC_FOUND_ROWS in SELECT.
If possible, exclude text/blob columns from query results to avoid disk I/O.
Store text/blob columns in a separate table to speed up backups, optimization, and schema changes.
Separate real-time vs. archive data for better performance and easier maintenance.
Keep table size under control (< 100 GB); optimize periodically.
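The first tip above is usually implemented as keyset pagination: instead of LIMIT n OFFSET m, which makes the server read and discard m rows, the application remembers the last row seen and continues from there. A minimal sketch, with SQLite standing in for MySQL and an illustrative schema:

```python
# Sketch of application-based ("keyset") pagination: page by remembering the
# last (pub_date, doc_id) seen rather than using a growing OFFSET.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, pub_date INTEGER)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(i, 1000 + i) for i in range(1, 8)])

def page(after=None, size=3):
    """Fetch the next page of newest-first docs after a (pub_date, doc_id) cursor."""
    if after is None:
        return conn.execute(
            "SELECT pub_date, doc_id FROM docs "
            "ORDER BY pub_date DESC, doc_id DESC LIMIT ?", (size,)).fetchall()
    return conn.execute(
        "SELECT pub_date, doc_id FROM docs "
        "WHERE (pub_date, doc_id) < (?, ?) "      # row-value comparison
        "ORDER BY pub_date DESC, doc_id DESC LIMIT ?",
        (after[0], after[1], size)).fetchall()
```

With an index on (pub_date, doc_id), every page is an indexed range scan whose cost does not grow with page depth, unlike OFFSET.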
Lessons Learned (cont.)
Put SQL statements (templates) in resource files so you can tune them without a binary change.
Set up replication in dev & QA to catch replication issues earlier:
  Statement-based (MySQL 5.0.x) vs. row/mixed-based (5.1 or above) replication.
  Auto-increment + (INSERT ... ON DUPLICATE KEY UPDATE ...)
  Datetime columns defaulting to NOW().
  Oversized data: increase max_allowed_packet.
  Replication lag: transactions that involve index updates/deletions often take longer to complete.
Host and data center redundancy is important: don't put all your eggs in one basket.
RTN 3 Redesign
Free-text search with SOLR.
  Real-time vs. archive shards.
  1-minute latency without a ramdisk.
Asset DB partitioned: 5 rows/doc -> 25 rows/doc.
Avoid (system) virtual machines; instead, stack high-end hosts with processes that use different system resources (CPU, MEM, disk space, etc.)
  Better network and system resource utilization: cost effective.
  Data locality.
More processors (< 12) help when under load.
Q & A
Questions or comments?
THANK YOU !!