Maintaining a constantly updated large data set is a big challenge, not only to database administrators but also to developers, as it is hard to maintain and expand. It adds more stress when the requirement is to serve real-time data to heavy-traffic websites. In this presentation, we first examine the initial characteristics of AOL's Real Time News system, the design strategy, and how MySQL fits into the overall architecture. We then review the issues encountered and the solutions applied when the system characteristics changed due to ever-growing data set size and new query patterns. In addition to common MySQL design, troubleshooting, and performance-tuning techniques, we will also share a heuristic algorithm implemented in the application servers to reduce the response time of complex queries from hours to a few milliseconds.
Apr. 13, 2011 Presented to MySQL Conference
Building and Deploying Large Scale Real Time News System with
MySQL and Distributed Cache
Who am I?
Tao Cheng <[email protected]>, AOL Real Time News (RTN).
Worked on Mail and Browser clients in the '90s, then moved to web backend servers.
Not an expert but am happy to share my experience and brainstorm solutions.
Presentation for [CLIENT]
Page 2
Agenda
AOL Real Time News (RTN): what it is
Requirements
Technical solutions, with a focus on MySQL
Deployment topology
Operational monitoring
Metrics collection
Agenda
Tips for query tuning and optimization
Heuristic query optimization algorithm
Lessons learned
Q & A
Real Time News : background
AOL deployed its large-scale Real Time News (RTN) system in 2007. The system ingests and processes news from 30,000 sources every second, around the clock. Today its data store, MySQL, has accumulated several billion rows and terabytes of data. Even so, news is delivered to end users in close to real-time fashion. This presentation shares how it is done and the lessons learned.
Brief Intro: sample features
Data presentation: return the most recent news in
  flat view: the most recent news about an entity. An entity could be a person, a company, a sports team, etc.
  topic clusters: the most recent news grouped by topic. A topic is a group of news about an event, headline news, etc.
News filtering by
  source type: news, blogs, press releases, regional, etc.
  relevancy level (high, medium, low, etc.) to the entities.
Data delivery: push (to subscribers) and pull.
Search by entities, categories (National, Sports, Finance, etc.), topics, document ID, etc.
Requirements for Phase I (2006)
Commodity hardware: 4 CPUs, 16 GB RAM, 600 GB disk space.
Data ingestion rate: 250K docs/day; average document size: 5 KB.
Data retention period: 7 days to forever.
Estimated data set size: (1.25 GB/day, or ~456 GB/year) + space for indexes, schema changes, and optimization.
Response time: < 30 milliseconds/query.
Throughput: > 400 queries/sec/server.
Uptime: 99.999%.
Solutions: MySQL + Bucky
MySQL
  Serve raw/distinct queries.
  Back fill.
Bucky (AOL's distributed cache & computing framework)
  Write-ahead cache: pre-compute query results and push them into the cache.
  Messaging (optional): push data directly to subscribers.
Updates are pushed to data consumers or browsers via AIM Complex.
Updates go to both the database and the cache.
Architecture Diagram (over-simplified)
[Architecture diagram: Relegence feeds an Ingestor, which writes to the Asset DB and pushes updates through Gateways into the Distributed Cache; WWW front ends pull from the cache, and updates are pushed out via AIM.]
Data Model: SOR vs. Query DB
Separate query from storage to keep tables small and queries fast.
System of Record (SOR): holds all raw data.
  The authoritative data store; designed for data storage.
  Normalized schema: for simple key look-ups; no table joins.
Query DB: de-normalized for query speed.
  Avoid JOINs, reduce the number of trips to the DB, increase throughput.
Read/write a small chunk of data at a time so the database can get requests out quickly and process more.
Use replication to achieve linear scalability for reads.
Design Strategies: partitioning (Why)
Dataset too big to fit on one host.
Performance consideration: divide and conquer.
  Write: more masters (N×) to take writes.
  Read: smaller tables + more (N×M) slaves to handle reads.
Fault tolerance: distribute the risk and reduce the impact of a system failure.
Easier maintenance (size does matter):
  Faster nightly backups, disaster recovery, schema changes, etc.
  Faster optimization: optimization is needed to reclaim disk space after deletions and to rebuild indexes to improve query speed.
Design Strategies: partitioning (How)
Partition on the most-used keys (look at query patterns):
  Document table: on document ID.
  Entity table: on entity ID.
Simple hash on IDs: no partition map, thus no competition for read/write locks on yet another table.
Managing growth: add another partition set.
  New documents are written into both the old and new partition sets for a few weeks; then stop writing into the old partitions.
  Queries go to the new partitions first, and then to the old ones if insufficient results are found.
Works great in our case, but might not for everyone.
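The routing and dual-write growth scheme above can be sketched as follows. This is a minimal sketch: the partition counts, function names, and plain-modulo hash are illustrative assumptions, not AOL's actual implementation.

```python
# Sketch of "simple hash on IDs, no partition map" routing, plus the
# dual-write/read-fallback scheme used while adding a partition set.
# Partition counts and names are hypothetical.

OLD_PARTITIONS = 4   # original partition set
NEW_PARTITIONS = 8   # partition set added to absorb growth

def partition_for(doc_id: int, num_partitions: int) -> int:
    """Route a document to a partition by hashing its ID (plain modulo here)."""
    return doc_id % num_partitions

def write(doc_id: int, migrating: bool = True):
    """During migration, new documents go into both the old and new sets."""
    targets = [("new", partition_for(doc_id, NEW_PARTITIONS))]
    if migrating:
        targets.append(("old", partition_for(doc_id, OLD_PARTITIONS)))
    return targets

def read(doc_id, fetch_new, fetch_old, wanted=10):
    """Query the new partitions first; fall back to the old set if results are insufficient."""
    results = fetch_new(partition_for(doc_id, NEW_PARTITIONS))
    if len(results) < wanted:
        results += fetch_old(partition_for(doc_id, OLD_PARTITIONS))
    return results
```

Because the hash is computed from the ID itself, no lookup table is consulted on the hot path, which is what avoids lock contention on "yet another table".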
Schema design: De-normalization
Make query tables small: put only essential attributes in the de-normalized tables; store long text attributes in separate tables.
De-normalization: how to store and match attributes.
  Single-value attributes (1:1): document ID, short string, datetime, etc. One column, one row.
  Multi-value attributes (1:many): tricky but feasible.
    Multiple rows with a composite index/key: (c1, c2, etc.)
    One row, one column: a CSV string, e.g. "id1, id2, id3"; SQL: val LIKE '%id2%'
    One row but multiple columns, e.g. group1, group2, etc.; SQL: group1=val1 OR group2=val2 ...
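The first multi-value option (multiple rows with a composite index) can be sketched like this. SQLite stands in for MySQL so the snippet is self-contained, and the table and column names are illustrative, not AOL's schema.

```python
# Sketch of a 1:many attribute stored as multiple rows with a composite
# index: one row per (entity, document) pair. SQLite stands in for MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE doc_entity (
    entity_id INTEGER NOT NULL,
    pub_date  INTEGER NOT NULL,
    doc_id    INTEGER NOT NULL)""")
# Composite index: column order matters -- this serves (entity_id) and
# (entity_id, pub_date) lookups, but not pub_date alone.
conn.execute("CREATE INDEX idx_entity_date ON doc_entity (entity_id, pub_date)")
conn.executemany("INSERT INTO doc_entity VALUES (?, ?, ?)",
                 [(7, 100, 1), (7, 200, 2), (8, 150, 3)])

# Most recent docs for entity 7: an indexed range scan, no LIKE '%...%'.
rows = conn.execute("""SELECT doc_id FROM doc_entity
                       WHERE entity_id = 7 ORDER BY pub_date DESC""").fetchall()
```

Unlike the CSV-string option, this keeps the match sargable: the LIKE '%id2%' form cannot use an index and forces a scan.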
Tips for indexing
Simple key: for metadata retrieval.
Composite key: find matching documents.
  Start with the low-cardinality and most-used columns.
  Order matters: (c1, c2, c3) != (c2, c3, c1)
InnoDB: all secondary indexes contain the primary key.
  Make the primary key short to keep index sizes small.
  Queries using a secondary index reference the primary key too.
Integer vs. string: comparison of numeric values is faster => index hash values of long strings instead.
Index length: title:varchar(255) => idx_title(32)
Enforce referential integrity on the application side.
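The "index the hash of a long string" tip can be sketched as follows: store a compact integer hash next to the long string and index that column. CRC32 is one possible choice (an assumption here, the deck does not name one); since hashes can collide, the query re-checks the full string after the indexed lookup.

```python
# Sketch: index an integer hash of a long string instead of the string itself.
# Integer comparison is cheaper and the index entries are far smaller.
import zlib

def string_hash(s: str) -> int:
    """Compact integer hash of a long string (CRC32; collisions possible)."""
    return zlib.crc32(s.encode("utf-8"))

long_title = "Some very long news headline " * 8

# Hypothetical lookup: the indexed title_hash narrows the candidates; the
# full title comparison removes false positives from hash collisions.
sql = ("SELECT doc_id FROM docs "
       "WHERE title_hash = ? AND title = ?")
params = (string_hash(long_title), long_title)
```

The `docs`/`title_hash` names are hypothetical; the same idea underlies the prefix-index tip (idx_title(32)), which trades a shorter index for the same kind of post-filtering.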
MySQL configuration
Storage engine: InnoDB (row-level locking).
Table space: one file per table.
  Easier to maintain (schema changes, optimization, etc.)
Character set: 'UTF-8'.
Disable persistent connections (5.0.x).
skip-character-set-client-handshake
Enable the slow query log to identify bad queries.
System variables for memory buffer sizes:
  innodb_buffer_pool_size: data and indexes.
  sort_buffer_size, max_heap_table_size, tmp_table_size.
  query_cache_size=0: tables are updated constantly.
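Put together, the settings above might look like the following my.cnf fragment. The buffer sizes are placeholders to be tuned per host, not AOL's production values, and variable names follow MySQL 5.1.

```ini
[mysqld]
default-storage-engine = InnoDB
innodb_file_per_table  = 1     # one tablespace file per table
character-set-server   = utf8
skip-character-set-client-handshake
slow_query_log         = 1     # identify bad queries
long_query_time        = 1
innodb_buffer_pool_size = 8G   # data and indexes (placeholder size)
sort_buffer_size        = 2M
max_heap_table_size     = 64M
tmp_table_size          = 64M
query_cache_size        = 0    # tables are updated constantly
```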
Runtime statistics (per server)
Average write rate: daily < 40 tps; max 400 tps during recovery.
  Performs best when the write rate is < 100 tps.
Query rate: 20~80 qps.
Query response time (shorter when indexes and data are in memory):
  75th percentile: ~3 ms when qps < 15; ~2 ms when qps ~= 60.
  95th percentile: 6~8 ms when qps < 15; 3~4 ms when qps ~= 60.
CPU idle %: > 99%.
Deployment Topology Consideration
• Minimum configuration: host/DC redundancy
  • DC1: host 1 (master), host 3 (slave)
  • DC2: host 2 (failover master), host 4 (slave)
• Data locality: significant when network latency is a concern (100 Mbps)
  • 3,000 qps when the DB is on a remote host.
  • 15,000 qps when the DB is on the local host.
• Linking dependent servers across data centers
  • Push cross-DC links up as far as possible (Topology 3): link to dependent servers in the same data center.
Deployment Topology 1: minimum config
[Diagram: WWW and a data consumer in front of master/slave DB pairs spanning Data Center 1 and Data Center 2.]
Topology 2: link across DCs (bad)
[Diagram: data consumers behind GSLB and VIPs link to DB slaves in the other data center, creating cross-DC dependencies.]
Topology 3: link to same DC (better)
[Diagram: data consumers behind GSLB and VIPs link to DB slaves in their own data center; only replication crosses data centers.]
Topology 4: use local UNIX socket
[Diagram: each data consumer is stacked on the same host as its DB and connects over the local UNIX socket; GSLB and VIPs route client traffic between data centers.]
Production Monitoring
Operational monitoring: logcheck, Scout/NOC alerts, etc.
DB monitoring: replication failures, latency, read/write rates, performance metrics.
Metrics Collection
Graphing collected metrics: visualize and collate operational metrics.
  Helps analyze and fine-tune server performance.
  Helps trace production issues and identify points of failure.
What metrics are important?
  Host: CPU, MEM, disk I/O, network I/O, number of processes, swap/paging.
  Server: throughput, response time.
Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.
Tuning and Optimizing Queries
EXPLAIN: mysql> EXPLAIN SELECT ... FROM ...
  Watch out for temp table usage, table scans, etc.
  Use SQL_NO_CACHE to bypass the query cache while testing.
MySQL query profiler: mysql> SET profiling=1;
Linux OS cache: leave enough free memory on the host.
USE INDEX hint to choose an index explicitly.
  Use wisely: most of the time, MySQL chooses the right index for you. But when table size grows, index cardinality might change.
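A typical tuning session combining these tools might look like the following transcript. The table and column names are hypothetical; the commands are standard MySQL.

```sql
-- Hypothetical session: check how a query will run before deploying it.
mysql> EXPLAIN SELECT doc_id FROM doc_entity
    ->   WHERE entity_id = 7 ORDER BY pub_date DESC LIMIT 10;
-- Watch the "type", "key", and "Extra" columns for "Using temporary",
-- "Using filesort", or a full table scan.

mysql> SET profiling = 1;
mysql> SELECT SQL_NO_CACHE doc_id FROM doc_entity WHERE entity_id = 7;
mysql> SHOW PROFILES;            -- wall-clock time per statement
mysql> SHOW PROFILE FOR QUERY 1; -- time spent in each execution stage
```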
Important MySQL statistics
SHOW GLOBAL STATUS ...
  Qcache_free_blocks
  Qcache_free_memory
  Qcache_hits
  Qcache_inserts
  Qcache_lowmem_prunes
  Qcache_not_cached
  Qcache_queries_in_cache
  Select_scan
  Sort_scan
Important MySQL statistics (cont.)
  Table_locks_waited
  Innodb_row_lock_current_waits
  Innodb_row_lock_time
  Innodb_row_lock_time_avg
  Innodb_row_lock_time_max
  Innodb_row_lock_waits
  Select_scan
  Slave_open_temp_tables
Heuristic Query Optimization Algorithm
Primarily for complex cluster queries: find the latest N topics and their related stories.
Strategy: reduce the number of records the database needs to load from disk to perform a query.
  Pick a default query range. If insufficient docs are returned, expand the query range proportionally.
  If none are returned => sparse data => drop the range and retry.
  Save the query range for future reference.
Result: reduces the number of rows to process from millions to hundreds => cuts query time from minutes to less than 10 ms.
[Flow chart, Query Engine: a cluster query first checks the query-range look-up table. If a saved range exists it is used; otherwise the default range is used and NumOfTripToDB is set to 0. The query is bounded with the range and sent to the DB. If sufficient results come back, the docs-to-range ratio is computed and saved back to the look-up table for future use, and the results are returned to clients. If no results come back, or NumOfTripToDB >= 2, the range is dropped and the original query is sent to the DB; otherwise the docs-to-range ratio is prorated to a range that would return a sufficient number of docs, NumOfTripToDB is incremented, and the bounded query is retried.]
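The range heuristic can be sketched as follows. This is a simplified sketch, not the production code: the ratio bookkeeping, default range, and two-trip limit are condensed assumptions based on the flow described above.

```python
# Sketch of the heuristic range algorithm: bound the query by a saved range,
# widen it proportionally when too few docs come back, and fall back to the
# original unbounded query for sparse data or after two trips to the DB.

saved_ranges = {}    # query key -> last range that returned enough docs
DEFAULT_RANGE = 24   # hypothetical default window (e.g. hours)

def cluster_query(key, run_query, wanted):
    """run_query(range_or_None) -> list of docs; None means unbounded."""
    rng = saved_ranges.get(key, DEFAULT_RANGE)
    trips = 0
    while trips < 2:
        docs = run_query(rng)
        trips += 1
        if len(docs) >= wanted:
            saved_ranges[key] = rng   # save the range for future queries
            return docs[:wanted]
        if not docs:
            break                     # sparse data: give up on bounding
        # Prorate: widen the range by the docs-to-range shortfall ratio.
        rng = rng * wanted / len(docs)
    # Two insufficient trips (or zero results): send the original query.
    return run_query(None)[:wanted]
```

The payoff is that the common case scans only the bounded window (hundreds of rows) instead of the full history (millions), at the cost of an occasional extra trip to the DB.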
Lessons Learned
Always load test well ahead of launch (2 weeks) to avoid a fire drill.
Don't rely solely on the cache. The database needs to be able to serve a reasonable amount of queries on its own.
Separate the cache from the applications to avoid cold starts.
Keep transactions/queries simple so they return fast.
Avoid table joins; limit a join to 2 tables if really needed.
Avoid stored procedures: results are not cached, and a DBA is needed when altering the implementation.
Lessons Learned (cont.)
Avoid using OFFSET in the LIMIT clause; use application-based pagination instead.
Avoid SQL_CALC_FOUND_ROWS in SELECT.
If possible, exclude text/blob columns from query results to avoid disk I/O.
Store text/blob columns in a separate table to speed up backups, optimization, and schema changes.
Separate real-time vs. archive data for better performance and easier maintenance.
Keep table size under control (< 100 GB); optimize periodically.
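The first tip above is usually implemented as keyset pagination: instead of LIMIT n OFFSET m, which makes the server read and discard m rows, the application remembers the last row seen and continues from there. A minimal sketch, with SQLite standing in for MySQL and an illustrative schema:

```python
# Sketch of application-based ("keyset") pagination: page by remembering the
# last (pub_date, doc_id) seen rather than using a growing OFFSET.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, pub_date INTEGER)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(i, 1000 + i) for i in range(1, 8)])

def page(after=None, size=3):
    """Fetch the next page of newest-first docs after a (pub_date, doc_id) cursor."""
    if after is None:
        return conn.execute(
            "SELECT pub_date, doc_id FROM docs "
            "ORDER BY pub_date DESC, doc_id DESC LIMIT ?", (size,)).fetchall()
    return conn.execute(
        "SELECT pub_date, doc_id FROM docs "
        "WHERE (pub_date, doc_id) < (?, ?) "      # row-value comparison
        "ORDER BY pub_date DESC, doc_id DESC LIMIT ?",
        (after[0], after[1], size)).fetchall()
```

With an index on (pub_date, doc_id), every page is an indexed range scan whose cost does not grow with page depth, unlike OFFSET.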
Lessons Learned (cont.)
Put SQL statements (templates) in resource files so you can tune them without a binary change.
Set up replication in dev & QA to catch replication issues earlier:
  Statement-based (MySQL 5.0.x) vs. row/mixed-based (5.1 or above) replication.
  Auto-increment + (INSERT ... ON DUPLICATE KEY UPDATE ...)
  Datetime columns defaulting to NOW().
  Oversized data: increase max_allowed_packet.
  Replication lag: transactions that involve index updates/deletions often take longer to complete.
Host and data center redundancy is important: don't put all your eggs in one basket.
RTN 3 Redesign
Free-text search with SOLR.
  Real-time vs. archive shards.
  1-minute latency without a ramdisk.
Asset DB partitioned: 5 rows/doc -> 25 rows/doc.
Avoid (system) virtual machines; instead, stack high-end hosts with processes that use different system resources (CPU, MEM, disk space, etc.)
  Better network and system resource utilization: cost effective.
  Data locality.
More processors (< 12) help when under load.
Q & A
Questions or comments?
THANK YOU !!