Designing An Evolving Database Service with Presto
Taro L. Saito leo@tresaure-data.com
Oct 6th, 2015. Presto Meetup @ Boston
Presto Usage at Treasure Data
100~ customers are actively using Presto
30,000~ Presto queries every day
Importing 1,000,000~ records / sec.
Import, Store, Analyze with Presto/Hive, Export
Mobile and Web Sources
Mobile SDKs
JavaScript SDK (web access logs)
Stream Sources
Streaming
Apache logs, nginx logs, syslog, JSON logs
JSON
Existing Data Sources
Bulk Import
Data files (CSV, TSV, etc.), MySQL, PostgreSQL, Oracle
Embedded Devices
Collect data from embedded Linux, serial devices, MQTT, XBee radio, etc.
Import data, now.
Treasure Data Architecture
[Diagram: many small log files arrive in Real-Time Storage; a log merge job (Hadoop MapReduce) merges them into 1-hour partitions (2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00) in Archive Storage.]
Time column-based partitioning
Hive, Presto (distributed SQL query engine)
Storage: S3 (AWS), Riak CS (IDCF)
Columnar format
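The time column-based partitioning above can be sketched as follows. This is a minimal illustration, assuming records carry a UNIX `time` column in seconds and that partitions are labelled by the UTC start of their hour; `partition_key` and `merge_into_partitions` are hypothetical names, not Treasure Data's actual merge job.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(unix_time: int) -> str:
    """Floor a UNIX timestamp (seconds) to the start of its hour, in UTC."""
    floored = unix_time - unix_time % 3600
    start = datetime.fromtimestamp(floored, tz=timezone.utc)
    return start.strftime("%Y-%m-%d %H:00:00")


def merge_into_partitions(records):
    """Group records into 1-hour partitions keyed by their time column."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_key(record["time"])].append(record)
    return dict(partitions)


# Sample records reuse the timestamps shown on the columnar store slide.
records = [
    {"time": 1412380700, "user": 1},
    {"time": 1412381000, "user": 2, "status": 200},
    {"time": 1412390000, "user": "U01", "status": 200},
]
partitions = merge_into_partitions(records)
for key in sorted(partitions):
    print(key, len(partitions[key]))
```

A query that filters on a time range then only needs to read the partitions whose hour overlaps that range.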
JSON data: {time: 1412380700, user: 1}
Additional column: {time: 1412381000, user: 2, status: 200}
Type escalation (int -> string): {time: 1412390000, user: U01, status: 200}
MessagePack: a fast and compact JSON-like format
Auto type conversion between the table schema and MessagePack types
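The schema behaviour described above (new columns appear as records arrive, and a column widens from int to string when a string value shows up) might look like this in miniature. It is a sketch under simplifying assumptions: only int and string values are handled, and `escalate`, `infer_schema`, and `read_column` are hypothetical helpers, not the real store's API.

```python
def escalate(current, value):
    """Return the column type after observing value; int widens to string."""
    observed = "string" if isinstance(value, str) else "int"
    if current is None or current == observed:
        return observed
    return "string"  # conflicting types: escalate int -> string


def infer_schema(records):
    """Build a column -> type mapping, adding columns as they first appear."""
    schema = {}
    for record in records:
        for column, value in record.items():
            schema[column] = escalate(schema.get(column), value)
    return schema


def read_column(records, column, column_type):
    """Read one column, converting stored values to the schema type."""
    cast = str if column_type == "string" else int
    return [cast(r[column]) if column in r else None for r in records]


records = [
    {"time": 1412380700, "user": 1},
    {"time": 1412381000, "user": 2, "status": 200},
    {"time": 1412390000, "user": "U01", "status": 200},
]
schema = infer_schema(records)
print(schema)
print(read_column(records, "user", schema["user"]))
print(read_column(records, "status", schema["status"]))
```

Records written before a column existed simply yield a missing value for it, so older data never has to be rewritten when the schema grows.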
Extensible Columnar Store
Use Cases
E-COMMERCE
BEFORE
AFTER
Biggest Mobile Shopping
WISH.COM
Reduced costs
Scalability
Single data warehouse
GAMING
BEFORE
AFTER
Daily Upload Delay of 1-2 days
2500+ servers
Real-time
2500+ servers
1 Billion records/day
Reduced TCO
Real-time collection
Real-time access to KPIs
Top 10 globally; 40M+ users
x 20
AD TECH
Publishers Dashboard Advertisers Dashboard
800 B/month
Live in 2 weeks with 1 engineer!
300% growth
Europe's largest mobile ad exchange
More than 50 billion impressions/month
LOYALTY
Aggregation
E-Commerce
Marketing Campaigns; Promotions
Customer Segmentation
A/B Testing
Challenges
Handle huge query result output
SELECT * / CREATE TABLE AS / INSERT INTO
Parallel result upload to S3
Bypass JSON result generation at the coordinator
td-presto connector
Accesses the MessagePack-based columnar store
Handles S3 access retry / pipelining
Future: better query plan visualization
Quickly find the performance bottleneck and memory-consuming tasks
Future: storing intermediate query results to disks
Process large joins, query resource limitation
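The S3 access retry mentioned above is commonly handled with exponential backoff. The sketch below shows that general technique only; it is not the td-presto connector's code. `TransientError`, `with_retry`, and `flaky_fetch` are hypothetical names, and the simulated failure stands in for a real storage timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable storage error."""


def with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation, retrying on TransientError with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait a random time up to base_delay * 2^(attempt - 1) seconds.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


attempts = []


def flaky_fetch():
    """Simulated read that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("simulated timeout")
    return b"column-chunk"


# The sleep is replaced with a no-op so the demonstration runs instantly.
result = with_retry(flaky_fetch, sleep=lambda seconds: None)
print(result, len(attempts))
```

Randomizing the delay keeps many workers that failed at the same moment from retrying in lockstep against the same storage endpoint.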
Extensible Schema SQL via Hive, Presto
Unlimited Users, Queries
Enterprise Apps
Data Science Tools
REST API
Ingestion: Streaming, Bulk
BI Tools
treasuredata.com/request_demo