kbajda
Presto SQL Engine: what’s new? Strata Hadoop 2016 San Jose, CA
What is Presto?
100% open source distributed SQL query engine
Originally developed by Facebook
Key Differentiators:
Performance & Scale
Cross platform query capability, not only SQL on Hadoop
Apache licensed, hosted on GitHub
Certified distro & support from Teradata
Brief history of Presto
FALL 2008: Facebook open sources Hive
FALL 2012: 6 developers start Presto development
SPRING 2013: Presto rolled out within Facebook
FALL 2013: Facebook open sources Presto
FALL 2014: 88 releases, 41 contributors, 3943 commits
SPRING 2016: 141 releases, 116 contributors, 6879 commits
Presto in Production
• Facebook
– Multiple production clusters (100s of nodes total)
  - Massive 300PB Hadoop data warehouse
  - Very large sharded MySQL installation
  - Growing usage of Raptor SSD-based storage
– 1000s of internal daily active users
– 10-100s of concurrent queries
• Netflix
– Over 200-node production cluster on EC2
– Over 25 PB in S3 (Parquet format)
– Over 350 active users and 3K queries daily
Presto Architecture
[Diagram: a client connects to the coordinator (parser/analyzer, planner, scheduler), which drives several workers. Three pluggable APIs are shown: metadata API, data location API, and data stream API (one per worker).]
Presto Extensibility – connectors
[Diagram: the coordinator (parser/analyzer, planner, scheduler) and a worker, with three connector APIs: metadata API, data location API, and data stream API. Each API is implemented per connector: HDFS / S3, NoSQL, DBMS, Custom, …]
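The three connector APIs named on this slide can be illustrated with a small Java sketch. All type and method names below (MetadataApi, DataLocationApi, DataStreamApi, InMemoryConnector, scan) are hypothetical and exist only to mirror the slide; they are not Presto's actual SPI. The sketch assumes that planning consults the metadata and data location APIs, and that rows are then read split by split through the data stream API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: these types mirror the three APIs named on the
// slide and are NOT Presto's real SPI interfaces.
class ConnectorSketch {
    // Resolve the schema, list the splits, then read each split.
    // In a real cluster each split could be read by a different worker.
    static List<Object[]> scan(MetadataApi metadata, DataLocationApi locations,
                               DataStreamApi streams, String table) {
        List<String> columns = metadata.columns(table);      // metadata API
        List<Object[]> result = new ArrayList<>();
        for (String split : locations.splits(table)) {       // data location API
            for (Object[] row : streams.read(split)) {       // data stream API
                if (row.length != columns.size()) {
                    throw new IllegalStateException("row does not match schema");
                }
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        InMemoryConnector connector = new InMemoryConnector();
        List<Object[]> rows = scan(connector, connector, connector, "users");
        System.out.println(rows.size() + " rows scanned");
    }
}

interface MetadataApi {
    List<String> columns(String table);     // column names of a table
}

interface DataLocationApi {
    List<String> splits(String table);      // independent chunks of the table
}

interface DataStreamApi {
    List<Object[]> read(String split);      // rows stored in one chunk
}

// Toy connector serving a single two-column table from memory, in two splits.
class InMemoryConnector implements MetadataApi, DataLocationApi, DataStreamApi {
    public List<String> columns(String table) {
        return Arrays.asList("id", "name");
    }

    public List<String> splits(String table) {
        return Arrays.asList(table + "#0", table + "#1");
    }

    public List<Object[]> read(String split) {
        List<Object[]> rows = new ArrayList<>();
        if (split.endsWith("#0")) {
            rows.add(new Object[] {1L, "alice"});
            rows.add(new Object[] {2L, "bob"});
        } else {
            rows.add(new Object[] {3L, "carol"});
        }
        return rows;
    }
}
```

Keeping the three interfaces separate means a new data source only has to answer three questions (what the table looks like, where its chunks are, and how to read one chunk) without touching the parser, planner, or scheduler.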
Supported data sources & file formats
• Hadoop/Hive connector & file formats:
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Open source data stores:
– MySQL & PostgreSQL (non-parallel)
– Cassandra
– Kafka
– Redis
• In development by community:
– MongoDB
– ElasticSearch
– HBase
Presto – Query Execution Performance
• In-memory processing
• Pipelined execution across nodes MPP-style
• Vectorized columnar processing
• Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient flat-memory data structures (minimizes GC)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
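The points about columnar processing and flat-memory data structures can be made concrete with a short Java sketch. The class LongColumn and its methods are hypothetical and are not Presto's actual block or page classes. The sketch assumes a column of 64-bit integers stored as one primitive array plus a null mask, so the inner loops touch no per-row objects.

```java
// Hypothetical sketch: not Presto's actual block or page classes.
class ColumnarSketch {
    public static void main(String[] args) {
        LongColumn price = new LongColumn(
                new long[] {10, 20, 0, 30},
                new boolean[] {false, false, true, false});
        System.out.println("sum = " + price.sum());
        System.out.println("values > 15: " + price.countGreaterThan(15));
    }
}

// One column stored as flat primitive arrays instead of a list of row objects.
final class LongColumn {
    private final long[] values;     // column values, one slot per row
    private final boolean[] isNull;  // true where the row has no value

    LongColumn(long[] values, boolean[] isNull) {
        if (values.length != isNull.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.values = values;
        this.isNull = isNull;
    }

    // Tight loop over primitives: no boxing and no allocation per row.
    long sum() {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            if (!isNull[i]) {
                total += values[i];
            }
        }
        return total;
    }

    // A simple filter evaluated over the whole column at once.
    int countGreaterThan(long threshold) {
        int count = 0;
        for (int i = 0; i < values.length; i++) {
            if (!isNull[i] && values[i] > threshold) {
                count++;
            }
        }
        return count;
    }
}
```

Because the data lives in two arrays rather than in one object per row, the garbage collector has almost nothing to track, and the loops are simple enough for the JIT compiler to optimize aggressively.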
ANSI SQL Support
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
In addition:
• Windowing functions
• Statistical and approximate aggregate functions
• UNNEST, TABLESAMPLE
In development:
• Complex subqueries
• EXISTS, INTERSECT, EXCEPT
• ROLLUP, CUBE
Deployment models
• Cluster deployment models for Presto:
– on premise (appliance or commodity clusters)
– VM (OpenStack, etc.)
– cloud (Amazon, etc.)
• Types of Hadoop deployments:
– on Hadoop/YARN cluster (all or subset of nodes)
– on a dedicated cluster
– mixed
Open source initiative
• Announced in June 2015 at Hadoop Summit
– Growing interest and adoption
• Collaboration with Facebook and Presto community
– Joint development, conference talks, meetups and webinars
• Major commitment from Teradata Labs:
– 20 full-time engineers
– Free and open source contributions
– Enterprise-ready distribution
"A special shout out goes to Teradata — which joined the Presto community this year with a focus on enhancing enterprise features and providing support — for having seven of our top 10 external contributors." — Facebook
Teradata Contributions to Presto
Phase 1: Implement (June 8, 2015)
Phase 2: Integrate (Q4 2015)
Phase 3: Proliferate (2016)
• Installer
• Documentation
• Monitoring & Support Tools
• Management Tool Integration
• YARN Integration
• ODBC Driver
• JDBC Driver
• BI Certification
• Security
• Cloud features
Also shown: Commercial Support; Expanding ANSI SQL Coverage
Recent developments and roadmap
• Q1 release:
– Fully-featured ODBC & JDBC drivers
– Kerberos support
– DECIMAL support
• Later 2016:
– BI tools certification
– TPC-H and TPC-DS unmodified
– Spill to disk
BI Tools certifications
Teradata QueryGrid™ - Multi-System Analytics
[Diagram: QueryGrid entry points and targets. Systems shown: Teradata Database, Aster Analytics, Presto, Hadoop (Hive / HDFS), other databases, NoSQL databases. Categories shown: Integrated Data Warehouses; Multi-Genre Advanced Analytics™; Multiple Hadoop SQL Query Engines and Distributions; 3rd Party Relational DBs; Non-Relational DBs.]
[Presto connectors shown: Apache Kafka, Apache Cassandra, MySQL, PostgreSQL, Presto API, Redis. Legend: Teradata Certified, Community Supported.]
More information
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto Users Group: www.groups.google.com/group/presto-users
GitHub:
www.github.com/prestodb/presto
www.github.com/Teradata/presto
www.github.com/prestodb
www.teradata.com/presto