Presto SQL Engine for Big Data Dain Sundstrom @daindumb Saturday, February 22, 14


Presto
SQL Engine for Big Data

Dain Sundstrom @daindumb

Saturday, February 22, 14


Dain Sundstrom
▪ Founder of Presto
▪ Engineer at Facebook
▪ Background in Distributed Computing
▪ Founder, Apache Geronimo
▪ Original JBoss Group member
▪ JVM internals / performance


Why build Presto?


“A good day is when I can run ! Hive queries”

— a Facebook data scientist


“How do I join live data with Hive data?”

— every Facebook engineer


“How do I use Tableau with our data?”

— a Facebook analyst


What is Presto?


What is Presto?
▪ Distributed SQL Analytics Engine
▪ Optimized for low-latency, interactive analysis
▪ Extensible connector system
▪ ANSI SQL


Facebook Scale
▪ We have lots of:
  ▪ Users (with varying skill)
  ▪ Data (hot, cold, live)
  ▪ Machines (CPU heavy, disk heavy)
  ▪ Clusters (multiple regions)
  ▪ Custom stores (with custom query languages)


Architecture


Architecture

[Diagram: Client; Coordinator (Parser/Analyzer, Planner, Scheduler, Metadata API, Data Location API); three Workers; Data Stream API]


Main Design Points
▪ Specialized SQL operators
▪ Keep data in memory during execution
▪ Adaptive scheduling and execution
▪ Bytecode generation
▪ Efficient flat-memory data structures
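
To make the "flat-memory data structures" point concrete, here is a minimal sketch of a string column stored as one byte buffer plus an offsets array instead of one object per value. The class and method names (`FlatStringColumn`, `count_equal`) are hypothetical and are not taken from Presto's code; the slides do not describe the actual layout.

```python
from array import array


class FlatStringColumn:
    """Variable-length strings kept in one byte buffer plus an offsets array."""

    def __init__(self, values):
        # offsets[i] and offsets[i + 1] delimit the bytes of value i.
        self.offsets = array("q", [0])
        self.data = bytearray()
        for value in values:
            self.data += value.encode("utf-8")
            self.offsets.append(len(self.data))

    def __len__(self):
        return len(self.offsets) - 1

    def get(self, i):
        start, end = self.offsets[i], self.offsets[i + 1]
        return self.data[start:end].decode("utf-8")

    def count_equal(self, needle):
        """Count matching rows by comparing raw bytes, without decoding each row."""
        target = needle.encode("utf-8")
        view = memoryview(self.data)
        hits = 0
        for i in range(len(self)):
            start, end = self.offsets[i], self.offsets[i + 1]
            if end - start == len(target) and view[start:end] == target:
                hits += 1
        return hits


column = FlatStringColumn(["hive", "presto", "hive", "hbase"])
```

The whole column lives in two contiguous buffers, so a scan touches memory sequentially and creates no per-row objects until a value is actually requested with `get`.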


Pluggable Stores

[Diagram: Coordinator (Parser/Analyzer, Planner, Scheduler) and Worker. The Metadata API, Data Location API, and Data Stream API each have pluggable implementations: Internal, HBase, Cassandra, JMX, Hive.]
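
The three pluggable APIs can be sketched as one small interface. This is an illustration only: the method names (`columns`, `splits`, `rows`), the `Split` record, and the in-memory store are assumptions made for the example, not Presto's actual connector interface. The sketch also assumes that planning uses metadata, scheduling uses data locations, and workers read data streams.

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol


@dataclass(frozen=True)
class Split:
    """Hypothetical unit of work: one chunk of a table and the host holding it."""
    table: str
    host: str
    chunk: int


class Connector(Protocol):
    # Metadata API: which columns does a table have?
    def columns(self, table: str) -> List[str]: ...

    # Data Location API: where do the pieces of a table live?
    def splits(self, table: str) -> List[Split]: ...

    # Data Stream API: read the rows of one piece.
    def rows(self, split: Split) -> Iterator[tuple]: ...


class InMemoryConnector:
    """Toy store: each table is a list of chunks, each chunk a list of rows."""

    def __init__(self, schemas, chunks):
        self._schemas = schemas
        self._chunks = chunks

    def columns(self, table):
        return self._schemas[table]

    def splits(self, table):
        count = len(self._chunks[table])
        return [Split(table, f"worker-{i}", i) for i in range(count)]

    def rows(self, split):
        yield from self._chunks[split.table][split.chunk]


def scan(connector: Connector, table: str, column: str) -> list:
    index = connector.columns(table).index(column)   # planning: resolve the column
    result = []
    for split in connector.splits(table):            # scheduling: one task per split
        # worker: stream the rows of this split and project one column
        result.extend(row[index] for row in connector.rows(split))
    return result


store = InMemoryConnector(
    schemas={"events": ["user", "clicks"]},
    chunks={"events": [[("ann", 3), ("bob", 5)], [("cat", 7)]]},
)
clicks = scan(store, "events", "clicks")
```

Because `scan` only depends on the three methods, a store with a different backend could be swapped in without touching the query side, which is the point of the pluggable design.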


Scaling Issues


Zero Sum
▪ Where should we run Presto?
  ▪ Add machines in neighbor cluster?
  ▪ Add new rack to Hive cluster?
  ▪ Run Presto next to Hive processes?


Large Tables
▪ We have tables with:
  ▪ Millions of partitions
  ▪ Single partition with PB of data
  ▪ !"k columns
  ▪ #MB+ column values


Batch to Interactive
▪ Batch systems hide bugs with retry
▪ Batch systems are managed differently
▪ Monitoring and alerting too slow
▪ Restart whole machines not processes
▪ Batch systems run near capacity


Cross Region Joins?
▪ Data silos tend to be in different clusters (rooms) and regions
▪ Currently users manually copy data
▪ Unsolved issue


What’s Next


Approximate Queries

10-100x faster
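
The slide does not say how approximate queries would be implemented. One common technique is sampling: aggregate a random fraction of the rows and scale the result. The sketch below assumes that approach; `approx_sum`, its parameters, and the data are made up for illustration.

```python
import random


def approx_sum(values, rate, seed=0):
    """Estimate sum(values) from a Bernoulli sample taken at the given rate."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    sampled_total = 0
    for value in values:
        if rng.random() < rate:        # keep each row with probability `rate`
            sampled_total += value
    return sampled_total / rate        # scale the sample back up


values = range(100_000)
exact = sum(values)
estimate = approx_sum(values, rate=0.1)
relative_error = abs(estimate - exact) / exact
```

This toy version still visits every row, so it shows only the accuracy side of the trade. In a real engine the saving would come from not reading or processing the rows that fall outside the sample.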


Index Joins
▪ HBase and Cassandra are really indexes
▪ For HBase:
  ▪ Allow for multiple indexes per table
  ▪ API extensions to make this fast
▪ How do you decide to use an index?
▪ Most stores have no statistics
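
The idea behind an index join can be sketched in a few lines: instead of scanning the indexed table, look up each probe row's key in it. A plain dictionary stands in for a key-value store such as HBase or Cassandra here; the function name, tables, and data are hypothetical.

```python
def index_join(probe_rows, key_of, index_lookup):
    """Inner join: for each probe row, fetch its matches by key from an index."""
    for row in probe_rows:
        for match in index_lookup(key_of(row)):
            yield row + match


# Stand-in for a key-value store: user id -> list of (name, country) rows.
user_index = {
    1: [("ann", "US")],
    2: [("bob", "DE")],
}

# Probe side: (user id, page) rows. User 3 has no entry in the index.
clicks = [(1, "home"), (2, "search"), (1, "cart"), (3, "home")]

joined = list(
    index_join(
        clicks,
        key_of=lambda row: row[0],
        index_lookup=lambda key: user_index.get(key, []),
    )
)
```

The cost is one lookup per probe row, which pays off when the probe side is small relative to the indexed table. Without statistics the engine cannot easily tell when that holds, which is the open question the slide raises.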


Type System
▪ Plugins can add scalar types
▪ Add functions for types
▪ Operator overloading
▪ Integrate with ordering and grouping
▪ Custom memory encoding
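
As a rough illustration of these bullets, the sketch below registers a scalar type from a "plugin", attaches a function to it, overloads an operator, and makes the type usable in sorting and grouping. The registries (`TYPES`, `FUNCTIONS`), the decorators, and the `IPv4` type are all invented for this example and do not reflect Presto's plugin API.

```python
from dataclasses import dataclass
from functools import total_ordering

TYPES = {}       # type name -> class
FUNCTIONS = {}   # (function name, type name) -> callable


def register_type(name):
    def decorator(cls):
        TYPES[name] = cls
        return cls
    return decorator


def register_function(name, type_name):
    def decorator(fn):
        FUNCTIONS[(name, type_name)] = fn
        return fn
    return decorator


@register_type("ipv4")
@total_ordering
@dataclass(frozen=True)            # frozen dataclass: equality and hashing for grouping
class IPv4:
    packed: int                    # custom encoding: the address as one integer

    @classmethod
    def parse(cls, text):
        a, b, c, d = (int(part) for part in text.split("."))
        return cls((a << 24) | (b << 16) | (c << 8) | d)

    def __lt__(self, other):       # ordering
        return self.packed < other.packed

    def __add__(self, offset):     # operator overloading: address + integer
        return IPv4(self.packed + offset)


@register_function("to_text", "ipv4")
def ipv4_to_text(value):
    p = value.packed
    return ".".join(str((p >> shift) & 0xFF) for shift in (24, 16, 8, 0))


addresses = [IPv4.parse("10.0.0.2"), IPv4.parse("10.0.0.1"), IPv4.parse("10.0.0.2")]
ordered = sorted(addresses)        # uses the type's ordering
groups = {}
for address in addresses:          # uses the type's equality and hash
    groups[address] = groups.get(address, 0) + 1
to_text = FUNCTIONS[("to_text", "ipv4")]
```

The engine side only needs the registries: it can look up a type or function by name without knowing anything about IP addresses.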


Upcoming
▪ Large joins and aggregations
▪ Structural types
▪ HBase connector
▪ IPython
▪ ODBC


Presto
prestodb.io
github.com/facebook/presto

Dain Sundstrom
@daindumb
github.com/dain
