Drilling into Data with Apache Drill
Tomer Shiran, Apache Drill Founder and PMC Member
Jacques Nadeau, Apache Drill PMC Chair

Tomer Shiran (@tshiran): Drill founder and PMC Member, MapR VP Product
Jacques Nadeau (@intjesus): Drill PMC Chair (VP, Apache Drill)
Apache Drill
• Open source SQL query engine for non-relational datastores
  – JSON document model
  – Columnar
• Key advantages:
  – Query any non-relational datastore
  – No overhead (creating and maintaining schemas, transforming data, …)
  – Treat your data like a table even when it’s not
  – Keep using the BI tools you love
  – Scales from one laptop to 1000s of servers
  – Great performance and scalability
Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements.”
Andrew Brust
Any Non-Relational Datastore
• File systems
  – Traditional: Local files and NAS
  – Hadoop: HDFS and MapR-FS
  – Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
  – MongoDB
  – HBase
  – MapR-DB
  – Hive
• And you can add new datastores
Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
  – Tableau
  – Qlik
  – MicroStrategy
  – TIBCO Spotfire
  – Excel
• Command line (Drill shell)
• Web and mobile apps
  – Many JSON-powered chart libraries (see D3.js)
• SAS, R, …
Drill Integrates With What You Have
Achieving “End-to-End Performance”
Execute fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up
Iterate fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle
JSON Model, Columnar Speed
[Figure: data formats plotted on two axes, schema-less vs. fixed schema and flat vs. complex. Labels shown: JSON, BSON, Mongo; HBase, NoSQL; Parquet, Avro; CSV, TSV.]

RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:
{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos}
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC}
Apache Drill Provides the Best of Both Worlds
Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore
Even When Your Data Doesn’t
• Path-based queries and wildcards
  – select * from /my/logs/
  – select * from /revenue/*/q2
• Modern data types
  – Map, Array, Any
• Complex functions and relational operators
  – FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON sensor analytics
• Complex data analysis
• Alternative DSLs
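To make two of the complex functions named above more concrete, here is a minimal Python sketch of what kvgen and repeated_count do to a single value. The function bodies are illustrative stand-ins, not Drill's implementation; the votes map is the one from the Yelp user record used later in this deck, and the hobbies array is from the earlier slide.

```python
from typing import Any, Dict, List


def kvgen(m: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a map with arbitrary keys into a list of {"key", "value"} maps.

    This mirrors the idea behind kvgen: keys that vary from record to
    record become data, so they can be flattened and filtered like rows.
    """
    return [{"key": k, "value": v} for k, v in m.items()]


def repeated_count(values: List[Any]) -> int:
    """Number of elements in a repeated (array) value."""
    return len(values)


votes = {"funny": 1, "useful": 5, "cool": 0}
pairs = kvgen(votes)
hobby_count = repeated_count(["ski", "soccer"])

for pair in pairs:
    print(pair)
print(hobby_count)
```

In a query, the output of kvgen is typically passed to FLATTEN so that each key/value pair becomes its own row.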
Why? To Support the Changing Data Organization
Data Dev Circa 2000
1. Developer comes up with requirements
2. DBA defines tables
3. DBA defines indices
4. DBA defines FK relationships
5. Developer stores data
6. BI builds reports
7. Analyst views reports
8. DBA adds materialized views
Data Today
1. Developer builds app, defines schema, stores data
2. Analyst queries data
3. Data engineer fixes performance problems or fills functionality gaps
HOW DOES IT WORK?
Everything Starts With a Drillbit…
• High performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
Drillbit: single process (daemon or CLI)
Data Lake, More Like Data Maelstrom
[Diagram: an environment mixing clustered servers and desktops: an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster (mongod nodes), a Cassandra cluster, a Windows desktop, and a Mac desktop.]
Run Drillbits Wherever; Whatever Your Data
[Diagram: the same environment with Drillbits running alongside the HDFS, HBase, mongod, and Cassandra nodes and on the Windows and Mac desktops.]
Connect to Any Drillbit with ODBC, JDBC, C, Java, REST
1. User connects to Drillbit
2. That Drillbit becomes Foreman
  – Foreman generates execution plan
  – Cost-based query optimization & locality
3. Execution fragments are farmed to other Drillbits
4. Drillbits exchange data as necessary to guarantee relational algebra
5. Results are returned to user through Foreman
[Diagram: a user connected to one Drillbit (the foreman), which works with two other Drillbits.]
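The foreman flow above can be pictured with a toy, single-process Python sketch: each "fragment" computes a partial COUNT(*) per star rating over its own slice of the data, and the foreman merges the partial results. This is only an illustration of the scatter/gather idea under simplifying assumptions (no planner, no network, no cost model); the function names and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def run_fragment(rows: Iterable[Dict[str, int]]) -> Counter:
    """Work done by one Drillbit: a partial COUNT(*) grouped by stars."""
    return Counter(row["stars"] for row in rows)


def foreman(partitions: List[List[Dict[str, int]]]) -> List[Tuple[int, int]]:
    """Hand one fragment to each partition, then merge the partial counts."""
    # In a real cluster these fragments would run in parallel on other Drillbits.
    partials = [run_fragment(partition) for partition in partitions]
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return sorted(total.items())  # plays the role of ORDER BY stars


# Hypothetical data, split the way files or blocks might be split across nodes.
partitions = [
    [{"stars": 5}, {"stars": 4}, {"stars": 5}],
    [{"stars": 1}, {"stars": 5}],
    [{"stars": 4}],
]
result = foreman(partitions)
print(result)
```

Partial aggregation keeps the data exchanged between nodes small: each fragment sends one count per group instead of every row.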
ANALYZING YELP DATA
1. DOWNLOAD AND INSTALL DRILL
Run Drill in Embedded Mode (drill-embedded)
$ tar xf apache-drill-1.0.0.tar.gz
$ cd apache-drill-1.0.0
$ bin/drill-embedded
> SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;
+----------------+----------------------------------+---------------+-------+
| yelping_since  | votes                            | review_count  | name  |
+----------------+----------------------------------+---------------+-------+
| 2012-02        | {"funny":1,"useful":5,"cool":0}  | 6             | Lee   |
+----------------+----------------------------------+---------------+-------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode
• Web UI is available at localhost:8047
Review the Query Profile in the Web UI (localhost:8047)
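The port that serves the Web UI also serves Drill's REST interface, so a query can be submitted without the shell. The sketch below builds such a request using only the Python standard library. It assumes an embedded Drill on localhost:8047 and the /query.json endpoint with a queryType/query JSON body; check both against the documentation for your Drill version. The helper names are hypothetical, and nothing is sent until run_query is called.

```python
import json
import urllib.request

# Assumption: embedded Drill running on this machine with the default web port.
DRILL_URL = "http://localhost:8047/query.json"


def build_request(sql: str, url: str = DRILL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str) -> list:
    """Send the query and return the rows; needs a running Drillbit."""
    with urllib.request.urlopen(build_request(sql)) as response:
        return json.load(response)["rows"]


request = build_request(
    "SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a Drillbit running, run_query("SELECT ...") would return the result rows as a list of dictionaries keyed by column name.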
Run Drill in Distributed Mode
$ zkServer start # ZooKeeper maintains the list of drillbits in the cluster
$ bin/drillbit.sh start # conf/drill-override.conf includes cluster name and ZK nodes
$ bin/drill-conf # or bin/drill-localhost to skip ZK lookup
> SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars;
+--------+---------+
| stars  | EXPR$1  |
+--------+---------+
| 1      | 110772  |
| 2      | 102737  |
| 3      | 163761  |
| 4      | 342143  |
| 5      | 406045  |
+--------+---------+
5 rows selected (3.739 seconds)
2. CONFIGURE DATASTORES (STORAGE PLUGINS)
Enable MongoDB Storage Plugin
Define Workspaces in the File Storage Plugin
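Storage plugins are configured as JSON documents in the Web UI. As a rough guide to what the two slides above set up, the sketch below builds plausible configurations as Python dictionaries and prints them as JSON. The field names follow the general shape of Drill's mongo and file plugin configurations as best recalled, so treat them as assumptions to verify. The mongod address and the tmp location are assumed; the demo location is inferred from the queries later in this deck, where dfs.demo.`yelp/review.json` and dfs.root.`/Users/tshiran/data/yelp/review.json` name the same file.

```python
import json

# Assumption: a local mongod on the default port.
mongo_plugin = {
    "type": "mongo",
    "connection": "mongodb://localhost:27017/",
    "enabled": True,
}

# Assumption: local file system with three workspaces.
dfs_plugin = {
    "type": "file",
    "enabled": True,
    "connection": "file:///",
    "workspaces": {
        "root": {"location": "/", "writable": False, "defaultInputFormat": None},
        "demo": {
            "location": "/Users/tshiran/data",
            "writable": False,
            "defaultInputFormat": None,
        },
        # Views are created in dfs.tmp later on, so it must be writable.
        "tmp": {"location": "/tmp", "writable": True, "defaultInputFormat": None},
    },
    "formats": {
        "json": {"type": "json"},
        "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
    },
}

mongo_json = json.dumps(mongo_plugin, indent=2)
dfs_json = json.dumps(dfs_plugin, indent=2)
print(mongo_json)
print(dfs_json)
```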
3. EXPLORE THE DATA
The Data: Files
{ "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}
The Data: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin  (empty)
local  0.078GB
yelp   0.453GB
> use yelp
> db.users.findOne()
{
    "_id" : ObjectId("54566cdf3237149de181a92a"),
    "yelping_since" : "2012-02",
    "votes" : {
        "funny" : 1,
        "useful" : 5,
        "cool" : 0
    },
    "review_count" : 6,
    "name" : "Lee",
    "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
    "friends" : [ ]
}
Are There More 5-Star or 1-Star Reviews?
> SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars;
+--------+---------+
| stars  | EXPR$1  |
+--------+---------+
| 1      | 110772  |
| 2      | 102737  |
| 3      | 163761  |
| 4      | 342143  |
| 5      | 406045  |
+--------+---------+
5 rows selected (3.739 seconds)
Using Storage Plugins and Workspaces
> SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json` LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;

In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and yelp/review.json is the path relative to the workspace.

Storage Plugin  Workspace  Table
dfs             Path       Path relative to workspace
mongo           Database   Collection
hive            Database   Table
hbase           Namespace  Table
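The naming scheme in the table above can be sketched as a small Python helper that splits a three-part name and labels each part according to its storage plugin. The helper names and the regular expression are illustrative only; Drill's own parser handles many more cases (for example, names shortened after a USE statement).

```python
import re
from typing import Dict, Tuple

# What the second and third name parts mean for each plugin (from the table).
ROLES: Dict[str, Tuple[str, str]] = {
    "dfs": ("path", "path relative to workspace"),
    "mongo": ("database", "collection"),
    "hive": ("database", "table"),
    "hbase": ("namespace", "table"),
}

# A name part is either backtick-quoted or a bare run without dots.
_PART = re.compile(r"`([^`]*)`|([^.`]+)")


def split_name(qualified: str) -> Tuple[str, str, str]:
    """Split plugin.workspace.table, honoring backtick quoting."""
    parts = [quoted or bare for quoted, bare in _PART.findall(qualified)]
    if len(parts) != 3:
        raise ValueError(f"expected three name parts, got {parts!r}")
    return parts[0], parts[1], parts[2]


def describe(qualified: str) -> Dict[str, str]:
    """Label each part of a fully qualified name by its role."""
    plugin, workspace, table = split_name(qualified)
    workspace_role, table_role = ROLES[plugin]
    return {"storage plugin": plugin, workspace_role: workspace, table_role: table}


file_table = describe("dfs.demo.`yelp/review.json`")
mongo_table = describe("mongo.yelp.users")
print(file_table)
print(mongo_table)
```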
Most Common User Names (MongoDB)
> SELECT name, count(*) AS users FROM mongo.yelp.users GROUP BY name ORDER BY users DESC LIMIT 10;
+------------+------------+
| name       | users      |
+------------+------------+
| David      | 2453       |
| John       | 2378       |
| Michael    | 2322       |
| Chris      | 2202       |
| Mike       | 2037       |
| Jennifer   | 1867       |
| Jessica    | 1463       |
| Jason      | 1457       |
| Michelle   | 1439       |
| Brian      | 1436       |
+------------+------------+
Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses FROM dfs.demo.`/yelp/business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state      | city       | businesses  |
+------------+------------+-------------+
| NV         | Las Vegas  | 12021       |
| AZ         | Phoenix    | 7499        |
| AZ         | Scottsdale | 3605        |
| EDH        | Edinburgh  | 2804        |
| AZ         | Mesa       | 2041        |
| AZ         | Tempe      | 2025        |
| NV         | Henderson  | 1914        |
| AZ         | Chandler   | 1637        |
| WI         | Madison    | 1630        |
| AZ         | Glendale   | 1196        |
+------------+------------+-------------+
3. EXPLORING COMPLEX DATA
business.json (1)
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
business.json (2)
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false,
                 "dinner": true, "breakfast": false, "brunch": false}
  }
}
Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Saturday.`open` < '22:00' AND b.hours.Saturday.`close` > '22:00' LIMIT 2;
+------------------------------+------------------------------------------------+
| name                         | hours                                          |
+------------------------------+------------------------------------------------+
| Chang Jiang Chinese Kitchen  | {"Saturday":{"close":"22:30","open":"11:00"}}  |
| Grand China Restaurant       | {"Saturday":{"close":"23:00","open":"11:00"}}  |
+------------------------------+------------------------------------------------+
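The query above compares times as strings, which works because zero-padded "HH:MM" values sort in the same order as the times they represent. One case it does not cover is a business that closes at or after midnight: the Mon Ami Gabi record shown earlier closes at "00:00" on Saturday, and "00:00" > '22:00' is false. The Python sketch below contrasts the slide's predicate with a hypothetical variant that treats a closing time at or before the opening time as "closes after midnight". The function names are illustrative, and the variant is one possible rule, not something the deck prescribes.

```python
def slide_predicate(open_time: str, close_time: str, now: str) -> bool:
    """The WHERE clause from the slide: open < now AND close > now."""
    return open_time < now and close_time > now


def is_open(open_time: str, close_time: str, now: str) -> bool:
    """Variant that also handles closing times past midnight."""
    if close_time > open_time:
        # Opens and closes on the same day.
        return open_time <= now < close_time
    # Closing time wraps past midnight (for example 07:00 to 00:00).
    return now >= open_time or now < close_time


same_day = ("11:00", "22:30")   # Chang Jiang Chinese Kitchen, Saturday
overnight = ("07:00", "00:00")  # Mon Ami Gabi, Saturday

print(slide_predicate(*same_day, "22:00"), is_open(*same_day, "22:00"))
print(slide_predicate(*overnight, "22:00"), is_open(*overnight, "22:00"))
```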
It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, b.hours.Friday AS friday, categories FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2;
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| name                           | friday                            | categories                                                   |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| Olives                         | {"close":"22:30","open":"11:00"}  | ["Mediterranean","Restaurants"]                              |
| Marrakech Moroccan Restaurant  | {"close":"23:00","open":"17:30"}  | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"]  |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
Flatten Repeated Values
> SELECT name, categories FROM dfs.demo.`yelp/business.json` LIMIT 3;
+-----------------------------+-------------------------------------------+
| name                        | categories                                |
+-----------------------------+-------------------------------------------+
| Eric Goldberg, MD           | ["Doctors","Health & Medical"]            |
| Pine Cone Restaurant        | ["Restaurants"]                           |
| Deforest Family Restaurant  | ["American (Traditional)","Restaurants"]  |
+-----------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories FROM dfs.demo.`yelp/business.json` LIMIT 5;
+-----------------------------+-------------------------+
| name                        | categories              |
+-----------------------------+-------------------------+
| Eric Goldberg, MD           | Doctors                 |
| Eric Goldberg, MD           | Health & Medical        |
| Pine Cone Restaurant        | Restaurants             |
| Deforest Family Restaurant  | American (Traditional)  |
| Deforest Family Restaurant  | Restaurants             |
+-----------------------------+-------------------------+
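The effect of FLATTEN can be reproduced in a few lines of Python: every element of the repeated column yields its own output row, and the other columns are copied alongside it. This is a sketch of the semantics only, using the three businesses from the output above; the flatten function here is a stand-in, not Drill's operator.

```python
from typing import Any, Dict, Iterator, List


def flatten(rows: List[Dict[str, Any]], column: str) -> Iterator[Dict[str, Any]]:
    """Emit one row per element of the array stored in `column`."""
    for row in rows:
        for value in row[column]:
            # Copy the row, replacing the array with a single element.
            yield {**row, column: value}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {
        "name": "Deforest Family Restaurant",
        "categories": ["American (Traditional)", "Restaurants"],
    },
]

flat = list(flatten(businesses, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```

Once the array is flattened, ordinary GROUP BY and ORDER BY apply to the category values.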
Most and Least Common Business Categories
> SELECT category, count(*) AS businesses FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.demo.`yelp/business.json`) c GROUP BY category ORDER BY businesses DESC;
+-----------------------------------+-------------+
| category                          | businesses  |
+-----------------------------------+-------------+
| Restaurants                       | 14303       |
| Shopping                          | 6428        |
…
| Australian                        | 1           |
| Boat Dealers                      | 1           |
| Firewood                          | 1           |
+-----------------------------------+-------------+
715 rows selected (3.439 seconds)
> SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');
+-------------------+-------------------------------------------------------------------------+
| name              | categories                                                              |
+-------------------+-------------------------------------------------------------------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+-------------------+-------------------------------------------------------------------------+
4. LEVERAGING VIEWS
Create a View for Name-Gender Mapping
> CREATE VIEW dfs.tmp.`names` AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name       | gender     |
+------------+------------+
| John       | Male       |
+------------+------------+
[Screenshot: names.csv, with the first field labeled columns[0] and the fifth field labeled columns[4].]
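The view relies on Drill exposing each line of a headerless CSV file as an array named columns, so columns[0] and columns[4] are the first and fifth fields. The Python sketch below mimics that projection. The contents of names.csv are not reproduced in this deck, so the sample rows are hypothetical: only the John/Male and Jennifer/Female pairings come from query results shown here, and the middle fields are placeholders.

```python
import csv
import io
from typing import Dict, List

# Hypothetical stand-in for names.csv; fields 1 to 3 are placeholders.
SAMPLE_CSV = "John,a,b,c,Male\nJennifer,a,b,c,Female\n"


def names_view(text: str) -> List[Dict[str, str]]:
    """Project field 0 as name and field 4 as gender, like the view does."""
    reader = csv.reader(io.StringIO(text))
    return [{"name": columns[0], "gender": columns[4]} for columns in reader]


view = names_view(SAMPLE_CSV)
john = [row for row in view if row["name"] == "John"]
print(john)
```

Giving the positional fields names once, in a view, keeps later queries readable and hides the file layout from whoever runs them.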
Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY u.name, n.gender ORDER BY number DESC LIMIT 10;
+------------+------------+------------+
| name       | gender     | number     |
+------------+------------+------------+
| David      | Male       | 2453       |
| John       | Male       | 2378       |
| Michael    | Male       | 2322       |
| Chris      | Unknown    | 2202       |
| Mike       | Male       | 2037       |
| Jennifer   | Female     | 1867       |
| Jessica    | Female     | 1463       |
| Jason      | Male       | 1457       |
| Michelle   | Female     | 1439       |
| Brian      | Male       | 1436       |
+------------+------------+------------+
Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY n.gender;
+------------+------------+------------+
| gender     | users      | stars      |
+------------+------------+------------+
| Female     | 103684     | 3.77       |
| Male       | 97430      | 3.696      |
| Unknown    | 18409      | 3.727      |
+------------+------------+------------+
Who Writes Longer Reviews – Men or Women?
> SELECT n.gender, round(avg(length(r.text))) AS review_length FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name AND r.user_id = u.user_id GROUP BY n.gender;
+------------+---------------+
| gender     | review_length |
+------------+---------------+
| Male       | 665           |
| Female     | 730           |
| Unknown    | 711           |
+------------+---------------+
It takes a 3-way join to find out…
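The three-way join can be sketched in Python as two lookup tables and one pass over the reviews: reviews (a JSON file) join to users (a MongoDB collection) on user_id, and users join to names (a view over a CSV file) on name. The sketch assumes user_id and name are unique in their lookup tables, which a SQL join does not require, and all sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def avg_review_length_by_gender(
    reviews: List[Dict[str, str]],
    users: List[Dict[str, str]],
    names: List[Dict[str, str]],
) -> Dict[str, int]:
    """Join reviews -> users -> names, then average text length per gender."""
    gender_by_name = {n["name"]: n["gender"] for n in names}
    name_by_user_id = {u["user_id"]: u["name"] for u in users}

    lengths = defaultdict(list)
    for review in reviews:
        name = name_by_user_id.get(review["user_id"])
        gender = gender_by_name.get(name)
        if gender is not None:  # inner join: unmatched reviews are dropped
            lengths[gender].append(len(review["text"]))
    return {g: round(sum(v) / len(v)) for g, v in lengths.items()}


# Hypothetical rows standing in for the three sources.
names = [{"name": "John", "gender": "Male"}, {"name": "Jennifer", "gender": "Female"}]
users = [{"user_id": "u1", "name": "John"}, {"user_id": "u2", "name": "Jennifer"}]
reviews = [
    {"user_id": "u1", "text": "good"},
    {"user_id": "u1", "text": "great!"},
    {"user_id": "u2", "text": "excellent"},
    {"user_id": "u3", "text": "no matching user"},
]

result = avg_review_length_by_gender(reviews, users, names)
print(result)
```

In Drill the same join runs as a single SQL statement across all three datastores, with no step that first copies the data into one system.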
Thank You!
• Download at drill.apache.org
• Tweet: @ApacheDrill