2015 MapR Technologies
Apache Drill Overview
M.C. Srivas, CTO and Co-Founder, MapR Technologies
Data Engineer, MapR Technologies
September 15, 2015
(@nagix) MapR Technologies
NS-SHAFT
Apache Drill 1.0 (5/19) http://drill.apache.org
Apache Drill
[Figure: timeline with axis ticks 1980, 1990, 2000, 2010, 2020 and an "80%" callout. Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data]
[Figure: timeline with axis ticks 1980, 1990, 2000, 2010, 2020; DB sizes labeled GB to TB and TB to PB]
SQL
SQL NoSQL
SQL
BI (Tableau, MicroStrategy)
HDFS (Parquet, JSON), HBase
Industry's First Schema-free SQL Engine for Big Data
[Figure: diagram of BI and IT roles across two periods, 1980-1990 and 2000. Labels: BI, IT, ETL]
Hadoop
Drill
(Hive )
2
SCHEMA ON WRITE
SCHEMA BEFORE READ
SCHEMA ON THE FLY
Drill
JSON, BSON
HBase
Parquet, Avro
CSV, TSV
Name      Gender  Age
Michael   M       6
Jennifer  F       3
{
  "name": {
    "first": "Michael",
    "last": "Smith"
  },
  "hobbies": ["ski", "soccer"],
  "district": "Los Altos"
}
{
  "name": {
    "first": "Jennifer",
    "last": "Gates"
  },
  "hobbies": ["sing"],
  "preschool": "CCLC"
}
RDBMS/SQL-on-Hadoop
Apache Drill
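The contrast between the fixed-schema table and the two JSON records can be made concrete with a small sketch. It is plain Python, not Drill's implementation: the helper names `discover_fields` and `project` are hypothetical, and the records are the two from the slide. The point it illustrates is that the set of fields is found while reading, and a field missing from a record simply comes back as a null.

```python
import json

# The two records from the slide, one JSON object per line.
RECORDS = """\
{"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["ski", "soccer"], "district": "Los Altos"}
{"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["sing"], "preschool": "CCLC"}
"""


def discover_fields(lines):
    """Return the union of top-level field names, in first-seen order."""
    fields = []
    for line in lines:
        if not line.strip():
            continue
        for key in json.loads(line):
            if key not in fields:
                fields.append(key)
    return fields


def project(lines, wanted):
    """Yield one tuple per record; fields a record lacks come back as None."""
    for line in lines:
        if line.strip():
            record = json.loads(line)
            yield tuple(record.get(name) for name in wanted)


lines = RECORDS.splitlines()
fields = discover_fields(lines)
rows = list(project(lines, ["district", "preschool"]))
```

No schema is declared anywhere: `fields` is derived from the data itself, and `rows` holds `None` wherever a record does not carry the requested field.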
Drill: SQL on Everything
SELECT * FROM dfs.yelp.`business.json`
- HBase
- Hive
- Hive
- HBase
- DFS (Text, Parquet, JSON)
- HBase/MapR-DB
- Hive/HCatalog
- Hadoop API
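In a Drill FROM clause such as the one on this slide, the dotted name is read as storage plugin, optional workspace, and table (a file or directory for the dfs plugin). The following sketch only illustrates that naming scheme; `split_table_reference` and `describe` are hypothetical helpers, not part of Drill.

```python
import re


def split_table_reference(ref):
    """Split a dotted reference; backquoted parts may themselves contain dots."""
    parts = re.findall(r"`([^`]*)`|([^.`]+)", ref)
    return [quoted or bare for quoted, bare in parts]


def describe(ref):
    """Label the parts of plugin.workspace.table or plugin.table."""
    parts = split_table_reference(ref)
    if len(parts) == 3:
        keys = ("storage_plugin", "workspace", "table")
    elif len(parts) == 2:
        keys = ("storage_plugin", "table")
    else:
        raise ValueError("expected two or three name parts: %r" % ref)
    return dict(zip(keys, parts))


yelp = describe("dfs.yelp.`business.json`")
transactions = describe("dfs.`/tmp/Transactions`")
```

The backquotes matter: without them, the dot in `business.json` would be taken as another name separator.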
(drillbit)
(MapReduce, Spark, Tez)
[Figure: three drillbits and three ZooKeeper nodes]
Drill
- HDFS, MapR-FS: DataNode, drillbit
- HBase, MapR-DB: RegionServer, drillbit
- MongoDB: mongod, drillbit
[Figure: three nodes, each running a drillbit next to DataNode/RegionServer/mongod, with three ZooKeeper nodes]
[Figure: SELECT query flow in five numbered steps between a client (JDBC, ODBC, REST), drillbits, and ZooKeeper]
* CTAS (CREATE TABLE AS SELECT) 14
[Figure: inside a drillbit. Labels: SQL, Hive, HBase, MongoDB, DFS, RPC]
M.C. Srivas, CTO, MapR Technologies
MapReduce, Bigtable
NetApp
AFS
Drill
Raw Data Exploration
JSON Analytics
Data Hub Analytics
Hive, HBase
{JSON}, Parquet, Text
IoT
SaaS Apache Drill JSON BI ODBC
ETL
SQL Hadoop
MapR Drill: Pig, HiveQL, SQL
Drill Tableau Squirrel
MapR 1/100 $1,000 / TB MapR Drill BI SQL Hadoop
SQL $100,000 / TB
ETL SQL
Customer-facing Analytics as a Service Drill
MapR Drill Drill
Hadoop SQL
Drill
JSON, Parquet; 10 GB to 4 TB; 160
SLA
MapR Optimized Data Architecture
[Figure: Optimized Data Architecture. Labels: SaaS; Data Movement; Data Access; BI; MAPR DISTRIBUTION FOR HADOOP; MapR Data Platform (MapR-FS, MapR-DB); (Spark Streaming, Storm); (MapReduce, Spark, Hive, Pig); (Drill, Impala)]
Apache Drill
Drill Beta (September 2014 - April 2015)
Drill 1.0 (May 2015)
Drill 1.1 (July 2015)
Drill 1.2 (September 2015)
Drill 1.3()
2015 MapR Technologies 29
Apache Drill (2015)
ANSI SQL o (Rank, Row_number,
OVER, PARTITION BY) o CTAS
o Hive &
o Hive UDF o Hive Impersonation o AVRO
(Beta) JDBC
Drill 1.1
ANSI SQL o (Lead, Lag,
First_Value, Last_value, NTile) o Drop Table
o Hive
o Hive
o MapR-DB
o
Drill Web UI
Drill 1.2 ANSI SQL o Insert/Append
o
o o Drill on MapR-DB JSON
o MapR-DB
o Parquet
Drill 1.3
2015 MapR Technologies 30
Hive BI Hive Hive
Hive Hive Drill Hive UDF Hive Drill Impersonation
[Figure. Labels: Hive, Parquet & Text, Drill, Drill ODBC, Drill JDBC; version labels 1.1, 1.2]
2015 MapR Technologies 31
MapR-DB BI (Tableau,
MicroStrategy, Qlikview, ) MapR-DB KV MapR-DB JSON
MapR-DB SQL ES
MapR-DB
MapR-DB
Drill
Drill ODBC
Drill JDBC
1.2 1.3
1.3
2015 MapR Technologies 32
ANSI SQL
Count/Avg/Min/Max/Sum Over/Partition By Rank, Dense_Rank, Percent_Rank, Row_Number, Cume_Dist Lead, Lag, First_Value, Last_Value, Ntile
SQL DDL Parquet Drop table Insert/Append
1.1
1.2
1.1 1.2 1.3
1.1
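For readers less familiar with the window functions listed on this slide, here is a rough Python sketch of what ROW_NUMBER, RANK, and LAG compute over a partition. It is an illustration of the SQL semantics only, not Drill code; the `window` helper and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def window(rows, partition_key, order_key):
    """Add row_number, rank, and lag(order_key) per partition, ordered ascending."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        prev = None
        rank = 0
        for n, row in enumerate(group, start=1):
            if prev is None or row[order_key] != prev:
                rank = n  # ties share a rank; the next distinct value skips ahead
            out.append({**row, "row_number": n, "rank": rank, "lag": prev})
            prev = row[order_key]
    return out


# Hypothetical sample data.
sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 30},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 10},
]
result = window(sales, "region", "amount")
summary = [(r["region"], r["row_number"], r["rank"], r["lag"]) for r in result]
```

In SQL terms this corresponds roughly to `ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount)` and its RANK and LAG counterparts.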
2015 MapR Technologies 33
PAM +
Impersonation
[Figure. Labels: User (three users), JDBC/ODBC, Web UI, Drill View 1, Drill View 2, Files, HBase, Hive; version label 1.2]
2015 MapR Technologies 34
&
BI
2015 MapR Technologies 35 2015 MapR Technologies
2015 MapR Technologies 36
Drill (e-Stat)
2015 MapR Technologies 37
Drill (e-Stat)
e-Stat Apache Drill http://nagix.hatenablog.com/entry/2015/05/21/232526
2015 MapR Technologies 38
2015 MapR Technologies 39
2015 MapR Technologies 40
Drill JDK 7 $ wget http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz$ tar -xvzf apache-drill-1.1.0.tar.gz$ apache-drill-1.1.0/bin/drill-embedded0: jdbc:drill:zk=local>
2015 MapR Technologies 41
$ ls -l
2015 MapR Technologies 42
README$ cat README
2015 MapR Technologies 43
2015 MapR Technologies 44
MySQL
DROP TABLE IF EXISTS ``;
CREATE TABLE `` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `createdon` timestamp NULL DEFAULT NULL,
  `createdby` int(11) DEFAULT NULL,
  ...
) ENGINE=InnoDB AUTO_INCREMENT=36993336 DEFAULT CHARSET=utf8;
LOCK TABLES `` WRITE;
INSERT INTO `` VALUES (9,'2002-01-17 02:15:08',0,'2011-10-14 13:47:31',20,2,2,1,1,0,19630, ... ),( ... ), ... ,( ... );
INSERT INTO `` VALUES (2297,'2002-03-19 22:13:14',0,'2011-10-14 15:47:29',11,3,2,1,2,0,21891, ... ),( ... ), ... ,( ... );
...
2015 MapR Technologies 45
MySQL DROP TABLE IF EXISTS ``;CREATE TABLE `` ( `id` int(11) NOT NULL AUTO_INCREMENT, `createdon` timestamp NULL DEFAULT NULL, `createdby` int(11) DEFAULT NULL, ...) ENGINE=InnoDB AUTO_INCREMENT=36993336 DEFAULT CHARSET=utf8;
LOCK TABLES `` WRITE;INSERT INTO `` VALUES (9,'2002-01-17 02:15:08',0,'2011-10-14 13:47:31',20,2,2,1,1,0,19630, ... ),( ... ), ... ,( ... );INSERT INTO `` VALUES (2297,'2002-03-19 22:13:14',0,'2011-10-14 15:47:29',11,3,2,1,2,0,21891, ... ),( ... ), ... ,( ... );...
CSV
2015 MapR Technologies 46
MySQL CSV #!/usr/bin/perl
while () { s/^(--|\/\*| |\)|DROP|CREATE|LOCK).*//g; # s/^INSERT INTO .+ VALUES \(//g; # INSERT s/(?
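As a rough illustration of the same dump-to-CSV conversion, here is a minimal Python sketch. It is not a port of the Perl script on the slide; it assumes one complete INSERT statement per line, treats an unquoted NULL as an empty CSV field, and handles backslash escapes by keeping the escaped character literally (so `\n` is not turned into a newline). The function names, the table name `t`, and the sample line are hypothetical.

```python
import csv
import io


def parse_insert_values(line):
    """Parse "INSERT INTO `t` VALUES (...),(...);" into rows of str or None."""
    i = line.index(" VALUES ") + len(" VALUES ")
    rows, row, field = [], None, []
    in_quote = False   # currently inside a '...' literal
    quoted = False     # the current field contained a quoted literal
    while i < len(line):
        ch = line[i]
        if in_quote:
            if ch == "\\" and i + 1 < len(line):
                field.append(line[i + 1])  # keep the escaped character as-is
                i += 2
                continue
            if ch == "'":
                in_quote = False
            else:
                field.append(ch)
        elif ch == "'":
            in_quote = True
            quoted = True
        elif ch == "(" and row is None:
            row, field, quoted = [], [], False
        elif ch in ",)" and row is not None:
            text = "".join(field)
            row.append(None if (text == "NULL" and not quoted) else text)
            field, quoted = [], False
            if ch == ")":
                rows.append(row)
                row = None
        elif row is not None:
            field.append(ch)
        i += 1
    return rows


def rows_to_csv(rows):
    """Write rows as CSV text; None (SQL NULL) becomes an empty field."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if value is None else value for value in row])
    return out.getvalue()


sample = "INSERT INTO `t` VALUES (9,'2002-01-17 02:15:08',0,NULL,'a,b'),(10,'O\\'Brien','',1,'x');"
rows = parse_insert_values(sample)
csv_text = rows_to_csv(rows)
```

Using the `csv` module for output means values that contain commas are quoted automatically, which a plain string join would not do.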
CSV SELECT
About 31.97 million rows
0: jdbc:drill:zk=local> SELECT count(*) FROM dfs.`/tmp/.csv`;
+-----------+
| EXPR$0    |
+-----------+
| 31971575  |
+-----------+
1 row selected (32.733 seconds)
CSV SELECT
Each CSV row is returned as a single columns array: [a,b,...]
0: jdbc:drill:zk=local> !set maxwidth 160
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/.csv` LIMIT 3;
+---------+
| columns |
+---------+
| ["9","2002-01-17 02:15:08","0","2011-10-14 13:47:31","20","2","2","1","1","0","19630","","",""," Ave.","Suite ","To |
| ["10","2002-01-17 02:22:35","0","2011-10-14 13:47:31","10","2","3","2","2","0","19631","","",""," Ave","","York Region"," |
| ["11","2002-01-17 20:17:27","0","2011-10-14 13:47:32","0","2","2","1","2","0","19632","","","","","","Toronto",""," |
+---------+
3 rows selected (0.564 seconds)
CSV SELECT
Individual fields are addressed as columns[0], columns[1], ...
0: jdbc:drill:zk=local> SELECT columns[0], columns[1], columns[2], columns[3], columns[4] FROM dfs.`/tmp/.csv` LIMIT 3;
+---------+----------------------+---------+----------------------+---------+
| EXPR$0  | EXPR$1               | EXPR$2  | EXPR$3               | EXPR$4  |
+---------+----------------------+---------+----------------------+---------+
| 9       | 2002-01-17 02:15:08  | 0       | 2011-10-14 13:47:31  | 20      |
| 10      | 2002-01-17 02:22:35  | 0       | 2011-10-14 13:47:31  | 10      |
| 11      | 2002-01-17 20:17:27  | 0       | 2011-10-14 13:47:32  | 0       |
+---------+----------------------+---------+----------------------+---------+
3 rows selected (0.356 seconds)
CSV SELECT
Column aliases matching the MySQL column names
0: jdbc:drill:zk=local> SELECT columns[0] AS id, columns[1] AS createdon, columns[2] AS createdby, columns[3] AS updatedon, columns[4] AS updatedby FROM dfs.`/tmp/.csv` LIMIT 3;
+-----+----------------------+------------+----------------------+------------+
| id  | createdon            | createdby  | updatedon            | updatedby  |
+-----+----------------------+------------+----------------------+------------+
| 9   | 2002-01-17 02:15:08  | 0          | 2011-10-14 13:47:31  | 20         |
| 10  | 2002-01-17 02:22:35  | 0          | 2011-10-14 13:47:31  | 10         |
| 11  | 2002-01-17 20:17:27  | 0          | 2011-10-14 13:47:32  | 0          |
+-----+----------------------+------------+----------------------+------------+
3 rows selected (0.327 seconds)
CSV SELECT
CSV values are read as VARCHAR; convert them with CAST(... AS ...)
0: jdbc:drill:zk=local> SELECT CAST(columns[0] AS INT) AS id, CAST(columns[1] AS TIMESTAMP) AS createdon, CAST(columns[2] AS INT) AS createdby, CAST(columns[3] AS TIMESTAMP) AS updatedon, CAST(columns[4] AS INT) AS updatedby FROM dfs.`/tmp/.csv` LIMIT 3;
Error: SYSTEM ERROR: NumberFormatException:
Fragment 1:2
[Error Id: 33d800c9-78ea-473a-8e41-b13e38307af3 on node1:31010] (state=,code=0)
Treating empty CSV values as NULL
Option 1: CASE expression
CASE WHEN columns[2] = '' THEN NULL ELSE CAST(columns[2] AS INT) END
Option 2: system option
0: jdbc:drill:zk=local> ALTER SYSTEM SET `drill.exec.functions.cast_empty_string_to_null` = true;
+-------+----------------------------------------------------------+
| ok    | summary                                                  |
+-------+----------------------------------------------------------+
| true  | drill.exec.functions.cast_empty_string_to_null updated.  |
+-------+----------------------------------------------------------+
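The failure and the two workarounds can be mimicked in a few lines of Python. This is only an analogy for the behavior shown on the slides, not how Drill implements casts; `cast_int` and the sample row (with an empty third field) are hypothetical.

```python
def cast_int(value, empty_as_null=False):
    """Cast a CSV string to int; optionally map '' to None (standing in for NULL)."""
    if value == "" and empty_as_null:
        return None
    return int(value)  # int("") raises ValueError, much like the NumberFormatException


# Hypothetical row whose third field (createdby) is empty.
columns = ["9", "2002-01-17 02:15:08", "", "2011-10-14 13:47:31", "20"]

# A strict cast fails on the empty string.
try:
    cast_int(columns[2])
    strict_failed = False
except ValueError:
    strict_failed = True

# Option 1: an explicit per-column check, like the CASE expression.
createdby_case = None if columns[2] == "" else int(columns[2])

# Option 2: a global switch, like the cast_empty_string_to_null option.
createdby_flag = cast_int(columns[2], empty_as_null=True)
```

Option 1 has to be repeated for every column that may be empty, while option 2 changes the behavior of every cast at once, which is the trade-off between the CASE expression and the system option.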
CSV SELECT 2
0: jdbc:drill:zk=local> SELECT CAST(columns[0] AS INT) AS id, CAST(columns[1] AS TIMESTAMP) AS createdon, CAST(columns[2] AS INT) AS createdby, CAST(columns[3] AS TIMESTAMP) AS updatedon, CAST(columns[4] AS INT) AS updatedby FROM dfs.`/tmp/.csv` LIMIT 3;
+-----+------------------------+------------+------------------------+------------+
| id  | createdon              | createdby  | updatedon              | updatedby  |
+-----+------------------------+------------+------------------------+------------+
| 9   | 2002-01-17 02:15:08.0  | 0          | 2011-10-14 13:47:31.0  | 20         |
| 10  | 2002-01-17 02:22:35.0  | 0          | 2011-10-14 13:47:31.0  | 10         |
| 11  | 2002-01-17 20:17:27.0  | 0          | 2011-10-14 13:47:32.0  | 0          |
+-----+------------------------+------------+------------------------+------------+
3 rows selected (0.734 seconds)
25
1 2
0: jdbc:drill:zk=local> SELECT columns[25] AS gender, count(*) AS number, TRUNC(100.0 * count(*) / 31971575, 2) AS percent FROM dfs.`/tmp/.csv` GROUP BY columns[25] ORDER BY columns[25];
+---------+-----------+----------+
| gender  | number    | percent  |
+---------+-----------+----------+
|         | 9809      | 0.03     |
| 0       | 2         | 0.0      |
| 1       | 4414808   | 13.8     |
| 2       | 27546956  | 86.16    |
+---------+-----------+----------+
4 rows selected (31.79 seconds)
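The percent column can be checked by hand. The sketch below recomputes it in Python from the counts in the result above; `trunc` is a hypothetical helper standing in for SQL TRUNC(x, 2), which cuts off digits rather than rounding.

```python
import math


def trunc(value, digits):
    """Truncate toward zero to the given number of decimal places."""
    factor = 10 ** digits
    return math.trunc(value * factor) / factor


# Counts per value of columns[25], copied from the query result above.
counts = {"": 9809, "0": 2, "1": 4414808, "2": 27546956}

total = sum(counts.values())
percent = {key: trunc(100.0 * n / total, 2) for key, n in counts.items()}
```

The counts add up to the 31971575 used as the divisor in the query, which is why the hard-coded constant gives the same percentages as dividing by the sum of the groups.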
0: jdbc:drill:zk=local> SELECT columns[0] AS pnum, columns[1] AS email FROM dfs.`/tmp/.csv` WHERE columns[1] = '[email protected]';
+-----------+------------------------------+
| pnum      | email                        |
+-----------+------------------------------+
| 12655726  | [email protected]            |
+-----------+------------------------------+
1 row selected (10.566 seconds)
The view definition is stored under /tmp as a .view.drill file (JSON)
0: jdbc:drill:zk=local> CREATE VIEW dfs.tmp.`` AS SELECT
. . . . . . . . . . . > CAST(columns[0] AS INT) AS id,
. . . . . . . . . . . > CAST(columns[1] AS TIMESTAMP) AS createdon,
. . . . . . . . . . . > CAST(columns[2] AS INT) AS createdby,
. . . . . . . . . . . > CAST(columns[3] AS TIMESTAMP) AS updatedon,
. . . . . . . . . . . > CAST(columns[4] AS INT) AS updatedby
. . . . . . . . . . . > ...
. . . . . . . . . . . > FROM
. . . . . . . . . . . > dfs.`/tmp/.csv`
. . . . . . . . . . . > ;
2642 CSV files
$ ls Transactions
2008-03-21_downloaded.csv  2010-08-19_downloaded.csv  2013-01-16_downloaded.csv
2008-03-22_downloaded.csv  2010-08-20_downloaded.csv  2013-01-17_downloaded.csv
2008-03-23_downloaded.csv  2010-08-21_downloaded.csv  2013-01-18_downloaded.csv
2008-03-24_downloaded.csv  2010-08-22_downloaded.csv  2013-01-19_downloaded.csv
2008-03-25_downloaded.csv  2010-08-23_downloaded.csv  2013-01-20_downloaded.csv
2008-03-26_downloaded.csv  2010-08-24_downloaded.csv  2013-01-21_downloaded.csv
2008-03-27_downloaded.csv  2010-08-25_downloaded.csv  2013-01-22_downloaded.csv
2008-03-28_downloaded.csv  2010-08-26_downloaded.csv  2013-01-23_downloaded.csv
2008-03-29_downloaded.csv  2010-08-27_downloaded.csv  2013-01-24_downloaded.csv
2008-03-30_downloaded.csv  2010-08-28_downloaded.csv  2013-01-25_downloaded.csv
2008-03-31_downloaded.csv  2010-08-29_downloaded.csv  2013-01-26_downloaded.csv
2008-04-01_downloaded.csv  2010-08-30_downloaded.csv  2013-01-27_downloaded.csv
2008-04-02_downloaded.csv  2010-08-31_downloaded.csv  2013-01-28_downloaded.csv
2008-04-03_downloaded.csv  2010-09-01_downloaded.csv  2013-01-29_downloaded.csv
...
Top 10
0: jdbc:drill:zk=local> SELECT columns[19] AS TXT_COUNTRY, count(*) AS number from dfs.`/tmp/Transactions` GROUP BY columns[19] ORDER BY count(*) DESC LIMIT 10;
+--------------+----------+
| TXT_COUNTRY  | number   |
+--------------+----------+
| US           | 7591509  |
| CA           | 823746   |
| BR           | 197032   |
| AU           | 146745   |
| TW           | 118338   |
| CL           | 109875   |
| ZA           | 78126    |
| AR           | 75314    |
| JP           | 74165    |
| GB           | 57901    |
+--------------+----------+
Organizing the CSV files into year/month directories
$ cd Transactions
$ for file in `ls *.csv`; do
> dir=`echo $file | cut -c 1-7 | tr - /`
> if [ ! -d $dir ]; then
> mkdir -p $dir
> fi
> mv $file $dir
> done
$ ls
2008 2009 2010 2011 2012 2013 2014 2015
$ ls 2008
03 04 05 06 07 08 09 10 11 12
$ ls 2008/03
2008-03-21_downloaded.csv  2008-03-25_downloaded.csv  2008-03-29_downloaded.csv
2008-03-22_downloaded.csv  2008-03-26_downloaded.csv  2008-03-30_downloaded.csv
2008-03-23_downloaded.csv  2008-03-27_downloaded.csv  2008-03-31_downloaded.csv
2008-03-24_downloaded.csv  2008-03-28_downloaded.csv
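The same reorganization can be sketched in Python, which may be easier to adapt than the shell loop. This is an illustrative equivalent under the assumption that every file name starts with `YYYY-MM`; `partition_by_month` is a hypothetical helper, and the demo runs in a temporary directory with three made-up files.

```python
import shutil
import tempfile
from pathlib import Path


def partition_by_month(root):
    """Move YYYY-MM-DD_*.csv files in root into root/YYYY/MM/."""
    root = Path(root)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(root.glob("*.csv")):
        year, month = path.name[:7].split("-")  # first 7 characters: 'YYYY-MM'
        target_dir = root / year / month
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved.append(f"{year}/{month}/{path.name}")
    return moved


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    names = (
        "2008-03-21_downloaded.csv",
        "2008-04-01_downloaded.csv",
        "2010-08-19_downloaded.csv",
    )
    for name in names:
        (Path(tmp) / name).write_text("AMOUNT\n")
    moved = partition_by_month(tmp)
    layout = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.csv"))
```

After the move, the first directory level is the year and the second is the month, which is the layout the directory-based query on the next slide relies on.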
Subdirectory levels are exposed as dir0, dir1
0: jdbc:drill:zk=local> SELECT dir0 AS year, dir1 AS month, TRUNC(SUM(CAST(REGEXP_REPLACE(REGEXP_REPLACE(columns[2], '^\\(', '-'), ',|\\)', '') AS DOUBLE)), 2) AS amount from dfs.`/tmp/Transactions` WHERE columns[2] <> 'AMOUNT' GROUP BY dir0, dir1 ORDER BY dir0, dir1;
+-------+-------+-----------------+
| dir0  | dir1  | amount          |
+-------+-------+-----------------+
| 2008  | 03    | 97676.25        |
| 2008  | 04    | 266162.39       |
| 2008  | 05    | 1330456.45      |
| 2008  | 06    | 1630110.26      |
| 2008  | 07    | 2590733.03      |
| 2008  | 08    | 2743130.11      |
| 2008  | 09    | 2436655.66      |
| 2008  | 10    | 2534268.59      |
| 2008  | 11    | 2934391.31      |
...
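The nested REGEXP_REPLACE calls in the query turn accounting-style amounts into plain numbers before summing. A small Python sketch of the same two substitutions, followed by a per-directory sum, may make the intent clearer. The `parse_amount` helper and the sample rows are hypothetical; only the two regular expressions are taken from the query.

```python
import re
from collections import defaultdict


def parse_amount(text):
    """Convert '(1,234.50)' to -1234.5 and '1,000.00' to 1000.0."""
    text = re.sub(r"^\(", "-", text)  # a leading '(' marks a negative amount
    text = re.sub(r",|\)", "", text)  # drop thousands separators and the closing ')'
    return float(text)


# Hypothetical rows of (dir0, dir1, columns[2]); the first is a header line.
rows = [
    ("2008", "03", "AMOUNT"),
    ("2008", "03", "1,234.50"),
    ("2008", "03", "(34.50)"),
    ("2008", "04", "100.00"),
]

totals = defaultdict(float)
for year, month, amount in rows:
    if amount != "AMOUNT":  # skip the header row present in each file
        totals[(year, month)] += parse_amount(amount)
```

The header check plays the role of the WHERE clause: each CSV file carries its own header line, so the literal string AMOUNT has to be filtered out before the cast.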
Apache Drill
Q & A @mapr_japan maprjapan
MapR
maprtech
mapr-technologies