
Page 1: Oracle and Hadoop, let them talk together

Oracle and Hadoop, let them talk together!!!

Laurent Leturgez

Page 2: Oracle and Hadoop, let them talk together

Whoami

• Oracle Consultant since 2001

• Former developer (C, Java, Perl, PL/SQL)

• Hadoop aficionado

• Owner@Premiseo: Data Management on Premise and in the Cloud

• Blogger since 2004: http://laurent-leturgez.com

• Twitter: @lleturgez

Page 3: Oracle and Hadoop, let them talk together

3 Membership Tiers:
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate

bit.ly/OracleACEProgram

500+ Technical Experts Helping Peers Globally

Connect:

Nominate yourself or someone you know: acenomination.oracle.com

@oracleace

Facebook.com/oracleaces

[email protected]

Page 4: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Agenda
  • Introduction

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 5: Oracle and Hadoop, let them talk together

Introduction

• I felt a great disturbance in the Force!!

Page 6: Oracle and Hadoop, let them talk together

Introduction

• The project (and database usage) approach has changed
  • 10-15 years ago … the "product" approach
    • New project
    • I usually use Oracle or SQL Server
    • So … I will use Oracle or SQL Server for this project
  • Nowadays … the "solution" approach
    • New project

    • What kind of data will my system store?
    • Do I have expectations regarding consistency, security, sizing, pricing/licensing, etc.?
    • So … I will use the right tool for the right job!

Companies are now living in a heterogeneous world!!!

Page 7: Oracle and Hadoop, let them talk together

Hadoop: What it is, how it works, and what it can do?

• Hadoop is a framework used:
  • For distributed storage (HDFS: Hadoop Distributed File System)
  • To process high volumes of data using the MapReduce programming model (not only)

Designed to scale to tens of petabytes

Structured and unstructured data

Designed to run on commodity servers

For analytics and batch workloads

• Hadoop is open source
  • But some enterprise distributions exist

Page 8: Oracle and Hadoop, let them talk together

Hadoop: What it is, how it works, and what it can do?

Page 9: Oracle and Hadoop, let them talk together

Hadoop: What it is, how it works, and what it can do?

• One Data, multiple processing engines

[Diagram: data in multiple formats (Parquet, Avro, ORC, delimited files) stored once on HDFS and shared by several processing engines: Hive, Mahout, MapReduce/YARN, Flume, Spark]

Page 10: Oracle and Hadoop, let them talk together

Right tool for the right job?

• Oracle
  • Perfect for OLTP processing
  • Ideal for all-in-one databases:
    • Structured: tables, constraints, typed data, etc.
    • Unstructured: images, videos, binary
    • Many formats: XML, JSON
    • Sharded database

• Hadoop
  • Free and scalable
  • Many open data formats (Avro, Parquet, Kudu, etc.)
  • Many processing tools (MapReduce, Spark, Kafka, etc.)
  • Analytic workloads
  • Designed to manage large amounts of data quickly

How can I connect Hadoop to Oracle, Oracle to Hadoop, and query data?

Which solutions exist to exchange data between Oracle and Hadoop?

How can I reuse my Oracle data in a Hadoop workload?

Page 11: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 12: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 13: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)

• Basically, a tool to run data exports and imports from a Hadoop cluster

• Scenarios
  • Enrich analytic workloads with multiple data sources

[Diagram: Sqoop imports Oracle data into HDFS (Flume or Kafka handle unstructured data); the analytic workload combines both, and results land on HDFS or go back to Oracle]

Page 14: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)

• Basically, a tool to run data export and import

• Scenarios
  • Offload analytic workloads on Hadoop

[Diagram: Sqoop copies Oracle data to HDFS, the analytic workload runs on Hadoop, and results are exported back to Oracle]

Page 15: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)

• Basically, a tool to run data export and import

• Scenarios
  • Offload analytic workloads on Hadoop and keep data on HDFS

[Diagram: Sqoop copies Oracle data to HDFS, the analytic workload runs on Hadoop, and results stay on HDFS]

Page 16: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop is a tool to move data between an RDBMS and Hadoop (HDFS)

• Basically, a tool to run data export and import

• Scenarios
  • Data archiving into Hadoop

[Diagram: Sqoop archives Oracle data as compressed filesets on HDFS]

Page 17: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop import is from RDBMS to Hadoop

$ sqoop import \

> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl \

> --username sh --password sh \

> --table S \

> --hive-import \

> --hive-overwrite \

> --hive-table S_HIVE \

> --hive-database hive_sample \

> --split-by PROD_ID -m 4 \

> --map-column-hive "TIME_ID"=timestamp

0: jdbc:hive2://hadoop1:10001/hive_sample> show create table s_hive;

+----------------------------------------------------+--+

| createtab_stmt |

+----------------------------------------------------+--+

| CREATE TABLE `s_hive`( |

| `prod_id` double, |

| `cust_id` double, |

| `time_id` timestamp, |

| `channel_id` double, |

| `promo_id` double, |

| `quantity_sold` double, |

| `amount_sold` double) |

| COMMENT 'Imported by sqoop on 2017/09/18 14:55:40' |

| ROW FORMAT SERDE |

| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |

| WITH SERDEPROPERTIES ( |

| 'field.delim'='\u0001', |

| 'line.delim'='\n', |

| 'serialization.format'='\u0001') |

| STORED AS INPUTFORMAT |

| 'org.apache.hadoop.mapred.TextInputFormat' |

| OUTPUTFORMAT |

| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |

| LOCATION |

| 'hdfs://hadoop1.localdomain:8020/user/hive/warehouse/hive_sample.db/s_hive' |

| TBLPROPERTIES ( |

| .../... ) |

+----------------------------------------------------+--+

$ hdfs dfs -ls -C /user/hive/warehouse/hive_sample.db/s_hive*

/user/hive/warehouse/hive_sample.db/s_hive/part-m-00000

/user/hive/warehouse/hive_sample.db/s_hive/part-m-00001

/user/hive/warehouse/hive_sample.db/s_hive/part-m-00002

/user/hive/warehouse/hive_sample.db/s_hive/part-m-00003

0: jdbc:hive2://hadoop1:10001/hive_sample> select count(*) from s_hive;

+----------+--+

| _c0 |

+----------+--+

| 2756529 |

+----------+--+

Page 18: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop import is from RDBMS to Hadoop
  • One Oracle session per mapper

• Reads are done in direct path mode

• A SQL statement can be used to filter the data to import (see the sketch below)

• Results can be stored in various formats: delimited text, Hive, Parquet, compressed or not

• The key issue is data type conversion
  • Hive datatype mapping (--map-column-hive "TIME_ID"=timestamp)
  • Java datatype mapping (--map-column-java "ID"=Integer,"VALUE"=String)
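As a hedged sketch combining filtering and an alternative output format (the query, split column, and target directory are illustrative; --query, --target-dir, --split-by, and --as-parquetfile are standard Sqoop 1.x options, and a free-form query must embed the $CONDITIONS token):

$ sqoop import \

> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl \

> --username sh --password sh \

> --query 'SELECT prod_id, time_id, amount_sold FROM sh.sales WHERE amount_sold > 1500 AND $CONDITIONS' \

> --target-dir /user/laurent/sales_filtered \

> --split-by prod_id -m 4 \

> --as-parquetfile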

Page 19: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop export is from Hadoop to RDBMS
  • Destination table has to be created first

• Direct mode is possible

• Two modes
  • Insert mode
  • Update mode (a hedged sketch follows the example below)

$ sqoop export \

> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl \

> --username sh --password sh \

> --direct \

> --table S_RESULT \

> --export-dir=/user/hive/warehouse/hive_sample.db/s_result \

> --input-fields-terminated-by '\001'

SQL> select * from sh.s_result;

PROD_ID P_SUM P_MIN

---------- ---------- ----------

47 1132200.93 25.97

46 749501.85 21.23

45 1527220.89 42.09

44 889945.74 42.09

…/…
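Update mode is selected with --update-key (a hedged sketch; PROD_ID as key column and the allowinsert upsert mode are assumptions for this table, while --update-key and --update-mode are standard Sqoop options):

$ sqoop export \

> --connect jdbc:oracle:thin:@//192.168.99.8:1521/orcl \

> --username sh --password sh \

> --table S_RESULT \

> --export-dir=/user/hive/warehouse/hive_sample.db/s_result \

> --input-fields-terminated-by '\001' \

> --update-key PROD_ID \

> --update-mode allowinsert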

Page 20: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 21: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Spark for Hadoop
  • Spark is an open source distributed computing framework
  • Fault tolerant by design
  • Can work with various cluster managers:
    • YARN
    • MESOS
    • Spark Standalone
    • Kubernetes (experimental)

  • Centered on a data structure called the RDD (Resilient Distributed Dataset)
  • Based on various components:
    • Spark Core
    • Spark SQL (data abstraction)
    • Spark Streaming (data ingestion)
    • Spark MLlib (machine learning)
    • Spark Graph (graph processing on top of Spark)

[Diagram: Spark SQL, Spark Streaming, Spark MLlib, and Spark Graph sit on top of Spark Core]

Page 22: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Spark for Hadoop
  • Resilient Distributed Datasets (RDD)
    • Read-only structure
    • Distributed over the cluster's machines
    • Maintained in a fault-tolerant way (see the illustration below)
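As a minimal illustration in spark-shell (hedged; the file path is reused from the example later in this deck):

scala> // build an RDD from a file on HDFS; lost partitions are recomputed from lineage

scala> val rdd = sc.textFile("/user/laurent/products.csv")

scala> // transformations are lazy; take() triggers the computation

scala> rdd.map(_.split(";")(0)).take(5)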

Page 23: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Spark for Hadoop
  • Evolution
    • RDDs, DataFrames, and Datasets can be filled from an Oracle data source

[Diagram: evolution from RDD, to DataFrame (an RDD organized with named columns), to Dataset (a DataFrame specialization); untyped: DataFrame = Dataset[Row], typed: Dataset[T]; best for Spark SQL]

Page 24: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Spark for Hadoop
  • Spark API languages (Scala, Java, Python, R)

Page 25: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Spark for Hadoop: Spark vs MapReduce
  • MR is batch oriented (Map then Reduce); Spark can also process data in (near) real time
  • MR stores intermediate data on disk; Spark keeps data in memory
  • MR is written in Java; Spark is written in Scala

• Performance comparison
  • WordCount on a 2 GB file
  • Execution time with and without optimization (mapper, reducer, memory, partitioning, etc.); cf. http://repository.stcloudstate.edu/cgi/viewcontent.cgi?article=1008&context=csit_etds

                       MR       Spark
Without optimization   3' 53''  34''
With optimization      2' 23''  29''

Roughly 5x faster.

Page 26: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Spark for Hadoop and Oracle … use cases

[Diagram: a Spark analytic workload / ML algorithm fills RDDs/DataFrames from Oracle via JDBC, from unstructured data on HDFS, and from other data sources (S3, etc.); results can be written back to Oracle or HDFS]

Page 27: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Spark for Hadoop and Oracle … example (Spark 1.6 / CDH 5.9)

scala> val s = sqlContext.read.format("jdbc").option("url", "jdbc:oracle:thin:@//192.168.99.8:1521/orcl").option("driver", "oracle.jdbc.OracleDriver").option("dbtable", "sales").option("user", "sh").option("password", "sh").load()

scala> val p = sqlContext.read.format("com.databricks.spark.csv").option("delimiter",";").load("/user/laurent/products.csv")

--

scala> s.registerTempTable("sales")

scala> p.registerTempTable("products")

--

scala> val f = sqlContext.sql(""" select products.prod_name,sum(sales.amount_sold) from sales, products where sales.prod_id=products.prod_id group by

products.prod_name""")

scala> f.write.parquet("/user/laurent/spark_parquet")

$ impala-shell

[hadoop1.localdomain:21000] > create external table t

> LIKE PARQUET '/user/laurent/spark_parquet/_metadata'

> STORED AS PARQUET location '/user/laurent/spark_parquet/';

[hadoop1.localdomain:21000] > select * from t;

+------------------------------------------------+-------------+

| prod_name | _c1 |

+------------------------------------------------+-------------+

| O/S Documentation Set - Kanji | 509073.63 |

| Keyboard Wrist Rest | 348408.98 |

| Extension Cable | 60713.47 |

| 17" LCD w/built-in HDTV Tuner | 7189171.77 |

.../...
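On Spark 2.x the same flow goes through a SparkSession (a hedged sketch, not from the original deck; spark is predefined in spark-shell, and csv() and createOrReplaceTempView() are the Spark 2 replacements for the spark-csv package and registerTempTable):

scala> // JDBC read from Oracle into a DataFrame

scala> val s = spark.read.format("jdbc").option("url", "jdbc:oracle:thin:@//192.168.99.8:1521/orcl").option("driver", "oracle.jdbc.OracleDriver").option("dbtable", "sales").option("user", "sh").option("password", "sh").load()

scala> val p = spark.read.option("delimiter", ";").csv("/user/laurent/products.csv")

scala> s.createOrReplaceTempView("sales")

scala> p.createOrReplaceTempView("products")

scala> spark.sql("select count(*) from sales").show()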

Page 28: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 29: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• ODBC Connectors
  • Cloudera delivers ODBC drivers for:

• Hive

• Impala

• ODBC drivers are also available for:
  • HortonWorks (Hive, SparkSQL)
  • MapR (Hive)
  • AWS EMR (Hive, Impala, HBase)
  • Azure HDInsight (Hive; Windows clients only)

Page 30: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• ODBC Connectors
  • Install the ODBC driver on the Oracle host
  • Configure ODBC on the Oracle host
  • Configure a heterogeneous datasource based on ODBC (dg4odbc)
  • Create a database link using this datasource (a hedged configuration sketch follows)
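A hedged configuration sketch (paths, DSN name, and driver file are illustrative; HS_FDS_CONNECT_INFO and HS_FDS_SHAREABLE_NAME are the documented dg4odbc gateway parameters):

# /etc/odbc.ini : DSN pointing at the Hive ODBC driver
[hivedsn]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
Host=hadoop1.localdomain
Port=10001

# $ORACLE_HOME/hs/admin/inithivedsn.ora : gateway init file
HS_FDS_CONNECT_INFO = hivedsn
HS_FDS_SHAREABLE_NAME = /usr/lib64/libodbc.so

-- after declaring the hivedsn SID in listener.ora and tnsnames.ora:
SQL> create database link hivedsn connect to "user" identified by "password" using 'HIVEDSN';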

Page 31: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• ODBC Connectors

[Diagram: the Oracle session resolves the database link through tnsnames.ora and the listener, which routes it to the dg4odbc service; dg4odbc goes through the ODBC driver manager and the DSN (odbc.ini) to the non-Oracle ODBC driver on the Hadoop side, where the join takes place]

select c1, c2
from orcl_t a,
     t@hive_lnk b
where a.id = b.id
and b.id > 1000

Filter predicates are pushed down to Hive / Impala.

Page 32: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

------------------------------------------------------------------------------------------------

SQL_ID abb5vb85sd3kt, child number 0

-------------------------------------

select p.prod_name,sum(s."quantity_sold") from products p,

s_hive@hivedsn s where p.prod_id=s."prod_id" and s."amount_sold">1500

group by p.prod_name

Plan hash value: 3779319722

------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Inst |IN-OUT|

------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 204 (100)| | | |

| 1 | HASH GROUP BY | | 71 | 4899 | 204 (1)| 00:00:01 | | |

|* 2 | HASH JOIN | | 100 | 6900 | 203 (0)| 00:00:01 | | |

| 3 | TABLE ACCESS FULL| PRODUCTS | 72 | 2160 | 3 (0)| 00:00:01 | | |

| 4 | REMOTE | S_HIVE | 100 | 3900 | 200 (0)| 00:00:01 | HIVED~ | R->S |

------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("P"."PROD_ID"="S"."prod_id")

Remote SQL Information (identified by operation id):

----------------------------------------------------

4 - SELECT `prod_id`,`quantity_sold`,`amount_sold` FROM `S_HIVE` WHERE

`amount_sold`>1500 (accessing 'HIVEDSN' )

hive> show create table s_hive;

.../...

+----------------------------------------------------+--+

| createtab_stmt |

+----------------------------------------------------+--+

| CREATE TABLE `s_hive`( |

.../...

| TBLPROPERTIES ( |

| 'COLUMN_STATS_ACCURATE'='true', |

| 'numFiles'='6', |

| 'numRows'='2756529', |

| 'rawDataSize'='120045918', |

| 'totalSize'='122802447', |

| 'transient_lastDdlTime'='1505740396') |

+----------------------------------------------------+--+

SQL> select count(*) from s_hive@hivedsn where "amount_sold">1500;

COUNT(*)

----------

24342

Page 33: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 34: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData Connectors
  • Components available:
    • Oracle Datasource for Apache Hadoop
    • Oracle Loader for Hadoop
    • Oracle SQL Connector for HDFS
    • Oracle R Advanced Analytics for Hadoop
    • Oracle XQuery for Hadoop
    • Oracle Data Integrator Enterprise Edition

• BigData Connectors are licensed separately from the Big Data Appliance (BDA)
• BigData Connectors can be installed on a BDA or any Hadoop cluster
• BigData Connectors must be licensed for all processors of a Hadoop cluster
  • Public price: $2000 per Oracle Processor

Page 35: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData Connectors: Oracle Datasource for Apache Hadoop
  • Available for Hive and Spark
  • Enables an Oracle table as a datasource in Hive or Spark
  • Based on Hive external tables
    • Metadata is stored in HCatalog
    • Data stays on the Oracle server
  • Secured (Wallet and Kerberos integration)
  • Writing data from Hive to Oracle is possible
  • Performance
    • Filter predicates are pushed down
    • Projection pushdown to retrieve only required columns
    • Partition pruning enabled

Page 36: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData Connectors: Oracle Datasource for Apache Hadoop

hive> create external table s_od4h (

> prod_id double,

> cust_id double,

> time_id timestamp,

> channel_id double,

> promo_id double,

> quantity_sold double,

> amount_sold double)

> STORED BY 'oracle.hcat.osh.OracleStorageHandler'

> WITH SERDEPROPERTIES (

> 'oracle.hcat.osh.columns.mapping' = 'prod_id,cust_id,time_id,channel_id,promo_id,quantity_sold,amount_sold')

> TBLPROPERTIES (

> 'mapreduce.jdbc.url' = 'jdbc:oracle:thin:@//192.168.99.8:1521/orcl',

> 'mapreduce.jdbc.input.table.name' = 'SALES',

> 'mapreduce.jdbc.username' = 'SH',

> 'mapreduce.jdbc.password' = 'sh',

> 'oracle.hcat.osh.splitterKind' = 'SINGLE_SPLITTER'

> );

hive> select count(prod_id) from s_od4h;

Page 37: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData Connectors: Oracle Loader for Hadoop
  • Loads data from Hadoop into an Oracle table
  • Java MapReduce application
  • Online and offline modes
  • Needs several input files (XML):
    • A loadermap: describes the destination table (types, format, etc.)
    • An input file description: AVRO, Delimited, KV (if Oracle NoSQL file)
    • An output file description: JDBC (online), OCI (online), Delimited (offline), DataPump (offline)
    • A database connection description file (a hedged sketch follows)
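As a hedged sketch of the connection description file (oracle.hadoop.loader.connection.* are documented Oracle Loader for Hadoop properties; host and credentials reuse the demo values from this deck):

<!-- OL_connection.xml -->
<configuration>
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value>jdbc:oracle:thin:@//192.168.99.8:1521/orcl</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value>SH</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value>sh</value>
  </property>
</configuration>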

Page 38: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData Connectors: Oracle Loader for Hadoop

• Oracle Shell for Hadoop Loaders (OHSH)
  • Set of declarative commands to copy contents from Oracle to Hadoop (Hive)
  • Needs "Copy To Hadoop", which is included in the BigData SQL license

$ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \

-D oracle.hadoop.loader.jobName=HDFSUSER_sales_sh_loadJdbc \

-D mapred.reduce.tasks=0 \

-D mapred.input.dir=/user/laurent/sqoop_raw \

-D mapred.output.dir=/user/laurent/OracleLoader \

-conf /home/laurent/OL_connection.xml \

-conf /home/laurent/OL_inputFormat.xml \

-conf /home/laurent/OL_mapconf.xml \

-conf /home/laurent/OL_outputFormat.xml

Page 39: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData Connectors: Oracle SQL Connector for HDFS
  • Java MapReduce application
  • Retrieves data from Hadoop into an Oracle external table
  • Same limitations as Oracle external tables:
    • No insert, update, or delete
    • Parallel query enabled, with automatic load balancing
    • Full scan only
    • Indexing is not possible
  • Two commands:
    • createTable: creates the external table and links local location files to HDFS files
    • publish: refreshes the location files in the table DDL
  • Can be used to read:
    • Datapump files on HDFS
    • Delimited text files on HDFS
    • Delimited text files in Hive tables

Data is not available in real time.

Page 40: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData Connectors: Oracle SQL Connector for HDFS
  • createTable

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \

oracle.hadoop.exttab.ExternalTable \

-D oracle.hadoop.exttab.tableName=T1_EXT \

-D oracle.hadoop.exttab.sourceType=hive \

-D oracle.hadoop.exttab.hive.tableName=T1 \

-D oracle.hadoop.exttab.hive.databaseName=hive_sample \

-D oracle.hadoop.exttab.defaultDirectory=SALES_HIVE_DIR \

-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//192.168.99.8:1521/ORCL \

-D oracle.hadoop.connection.user=sh \

-D oracle.hadoop.exttab.printStackTrace=true \

-createTable

CREATE TABLE "SH"."T1_EXT"

( "ID" NUMBER(*,0),

"V" VARCHAR2(4000)

)

ORGANIZATION EXTERNAL

( TYPE ORACLE_LOADER

DEFAULT DIRECTORY "SALES_HIVE_DIR"

ACCESS PARAMETERS

( RECORDS DELIMITED BY 0X'0A'

CHARACTERSET AL32UTF8

PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'

FIELDS TERMINATED BY 0X'01'

MISSING FIELD VALUES ARE NULL

(

"ID" CHAR NULLIF "ID"=0X'5C4E',

"V" CHAR(4000) NULLIF "V"=0X'5C4E'

)

)

LOCATION

( 'osch-20170919025617-5707-1',

'osch-20170919025617-5707-2',

'osch-20170919025617-5707-3'

)

)

REJECT LIMIT UNLIMITED

PARALLEL

$ grep uri /data/sales_hive/osch-20170919025617-5707-1

<uri_list>

<uri_list_item size="9" compressionCodec="">

hdfs://hadoop1.localdomain:8020/user/hive/warehouse/hive_sample.db/t1/000000_0

</uri_list_item>

</uri_list>
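When new files land in HDFS, the location files are refreshed with -publish (a hedged sketch mirroring the -createTable invocation above):

$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar \

oracle.hadoop.exttab.ExternalTable \

-D oracle.hadoop.exttab.tableName=T1_EXT \

-D oracle.hadoop.exttab.sourceType=hive \

-D oracle.hadoop.exttab.hive.tableName=T1 \

-D oracle.hadoop.exttab.hive.databaseName=hive_sample \

-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//192.168.99.8:1521/ORCL \

-D oracle.hadoop.connection.user=sh \

-publish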

Page 41: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 42: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData SQL
  • Support for queries against non-relational datasources:
    • Apache Hive
    • HDFS
    • Oracle NoSQL
    • Apache HBase
    • Other NoSQL databases
  • Cold tablespace (and datafile) storage on Hadoop/HDFS
  • Licensing
    • BigData SQL is licensed separately from the Big Data Appliance
    • Installation on a BDA is not mandatory
    • BigData SQL is licensed per disk drive per Hadoop cluster
      • Public price: $4000 per disk drive
      • All disks in a Hadoop cluster have to be licensed

Page 43: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData SQL
  • Three-phase installation:
    • BigDataSQL parcel deployment (CDH)
    • Database server bundle configuration
    • Package deployment on the database server
  • For Oracle 12.1.0.2 and above
  • Needs some patches!! See Oracle Big Data SQL Master Compatibility Matrix (Doc ID 2119369.1)

Page 44: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData SQL: Features
  • External tables with new access drivers
    • ORACLE_HIVE: existing Hive tables; metadata is stored in HCatalog (DDL sketch below)
    • ORACLE_HDFS: creates an external table directly on HDFS; metadata is declared through access parameters (mandatory)
  • Smart Scan for HDFS
    • Oracle external tables typically require a full scan
    • BigData SQL extends Smart Scan capabilities to external tables:
      • Smaller result sets are sent to the Oracle server
      • Data movement and network traffic are reduced
  • Storage Indexes (only for HIVE and HDFS sources)
    • Oracle external tables cannot have indexes
    • BigData SQL maintains storage indexes automatically
    • Available for =, <, <=, !=, >=, >, IS NULL, and IS NOT NULL
  • Predicate pushdown
  • Read-only tablespaces on HDFS
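A hedged DDL sketch for the ORACLE_HIVE driver (table, directory, and cluster names are illustrative; com.oracle.bigdata.cluster and com.oracle.bigdata.tablename are documented BigData SQL access parameters):

SQL> CREATE TABLE sales_hive (
  prod_id       NUMBER,
  quantity_sold NUMBER,
  amount_sold   NUMBER)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.cluster=cluster
    com.oracle.bigdata.tablename=hive_sample.s_hive))
REJECT LIMIT UNLIMITED;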

Page 45: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData SQL: example

SQL_ID dm5u21rng1mf4, child number 0

-------------------------------------

select p.prod_name,sum(s."QUANTITY_SOLD") from products p,

laurent.sales_hdfs s where p.prod_id=s."PROD_ID" and

s."AMOUNT_SOLD">300 group by p.prod_name

Plan hash value: 4039843832

---------------------------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |

---------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 1364 (100)| |

| 1 | HASH GROUP BY | | 71 | 4899 | 1364 (1)| 00:00:01 |

|* 2 | HASH JOIN | | 20404 | 1374K| 1363 (1)| 00:00:01 |

| 3 | TABLE ACCESS FULL | PRODUCTS | 72 | 2160 | 3 (0)| 00:00:01 |

|* 4 | EXTERNAL TABLE ACCESS STORAGE FULL| SALES_HDFS | 20404 | 777K| 1360 (1)| 00:00:01 |

---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("P"."PROD_ID"="S"."PROD_ID")

4 - filter("S"."AMOUNT_SOLD">300)

24 rows selected.

Page 46: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Oracle BigData SQL: Read-only tablespace offload
  • Move cold data in a read-only tablespace to HDFS
  • Use a FUSE mount point to the HDFS root

SQL> select tablespace_name,STATUS from dba_tablespaces where tablespace_name='MYTBS';

TABLESPACE_NAME STATUS

------------------------------ ---------

MYTBS READ ONLY

SQL> select tablespace_name,status,file_name from dba_data_files where tablespace_name='MYTBS';

TABLESPACE_NAME STATUS FILE_NAME

------------------------------ --------- --------------------------------------------------------------------------------

MYTBS AVAILABLE /u01/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:cluster/user/oracle/cluster-oel

6.localdomain-orcl/MYTBS/mytbs01.dbf

[oracle@oel6 MYTBS]$ pwd

/u01/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:cluster/user/oracle/cluster-oel6.localdomain-orcl/MYTBS

[oracle@oel6 MYTBS]$ ls -l

total 4

lrwxrwxrwx 1 oracle oinstall 82 Sep 19 18:21 mytbs01.dbf -> /mnt/fuse-cluster-hdfs/user/oracle/cluster-oel6.localdomain-

orcl/MYTBS/mytbs01.dbf

$ hdfs dfs -ls /user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf

-rw-r--r-- 3 oracle oinstall 104865792 2017-09-19 18:21 /user/oracle/cluster-oel6.localdomain-orcl/MYTBS/mytbs01.dbf

Page 47: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Sqoop: import and export data between Oracle and Hadoop

• Spark for Hadoop

• ODBC Connectors

• Oracle BigData Connectors

• Oracle BigData SQL

• Gluent Data Platform

Page 48: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Gluent Data Platform
  • Presents data stored in Hadoop (in various formats) to any compatible RDBMS (Oracle, SQL Server, Teradata)
  • Offloads your data and your workload into Hadoop
    • Tables or contiguous partitions
  • Takes advantage of a distributed platform (storage and processing)
  • Advises you which schemas or data can be safely offloaded to Hadoop

Page 49: Oracle and Hadoop, let them talk together

Hadoop & Oracle: let them talk together

• Conclusion
  • Hadoop integration with Oracle can help:
    • To take advantage of distributed storage and processing
    • To optimize storage placement
    • To reduce TCO (workload offloading, Oracle DataMining Option, etc.)
  • Many scenarios
  • Many products for many solutions
  • Many prices
  • Choose the best solution for your specific problem!!!

Page 50: Oracle and Hadoop, let them talk together

Questions?