Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW Developers

T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)

E : [email protected] W : www.rittmanmead.com

Lesson 1 - Introduction to Hadoop & Big Data Technologies for Oracle BI/DW
Mark Rittman, CTO, Rittman Mead
SIOUG and HROUG Conferences, Oct 2014





About the Speaker

•Mark Rittman, Co-Founder of Rittman Mead
•Oracle ACE Director, specialising in Oracle BI&DW
•14 Years Experience with Oracle Technology
•Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
‣Oracle Business Intelligence Developers Guide
‣Oracle Exalytics Revealed
•Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog
•Email : [email protected]
•Twitter : @markrittman






About Rittman Mead

•Oracle BI and DW Gold partner
•Winner of five UKOUG Partner of the Year awards in 2013 - including BI
•World leading specialist partner for technical excellence, solutions delivery and innovation in Oracle BI
•Approximately 80 consultants worldwide
•All expert in Oracle BI and DW
•Offices in US (Atlanta), Europe, Australia and India
•Skills in broad range of supporting Oracle tools:
‣OBIEE, OBIA
‣ODIEE
‣Essbase, Oracle OLAP
‣GoldenGate
‣Endeca



Traditional Data Warehouse / BI Architectures

[Diagram: traditional structured data sources are loaded into the Traditional Relational Data Warehouse (Staging, then ETL to Foundation / ODS, then ETL to Performance / Dimensional). The BI tool (OBIEE) with its metadata layer reads the warehouse directly; an OLAP / in-memory tool loads data into its own database.]

•Three-layer architecture - staging, foundation and access/performance
•All three layers stored in a relational database (Oracle)
•ETL used to move data from layer-to-layer



Tools We Use…

•Oracle Database Enterprise Edition
•Oracle Business Intelligence Enterprise Edition
•Oracle Data Integrator
•Oracle Warehouse Builder
•SQL & PL/SQL
•ApEx
•etc…



Introducing Hadoop

•A new approach to data processing and data storage
•Rather than a small number of large, powerful servers, it spreads processing over large numbers of small, cheap, redundant servers
•Spreads the data you’re processing over lots of distributed nodes
•Has a scheduling/workload process that sends parts of a job to each of the nodes - a bit like Oracle Parallel Execution
•And does the processing where the data sits - a bit like Exadata storage servers
•Shared-nothing architecture
•Low-cost and highly horizontally scalable

[Diagram: a Job Tracker above a row of Task Trackers and Data Nodes]



BigDataLite Demonstration VM

•Demo / Training VM downloadable from OTN
•Contains Cloudera Hadoop + Oracle Big Data Connectors + Big Data SQL
•Similar to setup on Oracle BDA
•Contains OBIEE enabling technologies:
‣Apache Hive (SQL access over Hadoop)
‣Apache HDFS (file storage)
‣Oracle Direct Connector for HDFS
‣Oracle R Advanced Analytics for Hadoop
‣Oracle Big Data SQL
•Great way to get started with Hadoop
‣Requires 8GB RAM, modern laptop etc



Cloudera Distribution including Hadoop (CDH)

•Like Linux, you can set up your Hadoop system manually, or use a distribution
•Key Hadoop distributions include Cloudera CDH, Hortonworks HDP, MapR etc
•Cloudera CDH is the distribution Oracle use on Big Data Appliance
‣Provides HDFS and Hadoop framework for BDA
‣Includes Pig, Hive, Sqoop, Oozie, HBase
‣Cloudera Impala for real-time SQL access
‣Cloudera Manager & Hue



Cloudera Manager and Hue

•Web-based tools provided with Cloudera CDH
•Cloudera Manager used for cluster admin, maintenance (like Enterprise Manager)
‣Commercial tool developed by Cloudera
‣Not enabled by default in BigDataLite VM
•Hue is a developer / analyst tool for working with Pig, Hive, Sqoop, HDFS etc
‣Open source project included in CDH



Demo - Taking a look at the Big Data Lite VM



Why is Hadoop of Interest to Us?

•Gives us the ability to store more data, at more detail, for longer
•Provides a cost-effective way to analyse vast amounts of data
•Hadoop & NoSQL technologies can give us “schema-on-read” capabilities
•There’s a vast amount of innovation in this area we can harness
•And it’s very complementary to Oracle BI & DW



Hadoop Horizontal Scalability, and Economics

•Hadoop is easy to scale horizontally - designed for this from the start
•Only needs low-cost, commodity hardware - anticipates hardware failure
•Easy to add nodes after initial install
•Can deploy into the cloud (AWS etc) or run on-premise, managed service etc
•Typical clusters from 10 - 1000+ nodes
•10% of the cost/TB compared to RDBMS DWs
‣Cheaper hardware
‣Low-cost or OSS software
‣Lower TCO


Schema on Read vs. Schema on Write

•Traditional “Schema on Write”
‣Data quality managed by formalised ETL process
‣Data persisted in tabular, agreed and consistent form
‣Data integration happens in ETL
‣Structure must be decided before writing
•Big Data “Schema on Read”
‣Interpretation of data captured in code for each program accessing the data
‣Data quality dependent on code quality
‣Data integration happens in code

[Diagram labels: DQ, Bus. Rules, Mapping, ETL; Data pools]
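
The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the raw event lines, field layout and function names (read_purchases, read_page_hits) are all assumptions made up for the example. The raw data is stored as it arrived, and each reading program applies its own interpretation, so data quality depends on the code doing the reading.

```python
import csv
import io

# Hypothetical raw event lines, stored exactly as they arrived
# (no structure was enforced when they were written).
RAW_LINES = """\
2014-10-01,alice,view,/products/1
2014-10-01,bob,purchase,/products/1,19.99
not-a-valid-record
2014-10-02,alice,purchase,/products/2,5.00
"""


def read_purchases(raw_text):
    """Apply a schema at read time: keep only well-formed purchase rows."""
    purchases = []
    for fields in csv.reader(io.StringIO(raw_text)):
        # The reading program decides what a valid record looks like.
        if len(fields) == 5 and fields[2] == "purchase":
            date, user, _action, page, amount = fields
            purchases.append(
                {"date": date, "user": user, "page": page, "amount": float(amount)}
            )
    return purchases


def read_page_hits(raw_text):
    """A second program reads the same raw data with a different interpretation."""
    hits = {}
    for fields in csv.reader(io.StringIO(raw_text)):
        if len(fields) >= 4:
            hits[fields[3]] = hits.get(fields[3], 0) + 1
    return hits


purchases = read_purchases(RAW_LINES)
page_hits = read_page_hits(RAW_LINES)
print(purchases)
print(page_hits)
```

Both functions read the same unmodified text; neither required the structure to be agreed before the data was written.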









Data Analysis is Changing…

•In the past, it’s been sufficient to just consider transactional data for analysis & reporting
•Now, customers expect to consider all their data when making key business decisions



Hadoop Tenets : Simplified Distributed Processing

•Hadoop, through MapReduce, breaks processing down into simple stages
‣Map : select the columns and values you’re interested in, pass through as key/value pairs
‣Reduce : aggregate the results
•Most ETL jobs can be broken down into filtering, projecting and aggregating
•Hadoop then automatically runs the job on the cluster
‣Shared-nothing small chunks of work
‣Run the job on the node where the data is
‣Handle faults etc
‣Gather the results back in

[Diagram: Mapper (filter, project) feeding two Reducers (aggregate); Output is one HDFS file per reducer, in a directory]
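
A minimal single-process sketch of the map and reduce stages, written in Python rather than as a Hadoop job. The sales records and the function names (mapper, shuffle, reducer, run_job) are illustrative assumptions; real Hadoop runs these steps in parallel across the cluster and handles the grouping between map and reduce itself.

```python
from collections import defaultdict

# Hypothetical input records: (region, product, amount) sales rows.
SALES = [
    ("EMEA", "widget", 10.0),
    ("EMEA", "gadget", 5.0),
    ("APAC", "widget", 7.5),
    ("EMEA", "widget", 2.5),
    ("APAC", "refund", -1.0),
]


def mapper(record):
    """Map: filter unwanted rows, project two columns, emit key/value pairs."""
    region, _product, amount = record
    if amount > 0:            # filter out refunds
        yield region, amount  # project down to (key, value)


def shuffle(pairs):
    """Group values by key, standing in for what the framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


totals = run_job(SALES)
print(totals)
```

The filter and projection live in the mapper and the aggregation in the reducer, which is the same split most ETL jobs can be broken down into.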



The Hadoop Framework

•Jobs are submitted as JAR files to Master Node
•Master node runs the Job Tracker
•Slave Nodes hold replicated parts of whole dataset
•Job Tracker sends jobs to Slave Node Task Trackers
•Task Trackers run MapReduce against HDFS data
•Jobs run in parallel, sending results back to Job Tracker and then the client applications
‣Everything runs automatically in parallel
‣Hadoop takes care of orchestration, fault-tolerance
‣Simple design, crazily scalable
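
The idea of sending work to the nodes that already hold the data can be sketched as follows. This is a toy policy for illustration only, not the actual Job Tracker scheduling algorithm: the block names, node names and the "least-loaded replica holder" rule are all assumptions.

```python
# Hypothetical map of dataset blocks to the slave nodes holding a replica.
BLOCK_REPLICAS = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-a", "node-c"],
    "block-4": ["node-a", "node-b"],
}


def assign_tasks(block_replicas):
    """Give each block's task to the least-loaded node that holds a replica.

    Illustrative policy only; the real scheduler is more involved.
    """
    load = {}        # number of tasks assigned to each node so far
    assignment = {}  # block -> node chosen to process it
    for block in sorted(block_replicas):
        candidates = block_replicas[block]
        # Pick the replica holder with the fewest tasks so far;
        # ties go to the first replica listed.
        chosen = min(candidates, key=lambda node: load.get(node, 0))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


assignment = assign_tasks(BLOCK_REPLICAS)
print(assignment)
```

Every task lands on a node that already stores its block, so no data has to move before processing starts; replication gives the scheduler a choice of nodes and a fallback if one fails.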



HDFS: Low-Cost, Clustered, Fault-Tolerant Storage

•The filesystem behind Hadoop, used to store data for Hadoop analysis
‣Unix-like, uses commands such as ls, mkdir, chown, chmod
•Fault-tolerant, with rapid fault detection and recovery
•High-throughput, with streaming data access and large block sizes
•Designed for data-locality, placing data close to where it is processed
•Accessed from the command-line, via internet (hdfs://), GUI tools etc

[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle
Found 5 items
drwx------   - oracle hadoop          0 2013-04-27 16:48 /user/oracle/.staging
drwxrwxrwx   - oracle hadoop          0 2012-09-18 17:02 /user/oracle/moviedemo
drwxrwxrwx   - oracle hadoop          0 2012-10-17 15:58 /user/oracle/moviework
drwxrwxrwx   - oracle hadoop          0 2013-05-03 17:49 /user/oracle/my_stuff
drwxrwxrwx   - oracle hadoop          0 2012-08-10 16:08 /user/oracle/stage



Demo - Working with HDFS and Hadoop FS Shell



Core Apache Hadoop Tools

•Apache Hadoop, including MapReduce and HDFS
‣Scalable, fault-tolerant file storage for Hadoop
‣Parallel programming framework for Hadoop
•Apache Hive
‣SQL abstraction layer over HDFS
‣Perform set-based ETL within Hadoop
•Apache Pig, Spark
‣Dataflow-type languages over HDFS, Hive etc
‣Extensible through UDFs, streaming etc
•Apache Flume, Apache Sqoop, Apache Kafka
‣Real-time and batch loading into HDFS
‣Modular, fault-tolerant, wide source/target coverage



Hadoop 2.0 : YARN, Spark and Tez

•Hadoop 2.0 breaks the link between Hadoop and MapReduce
•Separates out resource management from job scheduling
•Introduces a new component - YARN
‣“Yet Another Resource Negotiator”
‣Backwards-compatible with Hadoop 1.0
•Makes Hadoop and YARN more of an “OS”
•Makes it possible to run other processing types on Hadoop
‣For example, Apache Spark, Apache Tez
•Now used in CDH5, Hadoop 2.0 etc



NoSQL Databases

•Family of database types that reject tabular storage, SQL access and ACID compliance
•Focus is on scalability, speed and schema-on-read
‣Oracle NoSQL Database - speed and scalability
‣Apache HBase - speed, scalability and Hadoop
‣MongoDB - native storage of JSON documents
•May or may not run on Hadoop, but associated with it
•Great choice for high-velocity data capture
•CRUD approach vs write-once/read many in HDFS
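
The contrast between a CRUD-style store and write-once/read-many storage can be sketched with two small in-memory classes. Neither class is a real NoSQL or HDFS API; KeyValueStore, AppendOnlyFile and the customer records are hypothetical stand-ins used only to show the difference in access pattern.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value store that supports CRUD."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        # Create a new row or update an existing one in place.
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        self._rows.pop(key, None)


class AppendOnlyFile:
    """In-memory stand-in for write-once/read-many storage: records are only appended."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        self._records.append((key, value))

    def latest(self, key):
        # Readers scan the records and pick the most recent version themselves.
        result = None
        for record_key, value in self._records:
            if record_key == key:
                result = value
        return result

    def __len__(self):
        return len(self._records)


store = KeyValueStore()
store.put("cust-1", {"city": "London"})
store.put("cust-1", {"city": "Atlanta"})  # overwrites the earlier value

log = AppendOnlyFile()
log.append("cust-1", {"city": "London"})
log.append("cust-1", {"city": "Atlanta"})  # earlier record is still kept

print(store.get("cust-1"), log.latest("cust-1"), len(log))
```

The key-value store answers a lookup directly and keeps one value per key, while the append-only file keeps every version and leaves the reader to work out which one is current.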



Oracle’s Big Data Products

•Oracle Big Data Appliance
‣Optimized hardware for Hadoop processing
‣Cloudera Distribution incl. Hadoop
‣Oracle Big Data Connectors, ODI etc
•Oracle Big Data Connectors
•Oracle Big Data SQL
•Oracle NoSQL Database
•Oracle Data Integrator
•Oracle R Distribution
•OBIEE, BI Publisher and Endeca Info Discovery



Oracle Big Data Appliance

•Engineered system for big data processing and analysis
•Optimized for enterprise Hadoop workloads
•288 Intel® Xeon® E5 Processors
•1152 GB total memory
•648TB total raw storage capacity
‣Cloudera Distribution of Hadoop
‣Cloudera Manager
‣Open-source R
‣Oracle NoSQL Database Community Edition
‣Oracle Enterprise Linux + Oracle JVM
‣New - Oracle Big Data SQL



Part of the Wider Engineered Systems Platform



Oracle NoSQL Database

•Key-Value store database built on BerkeleyDB
•Aimed at applications that need to store high volumes of schema-on-read data
•Typically works in conjunction with real-time event capture and processing
•Highly-scalable, choice over consistency modes



Just Released - Oracle Big Data SQL

•Part of Oracle Big Data 4.0 (BDA-only)
‣Also requires Oracle Database 12c, Oracle Exadata Database Machine
•Extends Oracle Data Dictionary to cover Hive
•Extends Oracle SQL and SmartScan to Hadoop
•Extends Oracle Security Model over Hadoop
‣Fine-grained access control
‣Data redaction, data masking

[Diagram: Oracle Big Data SQL - SQL queries from the Exadata Database Server, with SmartScan on both the Exadata Storage Servers and the Hadoop Cluster]



Coming Soon : Oracle Big Data Discovery

•Combines Endeca Server search, analysis and visualisation capabilities with Apache Spark data munging and transformation
‣Analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine
‣Create a catalog of the data on your Hadoop cluster, then search that catalog using Endeca Server
‣Create recommendations of other datasets, based on what you’re looking at now
‣Visualize your datasets, discover new insights



Coming Soon : Oracle Data Enrichment Cloud Service

•Cloud-based service for loading, enriching, cleansing and supplementing Hadoop data
•Part of the Oracle Data Integration product family
•Used up-stream from Big Data Discovery
•Aims to solve the “data quality problem” for Hadoop



Bringing it All Together : Oracle Data Integrator 12c

•ODI provides an excellent framework for running Hadoop ETL jobs
‣ELT approach pushes transformations down to Hadoop - leveraging power of cluster
•Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation
‣Whilst still preserving RDBMS push-down
‣Extensible to cover Pig, Spark etc
•Process orchestration
•Data quality / error handling
•Metadata and model-driven



Oracle & Hadoop Use-Cases

•Use Hadoop as a low-cost, horizontally-scalable DW archive
•Use Hadoop, Hive and MapReduce for low-cost ETL staging
•Support standalone-Hadoop analysis with Oracle reference data
•Extend the DW with new data sources, datatypes, detail-level data



New Oracle Information Management Ref Architecture

[Diagram labels: Input Events; Events & Data; Structured Enterprise Data; Other Data; Event Engine; Data Reservoir; Data Factory; Enterprise Information Store; Reporting; Discovery Lab; Discovery Output; Execution; Innovation; Actionable Events; Actionable Information; Actionable Insights]



Why the Word “Reservoir”?

•A reservoir is a lake that can also process and refine (your data)
•Wide-ranging source of low-density, lower-value data to complement the DW



Benefits of a Data Reservoir



Information Management Architecture - Logical View

Virtualization & Query Federation

Enterprise Performance Management

Pre-built & Ad-hoc BI Assets

Information Services

Data Ingestion

Information Interpretation

•Access & Performance Layer
‣Past, current and future interpretation of enterprise data
‣Structured to support agile access & navigation
•Foundation Data Layer
‣Immutable modelled data
‣Business Process Neutral form
‣Abstracted from business process changes
•Raw Data Reservoir
‣Immutable raw data reservoir
‣Raw data at rest is not interpreted
•Discovery Lab Sandboxes
‣Project based data stores to support specific discovery objectives
•Rapid Development Sandboxes
‣Project based data stored to facilitate rapid content / presentation delivery

Data Science

Data Sources

•Data Engines & Poly-structured sources
‣Content
‣Docs
‣Web & Social Media
‣SMS
•Structured Data Sources
‣Operational Data
‣COTS Data
‣Master & Ref. Data
‣Streaming & BAM



Data Layers - Cost, Quality and Concurrency Trade-off



Three Stages of BI/DW Hadoop Development (Now)

•Data Loading
‣Real-time via Flume Conf scripts
‣Batch via Sqoop cmd-line exec
•Data prep via R scripts, Python scripts etc (a.k.a. “data munging”)
•Data analysis via R scripts, Python scripts, Pig, Spark etc (a.k.a. “the magic”)
•Sharing output via Hive tables, Impala tables, HDFS files etc
•Data Export
‣Batch via Sqoop cmd-line exec

[Diagram labels: “Discovery” phase; “Exploitation” phase]



Three Stages of BI/DW Hadoop Development (Future)

•Data loading via Oracle Data Enrichment / ODI / GoldenGate > Flume (a.k.a. “data munging”)
•Data analysis via ODECS and Big Data Discovery (a.k.a. “the magic”)
•Sharing output via Big Data SQL
•Data Export
‣Batch via ODI / Oracle Big Data Connectors

[Diagram labels: “Discovery” phase; “Exploitation” phase]



What We’ll Cover in this Seminar…

•Lesson 2 : Getting data into Hadoop
‣Real-time and batch, from log/event activity and from files/databases
‣Using Apache Hadoop tools, and using Oracle Data Integrator / Oracle GoldenGate
•Lesson 3 : Processing, Analyzing and Transforming Data in Hadoop
‣Using Apache Hive, Apache Pig and Apache Spark
‣Using Hive, Pig and Spark functionality to transform data into Information
‣Where R and Oracle R Advanced Analytics for Hadoop can be used
‣Using the Oracle Big Data Connectors to export Hadoop data to Oracle RDBMS
‣Using Oracle Data Integrator and Oracle Big Data SQL to accelerate the process
•Lesson 4 : Visualizing Hadoop Datasets using OBIEE, BI Publisher and Endeca
‣Connecting OBIEE to Hadoop via Hive, Impala and Big Data Connectors
‣Visualizing R insights using BI Publisher and OBIEE


