
Page 1: Aioug  big data and hadoop

© Copyright 2016. Apps Associates LLC. 1

Big Data Overview & Hadoop for DBAs

Satyendra Pasalapudi, Associate Practice Director, Apps Associates LLC

Page 3: Aioug  big data and hadoop


www.ora-search.com

Page 4: Aioug  big data and hadoop


History of Data Management Systems (approximate timeline)

1940-50: Pre-computer technologies: printing press, Dewey decimal system, punched cards
1950-60: Magnetic tape, "flat" (sequential) files
1960-70: Magnetic disk, Indexed-Sequential Access Method (ISAM), hierarchical model (IMS), network model
1970-80: Relational model defined, IDMS, ADABAS, System R, Ingres, Oracle V2
1980-90: dBase, DB2, Informix, Sybase
1990-2000: SQL Server, Access, Postgres, MySQL
2000-2010: Cassandra, Hadoop, Vertica, Riak, HBase, Dynamo, MongoDB, Redis, VoltDB, HANA, Neo4j, Aerospike

Page 5: Aioug  big data and hadoop


Advantages of Cloud

Page 6: Aioug  big data and hadoop


Generational Change for Enterprise (IT)

Cloud supports mission-critical workloads
─ 87% of enterprises use cloud for mission-critical applications

Cloud use in the enterprise continues to grow
─ Half of enterprises say they will use cloud for at least 75% of their workloads by 2018

No one cloud fits all
─ More than half (53%) of enterprises use two to four cloud providers

Source: Verizon 2016 State of the Market: Enterprise Cloud report

Page 7: Aioug  big data and hadoop


Cloud – Probable to Inevitable

GE is undergoing the most important transformation in its 140-year history
─ 9,000 applications moving to AWS, being rationalized down to 4,000 applications
─ From 300 ERPs two years ago to a more manageable number
─ 34 data centers consolidated to 4 data centers

By 2020: US$15B of software revenue

Changes
─ People: reduce outsourcing
─ Technology: build an approach for the things that matter
─ 20% of applications in the cloud as of today
─ 70% of applications in the cloud by 2020

Source: AWS 2015 Keynote, Oct 6 2015

OOW Keynote with Mark Hurd, Oct 26 2015
─ Service management
─ Network perimeter
─ Risk-based security controls
─ Self-service and automation
─ Financial transparency

Page 8: Aioug  big data and hadoop


What is Cloud?

Page 9: Aioug  big data and hadoop

The Role of Data is Changing

Page 10: Aioug  big data and hadoop


Until now, the questions you ask drove the data model.

The new model is to collect as much data as possible first – a "data-first philosophy".

Page 11: Aioug  big data and hadoop


Data is the new raw material for any business, on par with capital, people, and labor.

Page 12: Aioug  big data and hadoop


Characteristics of Big Data

Page 13: Aioug  big data and hadoop


The Big Data challenge: cost-effectively manage and analyze all available data in its native form (unstructured, structured, streaming) from sources such as ERP, CRM, RFID, websites, network switches, social media, and billing.

Page 14: Aioug  big data and hadoop


Hybrid Cloud Framework

HR, FIN, SCOM, SALES, PROCUREMENT, PLANNING, DW/BI

Page 15: Aioug  big data and hadoop


Big Data Ecosystem

Page 16: Aioug  big data and hadoop


Not Easy to Get Analytic Value at a Fast Enough Pace

Tool complexity
• Early Hadoop tools were only for experts
• Existing BI tools were not designed for Hadoop
• Emerging solutions lack broad capabilities

Data uncertainty
• Not familiar and overwhelming
• Potential value not obvious
• Requires significant manipulation

80% of effort is typically spent on evaluating and preparing data, and teams are overly dependent on scarce, highly skilled resources.

Source : Oracle

Page 17: Aioug  big data and hadoop


Key Challenges in Managing Big Data
(Informatica study, May 2013; addressed by Oracle Big Data Discovery)

Page 18: Aioug  big data and hadoop


Sample of Big Data Use Cases Today

MEDIA / ENTERTAINMENT: viewer and advertising effectiveness, cross-sell
COMMUNICATIONS: location-based advertising
EDUCATION & RESEARCH: experiment sensor analysis
RETAIL / CPG: sentiment analysis, hot products, optimized marketing
HEALTH CARE: patient sensors, monitoring, EHRs, quality of care
LIFE SCIENCES: clinical trials, genomics
HIGH TECHNOLOGY / INDUSTRIAL MFG.: manufacturing quality, warranty analysis
OIL & GAS: drilling exploration sensor analysis
FINANCIAL SERVICES: risk & portfolio analysis, new products
AUTOMOTIVE: auto sensors reporting location and problems
GAMES: adjusting to player behavior, in-game ads
LAW ENFORCEMENT & DEFENSE: threat analysis (social media monitoring, photo analysis)
TRAVEL & TRANSPORTATION: sensor analysis for optimal traffic flows, customer sentiment
UTILITIES: smart meter analysis for network capacity
ON-LINE SERVICES / SOCIAL MEDIA: people & career matching, web-site optimization

What is the main difference in this data? Volume, velocity, variety. These characteristics challenge your existing architecture.

Page 19: Aioug  big data and hadoop


Big Data Verticals

Media/Advertising: targeted advertising, image and video processing
Oil & Gas: seismic analysis
Retail: recommendations, transaction analysis
Life Sciences: genome analysis
Financial Services: Monte Carlo simulations, risk analysis
Security: anti-virus, fraud detection, image recognition
Social Network/Gaming: user demographics, usage analysis, in-game metrics

Page 20: Aioug  big data and hadoop


Sample Enterprise Big Data Architecture

• Operational RDBMS (Oracle, SQL Server, …)
• In-memory analytics (HANA, Exalytics, …)
• In-memory processing (Spark)
• Hadoop
• Web DBMS (MySQL, Mongo, Cassandra)
• ERP & in-house CRM
• Analytic/BI software (SAS, Tableau, …)
• Web server
• Data warehouse RDBMS (Oracle, Teradata, …)

Page 21: Aioug  big data and hadoop


Enterprise Data Hub / Data Lake / Data Reservoir

Page 22: Aioug  big data and hadoop

We Need Tools Built Specifically for Big Data

Page 23: Aioug  big data and hadoop


Hadoop and Its Ecosystem

• Scale out Easily

• Parallel Computing

• Commodity Hardware

• Solves some Problems

• Complex to Run

• Special Skills to Maintain


Page 24: Aioug  big data and hadoop


ETL for Unstructured Data

Page 25: Aioug  big data and hadoop


ETL for Structured Data

Page 26: Aioug  big data and hadoop


Hadoop Design Principles

• System shall manage and heal itself

– Automatically and transparently route around failure

– Speculatively execute redundant tasks if certain nodes are detected to be slow

• Performance shall scale linearly

– Proportional change in capacity with resource change

• Compute should move to data

– Lower latency, lower bandwidth

• Simple core, modular and extensible

Page 27: Aioug  big data and hadoop


Hadoop History

• Dec 2004 – Google publishes the MapReduce paper (the GFS paper had appeared in 2003)

• July 2005 – Nutch uses MapReduce

• Feb 2006 – Starts as a Lucene subproject

• Apr 2007 – Yahoo! on 1000-node cluster

• Jan 2008 – An Apache Top Level Project

• Jul 2008 – A 4000 node test cluster

• May 2009 – Hadoop sorts Petabyte in 17 hours

Page 28: Aioug  big data and hadoop

Google Software Architecture (circa 2005): Google applications built on MapReduce and BigTable, which in turn run on the Google File System (GFS).

Page 29: Aioug  big data and hadoop

[Diagram: MapReduce execution – many Map tasks run in parallel on input splits, and their outputs are aggregated by Reduce tasks.]
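To make the map/reduce flow concrete, below is a minimal word-count sketch against the Hadoop MapReduce Java API. This is the standard introductory example rather than code from the deck; input and output paths are supplied on the command line.

// Classic word count with the Hadoop MapReduce Java API (a sketch).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split, close to the data.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1)
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}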

Page 30: Aioug  big data and hadoop


Hadoop Ecosystem

• HDFS (Hadoop Distributed File System)
• HBase (key-value store)
• MapReduce (job scheduling/execution system, with streaming/pipes APIs)
• Data access: Sqoop, Flume
• Client access: Hue, Hive (SQL), Pig (analogous to PL/SQL)
• Coordination: ZooKeeper
• Monitoring: Chukwa
• Data mining: Mahout
• Orchestration: Oozie
• Foundation: Java Virtual Machine, OS (Red Hat, SUSE, Ubuntu, Windows), commodity hardware, networking
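As an illustration of the Hive (SQL) layer listed above, here is a hedged sketch of querying HiveServer2 over its standard JDBC driver; the host, port, user, and web_logs table are placeholders, and the Hive JDBC driver jar is assumed to be on the classpath.

// A minimal sketch of querying Hive (HiveServer2) over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 commonly listens on port 10000; adjust for your cluster.
    String url = "jdbc:hive2://hive-host:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT source_country, COUNT(*) FROM web_logs GROUP BY source_country")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}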

Page 31: Aioug  big data and hadoop


Hadoop – Simplified View

• MPP (Massively Parallel) hardware running database-like software

• “Data” is stored in parts, across multiple worker nodes

• “Work” operates in parallel, on the different parts of the table

[Diagram: a controller node coordinating multiple worker nodes]

Page 32: Aioug  big data and hadoop


HDFS Architecture

Page 33: Aioug  big data and hadoop

HDFS Architecture

[Diagram: HDFS architecture. Clients issue metadata operations (file name, replica count, e.g. /home/foo/data, 6, …) to the NameNode; the NameNode issues block operations and replication instructions to DataNodes spread across racks (Rack 1, Rack 2); clients write blocks to and read blocks from the DataNodes directly.]

Page 34: Aioug  big data and hadoop


HDFS – Highly Available

[Diagram: MYFILE.TXT is tracked by the head node as block1, block2, block3; each block is stored on the data nodes (Data 1 – Data 4), with replicas on more than one node.]

Page 35: Aioug  big data and hadoop


Namenode and Datanodes

Master/slave architecture.

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

There are a number of DataNodes, usually one per node in the cluster. The DataNodes manage the storage attached to the nodes they run on.

HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and these blocks are stored on DataNodes.

DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
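As a minimal sketch of how a client uses this namespace, the snippet below writes and reads a file through the Hadoop FileSystem Java API; the NameNode URI and file path are assumptions for illustration. The client asks the NameNode for metadata and block locations, then streams data to or from the DataNodes.

// A sketch of an HDFS client using the Hadoop FileSystem API.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: the NameNode picks target DataNodes; the client streams the blocks to them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the client asks the NameNode for block locations,
    // then reads from the closest DataNode replica.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}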

Page 36: Aioug  big data and hadoop

Hadoop 1 – Job & Task Trackers

Master Node - Most Hadoop deployments consist of several master node instances. Having more than one master node helps eliminate the risk of a single point of failure.

NameNode - These processes store a directory tree of all files in the Hadoop Distributed File System (HDFS) and keep track of where the file data is kept within the cluster. Client applications contact the NameNode when they need to locate a file, or to add, copy, or delete a file.

DataNode - The DataNode stores data in HDFS and is responsible for replicating data across the cluster. DataNodes interact with client applications once the NameNode has supplied the DataNode's address.

WorkerNode - Unlike master nodes, whose numbers you can count on one hand, a representative Hadoop deployment consists of dozens or hundreds of worker nodes, which provide enough processing power to analyze from a few hundred terabytes up to a petabyte. Each worker node runs a DataNode as well as a TaskTracker.

Page 37: Aioug  big data and hadoop

Map Reduce

JobTracker / MapReduce workload management layer - This process interacts with client applications and is responsible for distributing MapReduce tasks to particular nodes within the cluster. It coordinates all aspects of Hadoop job execution, such as scheduling and launching jobs.

TaskTracker - A process in the cluster that is capable of receiving tasks (including Map, Reduce, and Shuffle) from a JobTracker.

Page 38: Aioug  big data and hadoop


Data Replication Similar to that of ASM

HDFS is designed to store very large files across machines in a large cluster.

Each file is a sequence of blocks.

All blocks in the file except the last are of the same size.

Blocks are replicated for fault tolerance.

Block size and replicas are configurable per file.

The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.

BlockReport contains all the blocks on a Datanode.
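Because block size and replication factor are configurable per file, here is a small sketch of setting both through the FileSystem Java API (reusing a FileSystem handle like the one in the earlier sketch); the 128 MB block size, buffer size, and replica counts are illustrative values, not recommendations from the deck.

// A sketch of per-file block size and replication settings.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  static void configure(FileSystem fs, Path file) throws Exception {
    // Create a file with an explicit 128 MB block size and 3 replicas.
    long blockSize = 128L * 1024 * 1024;
    short replication = 3;
    fs.create(file, true, 4096, replication, blockSize).close();

    // Raise the replication factor of an existing file to 4.
    fs.setReplication(file, (short) 4);
  }
}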

Page 39: Aioug  big data and hadoop


Replica Placement & Rack Aware

The placement of the replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from other distributed file systems. Rack-aware replica placement:

Goal: improve reliability, availability and network bandwidth utilization

A cluster spans many racks, and communication between racks goes through switches. Network bandwidth between machines on the same rack is greater than between machines on different racks. The NameNode determines the rack ID of each DataNode.

Placing every replica on a unique rack is simple but non-optimal, because writes become expensive. With the default replication factor of 3, replicas are instead placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.

Page 40: Aioug  big data and hadoop


Replica Selection

• Replica selection for READ operation: HDFS tries to minimize the bandwidth consumption and latency.

• If there is a replica on the Reader node then that is preferred.

• HDFS cluster may span multiple data centers: replica in the local data center is preferred over the remote one.

Page 41: Aioug  big data and hadoop


Hadoop Components

• Hadoop is bundled with two independent components

– HDFS (Hadoop Distributed File System)

• Designed for scaling in terms of storage and IO bandwidth

– MR framework (MapReduce)

• Designed for scaling in terms of performance

Page 42: Aioug  big data and hadoop


Understanding file structure

A 1 GB file is split into blocks. Each block is typically 64 MB. On disk, each block is stored as two files: one holding the data and a second holding metadata and checksums.

Page 43: Aioug  big data and hadoop


Hadoop Processes

• Processes running on Hadoop

– NameNode

– DataNode

– Secondary NameNode

– Task Tracker

– Job Tracker

Page 44: Aioug  big data and hadoop


NameNode

• Single point of contact

• HDFS master

• Holds meta information

– List of files and directories

– Location of blocks

• Single node per cluster

– Cluster can have thousands of DataNodes and tens of thousands of HDFS clients.

Page 45: Aioug  big data and hadoop


DataNode

• Can execute multiple tasks concurrently

• Holds actual data blocks, checksum and generation stamp

• If a block is half full, it needs only half the space of a full block

• At start-up, connects to NameNode and perform handshake

• No binding to IP address or port, uses Storage ID

• Sends heartbeat to NameNode


Page 46: Aioug  big data and hadoop


Communication

• Every 3 seconds, each DataNode (identified by a Storage ID such as XYZ001, XYZ002, XYZ003) sends a heartbeat ("I am alive") to the NameNode, reporting its total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.

• In its reply, the NameNode can instruct a DataNode to replicate a block to another node, remove a local block replica, send an immediate block report, or shut down.

• If no heartbeat arrives for 10 minutes, the NameNode considers the DataNode out of service.

Page 47: Aioug  big data and hadoop


Page 48: Aioug  big data and hadoop

Coordination in a distributed system

• Coordination: An act that multiple nodes must perform together.

• Examples:

– Group membership

– Locking

– Publisher/Subscriber

– Leader Election

– Synchronization

• Getting node coordination correct is very hard!

Page 49: Aioug  big data and hadoop
Page 50: Aioug  big data and hadoop

Introducing ZooKeeper

"ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers."
- ZooKeeper Wiki

ZooKeeper is much more than a distributed lock server!

Page 51: Aioug  big data and hadoop

What is ZooKeeper?

• An open source, high-performance coordination service for distributed applications.

• Exposes common services through a simple interface:
  – naming
  – configuration management
  – locks & synchronization
  – group services
  … so developers don't have to write them from scratch.

• Build your own services on top of it for specific needs.
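As a small, hedged sketch of what "configuration management plus watches" looks like in practice, the snippet below uses the standard ZooKeeper Java client; the connection string, znode path, and payload are assumptions for illustration.

// A sketch using the ZooKeeper Java client: connect, publish config, watch it.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Session-level watcher just logs connection events here.
    // (A production client would wait for the connected event before issuing calls.)
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000,
        (WatchedEvent e) -> System.out.println("session event: " + e.getState()));

    // Publish a piece of shared configuration as a persistent znode.
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", "replication=3".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read it back and leave a watch that fires when the data changes.
    byte[] data = zk.getData("/app-config",
        (WatchedEvent e) -> System.out.println("config changed: " + e.getType()),
        null);
    System.out.println(new String(data));

    zk.close();
  }
}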

Page 52: Aioug  big data and hadoop


HDFS Distributions

Page 53: Aioug  big data and hadoop


Real Time BI

• Speed, agility, and intelligence are competitive advantages that nearly all organizations seek.

• Existing Traditional Reporting Systems provide information after 24 – 36 hours.

• To support Operational Users and influence what should happen next, the data should be available in real time to know what is happening now.

Page 55: Aioug  big data and hadoop

Hadoop 1 (2006): Hadoop with MapReduce running directly on HDFS across nodes 1..N; largely a batch system, with silo'd clusters that were difficult to integrate.

Hadoop 2 & YARN-based architecture (MR-279: YARN; released October 23, 2013): YARN becomes the data operating system on top of HDFS, so a single cluster of nodes 1..N runs interactive, real-time, and batch workloads, enabling the modern data architecture.

Page 56: Aioug  big data and hadoop


Hadoop 2.0

A multi-use data platform: batch, interactive, real-time, online, streaming, …

HADOOP 2
• Redundant, reliable storage: HDFS
• Efficient cluster resource management & shared services: YARN
• Engines on YARN: standard query processing (Hive), batch (MapReduce), online data processing, interactive (Tez), real-time stream processing, and others

Page 57: Aioug  big data and hadoop


Hadoop 2.0 with YARN

Page 58: Aioug  big data and hadoop


Resource Manager/Node Manager Components

Page 59: Aioug  big data and hadoop


Problems with this approach in Hadoop 1.0

It limits scalability: the JobTracker runs on a single machine and performs several tasks:

1) Resource management
2) Job and task scheduling
3) Monitoring

Although many machines (DataNodes) are available, they are not used for these functions, which limits scalability.

Availability issue: in Hadoop 1.0 the JobTracker is a single point of failure. If the JobTracker fails, all running jobs must restart.

Distinct map slots and reduce slots that cannot be shared between workload types.

Limited ability to run non-MapReduce applications.

Page 60: Aioug  big data and hadoop


YARN Architecture

Resource Manager: arbitrates the division of resources among all the applications in the system. The Resource Manager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications.

Node Manager: a per-machine agent that runs on the worker (slave) nodes. It launches the applications' containers, monitors their resource usage (CPU, memory, disk, network), and reports the same to the Resource Manager.

Application Master: negotiates appropriate resource containers from the scheduler, tracks their status, and monitors progress.

Container: the unit of allocation, incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application (similar to map/reduce slots in MRv1).

Page 61: Aioug  big data and hadoop


YARN - Execution Sequence

1) A client program submits the application

2) ResourceManager allocates a specified container to start the ApplicationMaster

3) ApplicationMaster, on boot-up, registers with ResourceManager

4) ApplicationMaster negotiates with ResourceManager for appropriate resource containers

5) On successful container allocations, ApplicationMaster contacts NodeManager to launch the container

6) Application code executes within the container, which reports its execution status back to the ApplicationMaster

7) During execution, the client communicates directly with ApplicationMaster or ResourceManager to get status, progress updates etc.

8) Once the application is complete, the ApplicationMaster unregisters with the ResourceManager and shuts down, releasing its own container
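A rough sketch of steps 1 and 2 from the client's point of view, using the YarnClient API, follows; the application name, queue, ApplicationMaster class, launch command, and resource sizes are placeholders rather than anything prescribed by the deck (a real submission would also set up local resources and environment).

// A sketch of submitting an application to the ResourceManager with YarnClient.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitExample {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Step 1: ask the ResourceManager for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");
    ctx.setQueue("default");

    // Step 2: describe the container that will run the ApplicationMaster.
    // com.example.DemoApplicationMaster is a hypothetical class for illustration.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "java -Xmx512m com.example.DemoApplicationMaster"));
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));

    // Submit; the ResourceManager then allocates the AM container.
    yarnClient.submitApplication(ctx);
    yarnClient.stop();
  }
}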

Page 62: Aioug  big data and hadoop


Operational vs. Analytical Databases

Page 63: Aioug  big data and hadoop


A New Technology

Page 64: Aioug  big data and hadoop

No Means Yes!

Page 65: Aioug  big data and hadoop


Use Cases

Page 66: Aioug  big data and hadoop


Brewer's CAP Theorem

Page 67: Aioug  big data and hadoop


Brewer's CAP Theorem

Page 68: Aioug  big data and hadoop


NoSQL Technology Spectrum

Page 69: Aioug  big data and hadoop

BigTable Data Model

Raw data (one row per user/site pair):

Name   Site              Counter
Dick   Ebay              507,018
Dick   Google            690,414
Jane   Google            716,426
Dick   Facebook          723,649
Jane   Facebook          643,261
Jane   ILoveLarry.com    856,767
Dick   MadBillFans.com   675,230

Normalized relational form:

NameId   Name
1        Dick
2        Jane

SiteId   SiteName
1        Ebay
2        Google
3        Facebook
4        ILoveLarry.com
5        MadBillFans.com

NameId   SiteId   Counter
1        1        507,018
1        2        690,414
2        2        716,426
1        3        723,649
2        3        643,261
2        4        856,767
1        5        675,230

BigTable (wide-row) form, one sparse row per user with one column per visited site:

Id   Name   Ebay      Google    Facebook   (other columns)   MadBillFans.com
1    Dick   507,018   690,414   723,649    . . .             675,230

Id   Name   Google    Facebook   (other columns)   ILoveLarry.com
2    Jane   716,426   643,261    . . .             856,767
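The same wide-row idea can be sketched with the HBase Java client (HBase is the open-source BigTable-style store listed later in this deck); the page_counts table, its site column family, and the connection settings are assumptions for illustration, and the table is assumed to already exist.

// A sketch of the wide-row model above using the HBase Java client.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PageCountsExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("page_counts"))) {

      // One row per user; each visited site becomes a column in the family "site".
      Put put = new Put(Bytes.toBytes("dick"));
      put.addColumn(Bytes.toBytes("site"), Bytes.toBytes("Ebay"), Bytes.toBytes(507018L));
      put.addColumn(Bytes.toBytes("site"), Bytes.toBytes("Google"), Bytes.toBytes(690414L));
      table.put(put);

      // Read back a single cell from the sparse, column-oriented row.
      Result row = table.get(new Get(Bytes.toBytes("dick")));
      long ebay = Bytes.toLong(row.getValue(Bytes.toBytes("site"), Bytes.toBytes("Ebay")));
      System.out.println("Ebay counter = " + ebay);
    }
  }
}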

Page 70: Aioug  big data and hadoop

Document databases

• Structured documents – XML and JSON (JavaScript Object Notation) – become more prevalent within applications
• Web programmers start storing these in BLOBs in MySQL
• Emergence of XML and JSON databases
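As a hedged sketch of the document model, the snippet below stores and queries a JSON-style document with the MongoDB Java (sync) driver; the connection string, database, collection, and field names are placeholders.

// A sketch of storing and querying a JSON-style document with the MongoDB Java driver.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class DocumentStoreExample {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> orders =
          client.getDatabase("shop").getCollection("orders");

      // The whole nested structure is stored as one document; no BLOB column is needed.
      orders.insertOne(new Document("orderId", 1001)
          .append("customer", new Document("name", "Jane").append("country", "Brazil"))
          .append("items", java.util.Arrays.asList("book", "pen")));

      // Query by a nested field using a dotted path.
      Document found = orders.find(eq("customer.name", "Jane")).first();
      System.out.println(found != null ? found.toJson() : "not found");
    }
  }
}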

Page 71: Aioug  big data and hadoop

Graph databases: Neo4j, InfiniteGraph, FlockDB

Document databases
• JSON based: MongoDB, CouchDB, RethinkDB
• XML based: MarkLogic, Berkeley DB XML

Key-value stores: MemcacheDB, Oracle NoSQL, Dynamo, Voldemort, DynamoDB, Riak

Table based: BigTable, Cassandra, HBase, Hypertable, Accumulo

Page 72: Aioug  big data and hadoop


Multiple Data Stores

Relational – Run the Business
• Scale-out and scale-up
• Collect any data
• SQL
• Transactional and analytic applications for the enterprise
• Secure and highly available

Hadoop – Change the Business
• Scale-out, low-cost store
• Collect any data
• MapReduce, SQL
• Analytic applications

NoSQL – Scale the Business
• Scale-out, low-cost store
• Collect key-value data
• Find data by key
• Web applications

Page 73: Aioug  big data and hadoop


Data Analytics Challenge

Separate silos of information to analyze

Page 74: Aioug  big data and hadoop


Data Analytics Challenge

Separate data access interfaces

Page 75: Aioug  big data and hadoop


SQL on Hadoop is Obvious

Stinger

Page 76: Aioug  big data and hadoop


Data Analytics Challenge

No comprehensive SQL interface across Oracle, Hadoop and NoSQL

Page 77: Aioug  big data and hadoop


Oracle Big Data Management System

Rich, comprehensive SQL access to all enterprise data

NoSQL

Page 78: Aioug  big data and hadoop


What Does Unified Query Mean for You?

[Diagram: Data science – Before: a PhD; After: anyone.]

Page 79: Aioug  big data and hadoop


What Does Unified Query Mean for You?

[Diagram: Application development – Before vs. After unified query.]

Page 80: Aioug  big data and hadoop


A New Hadoop Processing Engine

• Storage layer: filesystem (HDFS) and NoSQL databases (Oracle NoSQL DB, HBase)
• Resource management: YARN
• Processing layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
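As a rough illustration of the in-memory processing layer, here is a sketch of the same word count using Spark's Java API (assuming it is submitted with spark-submit, which supplies the master); the HDFS input and output paths are placeholders, not values from the deck.

// A sketch of word count on Spark's Java API instead of batch MapReduce.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // The master URL is supplied by spark-submit (e.g. yarn).
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);   // aggregation happens in memory

    counts.saveAsTextFile("hdfs:///user/demo/output");
    sc.stop();
  }
}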

Page 81: Aioug  big data and hadoop


Big Data SQL

SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id;

• The relevant part of the SQL runs on the BDA (Hadoop cluster) nodes, which hold tens of gigabytes of data (WEB_LOGS).
• Only the columns and rows needed to answer the query are returned to the Oracle Database (CUSTOMERS).

Page 82: Aioug  big data and hadoop


Big Data SQL

SQL Push Down in Big Data SQL (same query and data flow as the previous slide)

• Hadoop scans on unstructured data
• WHERE clause evaluation
• Column projection
• Bloom filters for better join performance
• JSON parsing, data mining model evaluation

Page 83: Aioug  big data and hadoop


Query All Data without Application Change or Data Conversion

Oracle Big Data SQL

Page 84: Aioug  big data and hadoop

High-Level Architecture: Ingest, Store, Process, Analyze, Visualize

Page 86: Aioug  big data and hadoop


BDD Value Proposition

Note: company logos and images are for illustration purposes only. Not a real use case for the company.

Page 87: Aioug  big data and hadoop


Oracle BDD - Technical Innovation on Hadoop

Oracle Big Data Discovery workloads run directly on the Hadoop cluster (BDA or commodity hardware), on Hadoop 2.x (HDFS filesystem, YARN workload management, HCatalog metadata), alongside other Hadoop workloads (MapReduce, Spark, Hive, Pig, and Oracle Big Data SQL on BDA only). A BDD node sits alongside the name node and data nodes.

Data processing, workflow & monitoring
• Profiling: catalog entry creation, data type & language detection, schema configuration
• Sampling: dgraph (index) file creation
• Transforms: >100 functions
• Enrichments: location (geo), text (cleanup, sentiment, entity, key-phrase, whitelist tagging)

Self-service provisioning & data transfer
• Personal data: upload CSV and XLS to HDFS

In-memory discovery indexes
• DGraph: search, guided navigation, analytics

Studio
• Web UI: find, explore, transform, discover, share

Page 88: Aioug  big data and hadoop


Sample Enterprise Big Data Architecture

• Operational RDBMS (Oracle, SQL Server, …)
• In-memory analytics (HANA, Exalytics, …)
• In-memory processing (Spark)
• Hadoop
• Web DBMS (MySQL, Mongo, Cassandra)
• ERP & in-house CRM
• Analytic/BI software (SAS, Tableau, …)
• Web server
• Data warehouse RDBMS (Oracle, Teradata, …)

Page 89: Aioug  big data and hadoop


How to transition into a Cloud Consultant

Cloud Consultant = Core Skills (50%) + Cloud Knowledge (20%) + Tools & Integration (20%) + Automation (10%)

Page 90: Aioug  big data and hadoop


Page 91: Aioug  big data and hadoop

Thank You! [email protected]

@pasalapudi

https://community.oracle.com/groups/aioug-social-group

Page 92: Aioug  big data and hadoop


www.ora-search.com