
Page 1: Aioug  big data and hadoop

© Copyright 2016. Apps Associates LLC. 1

Big Data Overview & Hadoop for DBAs

Satyendra Pasalapudi, Associate Practice Director, Apps Associates LLC

Page 3: Aioug  big data and hadoop


www.ora-search.com

Page 4: Aioug  big data and hadoop


History of Data Management Systems (approximate timeline)

1940-50: Pre-computer technologies: printing press, Dewey decimal system, punched cards
1950-60: Magnetic tape, "flat" (sequential) files
1960-70: Magnetic disk, Indexed-Sequential Access Method (ISAM), hierarchical model (IMS), network model
1970-80: Relational model defined, IDMS, ADABAS, System R, Ingres, Oracle V2
1980-90: dBase, DB2, Informix, Sybase
1990-2000: SQL Server, Access, Postgres, MySQL
2000-2010: Cassandra, Hadoop, Vertica, Riak, HBase, Dynamo, MongoDB, Redis, VoltDB, HANA, Neo4j, Aerospike

Page 5: Aioug  big data and hadoop


Advantages of Cloud

Page 6: Aioug  big data and hadoop


Generational Change for Enterprise (IT)

Cloud supports mission-critical workloads
─ 87% of enterprises use cloud for mission-critical applications

Cloud use in the enterprise continues to grow
─ Half of enterprises say they will use cloud for at least 75% of their workloads by 2018

No one cloud fits all
─ More than half (53%) of enterprises use two to four cloud providers

Source: Verizon 2016 State of the Market: Enterprise Cloud report

Page 7: Aioug  big data and hadoop


Cloud – Probable to Inevitable

GE is undergoing the most important transformation in its 140-year history
─ 9,000 applications moving to AWS, being rationalized down to 4,000 applications
─ From 300 ERPs two years ago to a more manageable number
─ 34 data centers consolidated to 4 data centers

By 2020: US$15B of software revenue

Changes
─ People: reduce outsourcing
─ Technology: build an approach for the things that matter
─ 20% of applications in the cloud as of today
─ 70% of applications in the cloud by 2020

Source: AWS 2015 Keynote, Oct 6 2015

OOW Keynote with Mark Hurd, Oct 26 2015
─ Service management
─ Network perimeter
─ Risk-based security controls
─ Self-service and automation
─ Financial transparency

Page 8: Aioug  big data and hadoop


What is Cloud?

Page 9: Aioug  big data and hadoop

The Role of Data is Changing

Page 10: Aioug  big data and hadoop


Until now, the questions you ask drove the data model.

The new model is to collect as much data as possible first – a "data-first philosophy".

Page 11: Aioug  big data and hadoop


Data is the new raw material for any business, on par with capital, people, and labor.

Page 12: Aioug  big data and hadoop


Characteristics of Big Data

Page 13: Aioug  big data and hadoop


The Big Data challenge: cost-effectively manage and analyze all available data in its native form (unstructured, structured, streaming) from sources such as ERP, CRM, RFID, websites, network switches, social media, and billing.

Page 14: Aioug  big data and hadoop


Hybrid Cloud Framework

HR, FIN, SCOM, SALES, PROCUREMENT, PLANNING, DW/BI

Page 15: Aioug  big data and hadoop


Big Data Ecosystem

Page 16: Aioug  big data and hadoop


Not Easy to Get Analytic Value at a Fast Enough Pace

Tool complexity
• Early Hadoop tools were only for experts
• Existing BI tools were not designed for Hadoop
• Emerging solutions lack broad capabilities

Data uncertainty
• Not familiar and overwhelming
• Potential value not obvious
• Requires significant manipulation

80% of effort is typically spent on evaluating and preparing data, and teams are overly dependent on scarce, highly skilled resources.

Source : Oracle

Page 17: Aioug  big data and hadoop


Key Challenges in Managing Big Data
(Informatica study, May 2013; addressed by Oracle Big Data Discovery)

Page 18: Aioug  big data and hadoop


Sample of Big Data Use Cases Today

MEDIA / ENTERTAINMENT: viewer and advertising effectiveness, cross-sell
COMMUNICATIONS: location-based advertising
EDUCATION & RESEARCH: experiment sensor analysis
RETAIL / CPG: sentiment analysis, hot products, optimized marketing
HEALTH CARE: patient sensors, monitoring, EHRs, quality of care
LIFE SCIENCES: clinical trials, genomics
HIGH TECHNOLOGY / INDUSTRIAL MFG.: manufacturing quality, warranty analysis
OIL & GAS: drilling exploration sensor analysis
FINANCIAL SERVICES: risk & portfolio analysis, new products
AUTOMOTIVE: auto sensors reporting location and problems
GAMES: adjusting to player behavior, in-game ads
LAW ENFORCEMENT & DEFENSE: threat analysis (social media monitoring, photo analysis)
TRAVEL & TRANSPORTATION: sensor analysis for optimal traffic flows, customer sentiment
UTILITIES: smart meter analysis for network capacity
ON-LINE SERVICES / SOCIAL MEDIA: people & career matching, web-site optimization

What is the main difference in this data? Volume, velocity, variety. These characteristics challenge your existing architecture.

Page 19: Aioug  big data and hadoop


Big Data Verticals

Media/Advertising: targeted advertising, image and video processing
Oil & Gas: seismic analysis
Retail: recommendations, transaction analysis
Life Sciences: genome analysis
Financial Services: Monte Carlo simulations, risk analysis
Security: anti-virus, fraud detection, image recognition
Social Network/Gaming: user demographics, usage analysis, in-game metrics

Page 20: Aioug  big data and hadoop


Sample Enterprise Big Data Architecture

• Operational RDBMS (Oracle, SQL Server, …)
• In-memory analytics (HANA, Exalytics, …)
• In-memory processing (Spark)
• Hadoop
• Web DBMS (MySQL, Mongo, Cassandra)
• ERP & in-house CRM
• Analytic/BI software (SAS, Tableau, …)
• Web server
• Data warehouse RDBMS (Oracle, Teradata, …)

Page 21: Aioug  big data and hadoop


Enterprise Data Hub / Data Lake / Data Reservoir

Page 22: Aioug  big data and hadoop

We Need Tools Built Specifically for Big Data

Page 23: Aioug  big data and hadoop


Hadoop and Its Ecosystem

• Scale out Easily

• Parallel Computing

• Commodity Hardware

• Solves some Problems

• Complex to Run

• Special Skills to Maintain


Page 24: Aioug  big data and hadoop


ETL for Unstructured Data

Page 25: Aioug  big data and hadoop


ETL for Structured Data

Page 26: Aioug  big data and hadoop


Hadoop Design Principles

• System shall manage and heal itself

– Automatically and transparently route around failure

– Speculatively execute redundant tasks if certain nodes are detected to be slow

• Performance shall scale linearly

– Proportional change in capacity with resource change

• Compute should move to data

– Lower latency, lower bandwidth

• Simple core, modular and extensible

Page 27: Aioug  big data and hadoop


Hadoop History

• Dec 2004 – Google publishes the MapReduce paper (the GFS paper had appeared in 2003)

• July 2005 – Nutch uses MapReduce

• Feb 2006 – Starts as a Lucene subproject

• Apr 2007 – Yahoo! on 1000-node cluster

• Jan 2008 – An Apache Top Level Project

• Jul 2008 – A 4000 node test cluster

• May 2009 – Hadoop sorts Petabyte in 17 hours

Page 28: Aioug  big data and hadoop

Google Software Architecture (circa 2005): Google applications built on MapReduce and BigTable, which in turn run on the Google File System (GFS).

Page 29: Aioug  big data and hadoop

[Diagram: MapReduce execution – many Map tasks run in parallel on input splits, and their outputs are aggregated by Reduce tasks.]
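To make the map/reduce flow concrete, below is a minimal word-count sketch against the Hadoop MapReduce Java API. This is the standard introductory example rather than code from the deck; input and output paths are supplied on the command line.

// Classic word count with the Hadoop MapReduce Java API (a sketch).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split, close to the data.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit (word, 1)
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}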

Page 30: Aioug  big data and hadoop


Hadoop Ecosystem

• HDFS (Hadoop Distributed File System)
• HBase (key-value store)
• MapReduce (job scheduling/execution system, with streaming/pipes APIs)
• Data access: Sqoop, Flume
• Client access: Hue, Hive (SQL), Pig (analogous to PL/SQL)
• Coordination: ZooKeeper
• Monitoring: Chukwa
• Data mining: Mahout
• Orchestration: Oozie
• Foundation: Java Virtual Machine, OS (Red Hat, SUSE, Ubuntu, Windows), commodity hardware, networking
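As an illustration of the Hive (SQL) layer listed above, here is a hedged sketch of querying HiveServer2 over its standard JDBC driver; the host, port, user, and web_logs table are placeholders, and the Hive JDBC driver jar is assumed to be on the classpath.

// A minimal sketch of querying Hive (HiveServer2) over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 commonly listens on port 10000; adjust for your cluster.
    String url = "jdbc:hive2://hive-host:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT source_country, COUNT(*) FROM web_logs GROUP BY source_country")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}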

Page 31: Aioug  big data and hadoop


Hadoop – Simplified View

• MPP (Massively Parallel) hardware running database-like software

• “Data” is stored in parts, across multiple worker nodes

• “Work” operates in parallel, on the different parts of the table

[Diagram: a controller node coordinating multiple worker nodes]

Page 32: Aioug  big data and hadoop


HDFS Architecture

Page 33: Aioug  big data and hadoop

HDFS Architecture

[Diagram: HDFS architecture. Clients issue metadata operations (file name, replica count, e.g. /home/foo/data, 6, …) to the NameNode; the NameNode issues block operations and replication instructions to DataNodes spread across racks (Rack 1, Rack 2); clients write blocks to and read blocks from the DataNodes directly.]

Page 34: Aioug  big data and hadoop


HDFS – Highly Available

[Diagram: MYFILE.TXT is tracked by the head node as block1, block2, block3; each block is stored on the data nodes (Data 1 – Data 4), with replicas on more than one node.]

Page 35: Aioug  big data and hadoop


Namenode and Datanodes

Master/slave architecture.

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

There are a number of DataNodes, usually one per node in the cluster. The DataNodes manage the storage attached to the nodes they run on.

HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and these blocks are stored on DataNodes.

DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.
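As a minimal sketch of how a client uses this namespace, the snippet below writes and reads a file through the Hadoop FileSystem Java API; the NameNode URI and file path are assumptions for illustration. The client asks the NameNode for metadata and block locations, then streams data to or from the DataNodes.

// A sketch of an HDFS client using the Hadoop FileSystem API.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: the NameNode picks target DataNodes; the client streams the blocks to them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the client asks the NameNode for block locations,
    // then reads from the closest DataNode replica.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}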

Page 36: Aioug  big data and hadoop

Hadoop 1 – Job & Task Trackers

Master Node - Most Hadoop deployments consist of several master node instances. Having more than one master node helps eliminate the risk of a single point of failure.

NameNode - These processes store a directory tree of all files in the Hadoop Distributed File System (HDFS) and keep track of where the file data is kept within the cluster. Client applications contact the NameNode when they need to locate a file, or to add, copy, or delete a file.

DataNode - The DataNode stores data in HDFS and is responsible for replicating data across the cluster. DataNodes interact with client applications once the NameNode has supplied the DataNode's address.

WorkerNode - Unlike master nodes, whose numbers you can count on one hand, a representative Hadoop deployment consists of dozens or hundreds of worker nodes, which provide enough processing power to analyze from a few hundred terabytes up to a petabyte. Each worker node runs a DataNode as well as a TaskTracker.

Page 37: Aioug  big data and hadoop

Map Reduce

JobTracker / MapReduce workload management layer - This process interacts with client applications and is responsible for distributing MapReduce tasks to particular nodes within the cluster. It coordinates all aspects of Hadoop job execution, such as scheduling and launching jobs.

TaskTracker - A process in the cluster that is capable of receiving tasks (including Map, Reduce, and Shuffle) from a JobTracker.

Page 38: Aioug  big data and hadoop


Data Replication Similar to that of ASM

HDFS is designed to store very large files across machines in a large cluster.

Each file is a sequence of blocks.

All blocks in the file except the last are of the same size.

Blocks are replicated for fault tolerance.

Block size and replicas are configurable per file.

The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.

BlockReport contains all the blocks on a Datanode.
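Because block size and replication factor are configurable per file, here is a small sketch of setting both through the FileSystem Java API (reusing a FileSystem handle like the one in the earlier sketch); the 128 MB block size, buffer size, and replica counts are illustrative values, not recommendations from the deck.

// A sketch of per-file block size and replication settings.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  static void configure(FileSystem fs, Path file) throws Exception {
    // Create a file with an explicit 128 MB block size and 3 replicas.
    long blockSize = 128L * 1024 * 1024;
    short replication = 3;
    fs.create(file, true, 4096, replication, blockSize).close();

    // Raise the replication factor of an existing file to 4.
    fs.setReplication(file, (short) 4);
  }
}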

Page 39: Aioug  big data and hadoop


Replica Placement & Rack Aware

The placement of the replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from other distributed file systems. Rack-aware replica placement:

Goal: improve reliability, availability and network bandwidth utilization

A cluster spans many racks, and communication between racks goes through switches. Network bandwidth between machines on the same rack is greater than between machines on different racks. The NameNode determines the rack ID of each DataNode.

Placing every replica on a unique rack is simple but non-optimal, because writes become expensive. With the default replication factor of 3, replicas are instead placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.

Page 40: Aioug  big data and hadoop


Replica Selection

• Replica selection for READ operation: HDFS tries to minimize the bandwidth consumption and latency.

• If there is a replica on the Reader node then that is preferred.

• HDFS cluster may span multiple data centers: replica in the local data center is preferred over the remote one.

Page 41: Aioug  big data and hadoop


Hadoop Components

• Hadoop is bundled with two independent components

– HDFS (Hadoop Distributed File System)

• Designed for scaling in terms of storage and IO bandwidth

– MR framework (MapReduce)

• Designed for scaling in terms of performance

Page 42: Aioug  big data and hadoop


Understanding file structure

A 1 GB file is split into blocks. Each block is typically 64 MB. On disk, each block is stored as two files: one holding the data and a second holding metadata and checksums.

Page 43: Aioug  big data and hadoop


Hadoop Processes

• Processes running on Hadoop

– NameNode

– DataNode

– Secondary NameNode

– Task Tracker

– Job Tracker

Page 44: Aioug  big data and hadoop


NameNode

• Single point of contact

• HDFS master

• Holds meta information

– List of files and directories

– Location of blocks

• Single node per cluster

– Cluster can have thousands of DataNodes and tens of thousands of HDFS clients.

Page 45: Aioug  big data and hadoop


DataNode

• Can execute multiple tasks concurrently

• Holds actual data blocks, checksum and generation stamp

• If a block is half full, it needs only half the space of a full block

• At start-up, connects to NameNode and perform handshake

• No binding to IP address or port, uses Storage ID

• Sends heartbeat to NameNode


Page 46: Aioug  big data and hadoop


Communication

• Every 3 seconds, each DataNode (identified by a Storage ID such as XYZ001, XYZ002, XYZ003) sends a heartbeat ("I am alive") to the NameNode, reporting its total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.

• In its reply, the NameNode can instruct a DataNode to replicate a block to another node, remove a local block replica, send an immediate block report, or shut down.

• If no heartbeat arrives for 10 minutes, the NameNode considers the DataNode out of service.

Page 47: Aioug  big data and hadoop


Page 48: Aioug  big data and hadoop

Coordination in a distributed system

• Coordination: An act that multiple nodes must perform together.

• Examples:

– Group membership

– Locking

– Publisher/Subscriber

– Leader Election

– Synchronization

• Getting node coordination correct is very hard!

Page 49: Aioug  big data and hadoop
Page 50: Aioug  big data and hadoop

Introducing ZooKeeper

"ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers."
- ZooKeeper Wiki

ZooKeeper is much more than a distributed lock server!

Page 51: Aioug  big data and hadoop

What is ZooKeeper?

• An open source, high-performance coordination service for distributed applications.

• Exposes common services through a simple interface:
  – naming
  – configuration management
  – locks & synchronization
  – group services
  … so developers don't have to write them from scratch.

• Build your own services on top of it for specific needs.
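As a small, hedged sketch of what "configuration management plus watches" looks like in practice, the snippet below uses the standard ZooKeeper Java client; the connection string, znode path, and payload are assumptions for illustration.

// A sketch using the ZooKeeper Java client: connect, publish config, watch it.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Session-level watcher just logs connection events here.
    // (A production client would wait for the connected event before issuing calls.)
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000,
        (WatchedEvent e) -> System.out.println("session event: " + e.getState()));

    // Publish a piece of shared configuration as a persistent znode.
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", "replication=3".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read it back and leave a watch that fires when the data changes.
    byte[] data = zk.getData("/app-config",
        (WatchedEvent e) -> System.out.println("config changed: " + e.getType()),
        null);
    System.out.println(new String(data));

    zk.close();
  }
}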

Page 52: Aioug  big data and hadoop


HDFS Distributions

Page 53: Aioug  big data and hadoop


Real Time BI

• Speed, agility, and intelligence are competitive advantages that nearly all organizations seek.

• Existing Traditional Reporting Systems provide information after 24 – 36 hours.

• To support Operational Users and influence what should happen next, the data should be available in real time to know what is happening now.

Page 55: Aioug  big data and hadoop

Hadoop 1 (2006): Hadoop with MapReduce running directly on HDFS across nodes 1..N; largely a batch system, with silo'd clusters that were difficult to integrate.

Hadoop 2 & YARN-based architecture (MR-279: YARN; released October 23, 2013): YARN becomes the data operating system on top of HDFS, so a single cluster of nodes 1..N runs interactive, real-time, and batch workloads, enabling the modern data architecture.

Page 56: Aioug  big data and hadoop


Hadoop 2.0

A multi-use data platform: batch, interactive, real-time, online, streaming, …

HADOOP 2
• Redundant, reliable storage: HDFS
• Efficient cluster resource management & shared services: YARN
• Engines on YARN: standard query processing (Hive), batch (MapReduce), online data processing, interactive (Tez), real-time stream processing, and others

Page 57: Aioug  big data and hadoop


Hadoop 2.0 with YARN

Page 58: Aioug  big data and hadoop


Resource Manager/Node Manager Components

Page 59: Aioug  big data and hadoop


Problems with this approach in Hadoop 1.0

It limits scalability: the JobTracker runs on a single machine and performs several tasks:

1) Resource management
2) Job and task scheduling
3) Monitoring

Although many machines (DataNodes) are available, they are not used for these functions, which limits scalability.

Availability issue: in Hadoop 1.0 the JobTracker is a single point of failure. If the JobTracker fails, all running jobs must restart.

Distinct map slots and reduce slots that cannot be shared between workload types.

Limited ability to run non-MapReduce applications.

Page 60: Aioug  big data and hadoop


YARN Architecture

Resource Manager: arbitrates the division of resources among all the applications in the system. The Resource Manager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications.

Node Manager: a per-machine agent that runs on the worker (slave) nodes. It launches the applications' containers, monitors their resource usage (CPU, memory, disk, network), and reports the same to the Resource Manager.

Application Master: negotiates appropriate resource containers from the scheduler, tracks their status, and monitors progress.

Container: the unit of allocation, incorporating resource elements such as memory, CPU, disk, and network, used to execute a specific task of the application (similar to map/reduce slots in MRv1).

Page 61: Aioug  big data and hadoop


YARN - Execution Sequence

1) A client program submits the application

2) ResourceManager allocates a specified container to start the ApplicationMaster

3) ApplicationMaster, on boot-up, registers with ResourceManager

4) ApplicationMaster negotiates with ResourceManager for appropriate resource containers

5) On successful container allocations, ApplicationMaster contacts NodeManager to launch the container

6) Application code executes within the container, which reports its execution status back to the ApplicationMaster

7) During execution, the client communicates directly with ApplicationMaster or ResourceManager to get status, progress updates etc.

8) Once the application is complete, the ApplicationMaster unregisters with the ResourceManager and shuts down, releasing its own container
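A rough sketch of steps 1 and 2 from the client's point of view, using the YarnClient API, follows; the application name, queue, ApplicationMaster class, launch command, and resource sizes are placeholders rather than anything prescribed by the deck (a real submission would also set up local resources and environment).

// A sketch of submitting an application to the ResourceManager with YarnClient.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitExample {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Step 1: ask the ResourceManager for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");
    ctx.setQueue("default");

    // Step 2: describe the container that will run the ApplicationMaster.
    // com.example.DemoApplicationMaster is a hypothetical class for illustration.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "java -Xmx512m com.example.DemoApplicationMaster"));
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));

    // Submit; the ResourceManager then allocates the AM container.
    yarnClient.submitApplication(ctx);
    yarnClient.stop();
  }
}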

Page 62: Aioug  big data and hadoop


Operational vs. Analytical Databases

Page 63: Aioug  big data and hadoop


A New Technology

Page 64: Aioug  big data and hadoop

No Means Yes!

Page 65: Aioug  big data and hadoop


Use Cases

Page 66: Aioug  big data and hadoop


Brewer's CAP Theorem

Page 67: Aioug  big data and hadoop


Brewer's CAP Theorem

Page 68: Aioug  big data and hadoop


NoSQL Technology Spectrum

Page 69: Aioug  big data and hadoop

BigTable Data Model

Raw data (one row per user/site pair):

Name   Site              Counter
Dick   Ebay              507,018
Dick   Google            690,414
Jane   Google            716,426
Dick   Facebook          723,649
Jane   Facebook          643,261
Jane   ILoveLarry.com    856,767
Dick   MadBillFans.com   675,230

Normalized relational form:

NameId   Name
1        Dick
2        Jane

SiteId   SiteName
1        Ebay
2        Google
3        Facebook
4        ILoveLarry.com
5        MadBillFans.com

NameId   SiteId   Counter
1        1        507,018
1        2        690,414
2        2        716,426
1        3        723,649
2        3        643,261
2        4        856,767
1        5        675,230

BigTable (wide-row) form, one sparse row per user with one column per visited site:

Id   Name   Ebay      Google    Facebook   (other columns)   MadBillFans.com
1    Dick   507,018   690,414   723,649    . . .             675,230

Id   Name   Google    Facebook   (other columns)   ILoveLarry.com
2    Jane   716,426   643,261    . . .             856,767
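The same wide-row idea can be sketched with the HBase Java client (HBase is the open-source BigTable-style store listed later in this deck); the page_counts table, its site column family, and the connection settings are assumptions for illustration, and the table is assumed to already exist.

// A sketch of the wide-row model above using the HBase Java client.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PageCountsExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("page_counts"))) {

      // One row per user; each visited site becomes a column in the family "site".
      Put put = new Put(Bytes.toBytes("dick"));
      put.addColumn(Bytes.toBytes("site"), Bytes.toBytes("Ebay"), Bytes.toBytes(507018L));
      put.addColumn(Bytes.toBytes("site"), Bytes.toBytes("Google"), Bytes.toBytes(690414L));
      table.put(put);

      // Read back a single cell from the sparse, column-oriented row.
      Result row = table.get(new Get(Bytes.toBytes("dick")));
      long ebay = Bytes.toLong(row.getValue(Bytes.toBytes("site"), Bytes.toBytes("Ebay")));
      System.out.println("Ebay counter = " + ebay);
    }
  }
}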

Page 70: Aioug  big data and hadoop

Document databases

• Structured documents – XML and JSON (JavaScript Object Notation) – become more prevalent within applications
• Web programmers start storing these in BLOBs in MySQL
• Emergence of XML and JSON databases
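As a hedged sketch of the document model, the snippet below stores and queries a JSON-style document with the MongoDB Java (sync) driver; the connection string, database, collection, and field names are placeholders.

// A sketch of storing and querying a JSON-style document with the MongoDB Java driver.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class DocumentStoreExample {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> orders =
          client.getDatabase("shop").getCollection("orders");

      // The whole nested structure is stored as one document; no BLOB column is needed.
      orders.insertOne(new Document("orderId", 1001)
          .append("customer", new Document("name", "Jane").append("country", "Brazil"))
          .append("items", java.util.Arrays.asList("book", "pen")));

      // Query by a nested field using a dotted path.
      Document found = orders.find(eq("customer.name", "Jane")).first();
      System.out.println(found != null ? found.toJson() : "not found");
    }
  }
}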

Page 71: Aioug  big data and hadoop

Graph databases: Neo4j, InfiniteGraph, FlockDB

Document databases
• JSON based: MongoDB, CouchDB, RethinkDB
• XML based: MarkLogic, Berkeley DB XML

Key-value stores: MemcacheDB, Oracle NoSQL, Dynamo, Voldemort, DynamoDB, Riak

Table based: BigTable, Cassandra, HBase, Hypertable, Accumulo

Page 72: Aioug  big data and hadoop


Multiple Data Stores

Relational – Run the Business
• Scale-out and scale-up
• Collect any data
• SQL
• Transactional and analytic applications for the enterprise
• Secure and highly available

Hadoop – Change the Business
• Scale-out, low-cost store
• Collect any data
• MapReduce, SQL
• Analytic applications

NoSQL – Scale the Business
• Scale-out, low-cost store
• Collect key-value data
• Find data by key
• Web applications

Page 73: Aioug  big data and hadoop


Data Analytics Challenge

Separate silos of information to analyze

Page 74: Aioug  big data and hadoop


Data Analytics Challenge

Separate data access interfaces

Page 75: Aioug  big data and hadoop


SQL on Hadoop is Obvious

Stinger

Page 76: Aioug  big data and hadoop


Data Analytics Challenge

No comprehensive SQL interface across Oracle, Hadoop and NoSQL

Page 77: Aioug  big data and hadoop


Oracle Big Data Management System

Rich, comprehensive SQL access to all enterprise data

NoSQL

Page 78: Aioug  big data and hadoop


What Does Unified Query Mean for You?

[Diagram: Data science – Before: a PhD; After: anyone.]

Page 79: Aioug  big data and hadoop


What Does Unified Query Mean for You?

[Diagram: Application development – Before vs. After unified query.]

Page 80: Aioug  big data and hadoop


A New Hadoop Processing Engine

• Storage layer: filesystem (HDFS) and NoSQL databases (Oracle NoSQL DB, HBase)
• Resource management: YARN
• Processing layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL
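As a rough illustration of the in-memory processing layer, here is a sketch of the same word count using Spark's Java API (assuming it is submitted with spark-submit, which supplies the master); the HDFS input and output paths are placeholders, not values from the deck.

// A sketch of word count on Spark's Java API instead of batch MapReduce.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // The master URL is supplied by spark-submit (e.g. yarn).
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);   // aggregation happens in memory

    counts.saveAsTextFile("hdfs:///user/demo/output");
    sc.stop();
  }
}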

Page 81: Aioug  big data and hadoop


Big Data SQL

SELECT w.sess_id, c.name FROM web_logs w, customers c WHERE w.source_country = 'Brazil' AND w.cust_id = c.customer_id;

• The relevant part of the SQL runs on the BDA (Hadoop cluster) nodes, which hold tens of gigabytes of data (WEB_LOGS).
• Only the columns and rows needed to answer the query are returned to the Oracle Database (CUSTOMERS).

Page 82: Aioug  big data and hadoop


Big Data SQL

SQL Push Down in Big Data SQL (same query and data flow as the previous slide)

• Hadoop scans on unstructured data
• WHERE clause evaluation
• Column projection
• Bloom filters for better join performance
• JSON parsing, data mining model evaluation

Page 83: Aioug  big data and hadoop


Query All Data without Application Change or Data Conversion

Oracle Big Data SQL

Page 84: Aioug  big data and hadoop

High-Level Architecture: Ingest, Store, Process, Analyze, Visualize

Page 86: Aioug  big data and hadoop


BDD Value Proposition

Note: company logos and images are for illustration purposes only. Not a real use case for the company.

Page 87: Aioug  big data and hadoop


Oracle BDD - Technical Innovation on Hadoop

Oracle Big Data Discovery workloads run directly on the Hadoop cluster (BDA or commodity hardware), on Hadoop 2.x (HDFS filesystem, YARN workload management, HCatalog metadata), alongside other Hadoop workloads (MapReduce, Spark, Hive, Pig, and Oracle Big Data SQL on BDA only). A BDD node sits alongside the name node and data nodes.

Data processing, workflow & monitoring
• Profiling: catalog entry creation, data type & language detection, schema configuration
• Sampling: dgraph (index) file creation
• Transforms: >100 functions
• Enrichments: location (geo), text (cleanup, sentiment, entity, key-phrase, whitelist tagging)

Self-service provisioning & data transfer
• Personal data: upload CSV and XLS to HDFS

In-memory discovery indexes
• DGraph: search, guided navigation, analytics

Studio
• Web UI: find, explore, transform, discover, share

Page 88: Aioug  big data and hadoop


Sample Enterprise Big Data Architecture

• Operational RDBMS (Oracle, SQL Server, …)
• In-memory analytics (HANA, Exalytics, …)
• In-memory processing (Spark)
• Hadoop
• Web DBMS (MySQL, Mongo, Cassandra)
• ERP & in-house CRM
• Analytic/BI software (SAS, Tableau, …)
• Web server
• Data warehouse RDBMS (Oracle, Teradata, …)

Page 89: Aioug  big data and hadoop


How to transition into a Cloud Consultant

Cloud Consultant = Core Skills (50%) + Cloud Knowledge (20%) + Tools & Integration (20%) + Automation (10%)

Page 90: Aioug  big data and hadoop


Page 91: Aioug  big data and hadoop

Thank You! [email protected]

@pasalapudi

https://community.oracle.com/groups/aioug-social-group

Page 92: Aioug  big data and hadoop


www.ora-search.com