Wei Zheng
Senior Director, Product Management
Informatica
Unleash the Power of Big Data with Informatica for Hadoop
Agenda
Big Data Overview
What is Hadoop?
Informatica for Hadoop
Getting Data In and Out
Parsing and Preparing Data
Profiling and Discovering Data
Transforming and Cleansing Data
Orchestrating and Monitoring Hadoop
Roadmap
Big Data Overview
What’s happening?
Explosive Growth of Data – Volume, Variety, Velocity
[Chart: business value vs. latency, from years down to sub-second, against exploding data volume across the three Vs – Volume, Velocity, Variety. Source: IDC]
Confluence of Big Transaction, Big Interaction & Big Data Processing
Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP) & DW Appliances
Social Media Data
Device/Sensor Data
Scientific, genomic
Machine/Device
BIG TRANSACTION DATA BIG INTERACTION DATA
BIG DATA PROCESSING
Call detail records, image, click stream data
BIG DATA INTEGRATION
Cloud
Salesforce.com
Concur
Google App Engine
Amazon
…
What is Hadoop?
Distribution Example: Cloudera (CDH 3.0)
Hadoop Distributed File System (HDFS) – File Sharing & Data Protection Across Physical Servers
MapReduce – Distributed Computing Across Physical Servers
Hadoop is a big data platform for data storage and processing that is…
Scalable
Fault tolerant
Open source
CORE HADOOP COMPONENTS
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow / Scheduling: Apache Oozie
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: Hue
• SDK: Hue SDK

1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Shall Move to Data
4. Simple Core, Modular and Extensible
Hadoop Design Axioms
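The MapReduce model described above (distributed computing across physical servers) can be illustrated with a minimal single-process sketch. This is plain Python simulating the map, shuffle, and reduce phases on a word count, not Hadoop API code; all function names are our own:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big platform", "data storage and processing"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

Hadoop runs many mapper and reducer instances of exactly this shape in parallel, moving the compute to where the data blocks live (axiom 3 above).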
Hadoop Distributions
What can Hadoop Help You With?
Improve
Decisions
Modernize
Business
Improve
Efficiency
& Reduce
Costs
Mergers
Acquisitions
&
Divestitures
Acquire &
Retain
Customers
Outsource
Non-core
Functions
Governance
Risk
Compliance
Increase
Partner
Network
Efficiency
Increase
Business
Agility
Increase Value of Big Data: Relevant, Actionable, Timely, Holistic, Trustworthy, Accessible, Authoritative, Secure
Lower Cost of Big Data: Business Costs, Labor Costs, Software Costs, Hardware Costs, Storage Costs
On-Premise, Transactions, Desktops, Mobile, Cloud, Interactions
Predictive Analytics (Recommendations, Outcomes, MRO)
Customer Analytics (Customer Sentiment & Satisfaction)
Pattern Recognition (Fraud Detection)
Risk & Portfolio Analysis
Optimization (Pricing, Supply Chain)
Informatica for Hadoop
Unleash the Power of Hadoop With Informatica
9.5.1
Available Now
Sales & Marketing
Data mart
Customer Service
Portal
Product & Service Offerings, Customer Profile, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
1. Ingest Data into Hadoop
2. Discover Hadoop data for anomalies, relationships and domain types
3. Parse & Prepare Data in Hadoop (MapReduce)
4. Transform & Cleanse/Standardize Data in Hadoop (MapReduce)
5. Invoke Custom Business Analytics on Hadoop
6. Extract Data from Hadoop

Profile Data
Orchestrate Workflows (Hadoop or non-Hadoop jobs/processes)
Monitor & Manage (Hadoop or non-Hadoop jobs/processes)
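The six numbered steps can be sketched end to end as a simple pipeline. The functions below are illustrative stand-ins for the Informatica/Hadoop stages (all names invented for the sketch), not product APIs:

```python
def ingest(sources):
    """1. Ingest data into Hadoop: flatten rows from several sources."""
    return [row for src in sources for row in src]

def discover(rows):
    """2. Profile/discover: basic anomaly stats before processing."""
    return {"row_count": len(rows), "nulls": sum(r is None for r in rows)}

def parse(rows):
    """3. Parse & prepare: split delimited records, drop empty ones."""
    return [r.split(",") for r in rows if r]

def transform(records):
    """4. Transform & cleanse/standardize: trim and normalize case."""
    return [[field.strip().upper() for field in rec] for rec in records]

def extract(records):
    """6. Extract data from Hadoop for downstream consumers."""
    return records

sources = [["acct1, open", None], ["acct2, closed"]]
rows = ingest(sources)
stats = discover(rows)                       # step 2 flags the NULL row
result = extract(transform(parse(rows)))     # steps 3, 4, 6 in sequence
```

Step 5 (custom business analytics) would plug in between transform and extract; orchestration and monitoring wrap the whole chain.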
• Repeatability
• Predictable, repeatable deployments and methodology
• Isolation from rapid Hadoop changes
• Frequent new versions and projects
• Avoiding placing bets on the wrong technology
• Reuse of existing assets
• Apply existing integration logic to load data to/from Hadoop
• Reuse existing data quality rules to validate Hadoop data
• Reuse of existing skills
• Enable ETL developers to leverage the power of Hadoop
• Governance
• Enforce and validate data security, data quality and
regulatory policies
Why Informatica? What are the Benefits?
Get Data Into and Out of Hadoop
PowerExchange for Hadoop
hStream with MapR
Data Archiving for Hadoop
Replication for Hadoop
Data Ingestion and Extraction
Moving tens of terabytes per hour of transaction, interaction
and streaming data
Data
Warehouse
MDM
Applications
Transactions,
OLTP, OLAP
Social Media,
Web Logs
Documents,
Industry
Standards
Machine Device,
Scientific
Replicate
Stream
Batch Load
Extract
Archive Extract
Low Cost Store
Unleash the Power of Hadoop With High Performance Universal Data Access
WebSphere MQ JMS MSMQ SAP NetWeaver XI
JD Edwards Lotus Notes Oracle E-Business PeopleSoft
Oracle DB2 UDB DB2/400 SQL Server Sybase
ADABAS Datacom DB2 IDMS IMS
Word, Excel PDF StarOffice WordPerfect Email (POP, IMAP) HTTP
Informix Teradata Netezza ODBC JDBC
VSAM C-ISAM Binary Flat Files Tape Formats…
Web Services TIBCO webMethods
SAP NetWeaver SAP NetWeaver BI SAS Siebel
Messaging, and Web Services
Relational and Flat Files
Mainframe and Midrange
Unstructured Data and Files
Flat files ASCII reports HTML RPG ANSI LDAP
EDI–X12
EDI-Fact
RosettaNet
HL7
HIPAA
ebXML
HL7 v3.0
ACORD (AL3, XML)
XML
LegalXML
IFX
cXML
AST
FIX
Cargo IMP
MVR
Salesforce CRM
Force.com
RightNow
NetSuite
ADP Hewitt SAP By Design Oracle OnDemand
Packaged Applications
Industry Standards
XML Standards
SaaS/BPO
Social Media
Facebook Twitter LinkedIn
EMC/Greenplum Vertica AsterData
MPP Appliances
PowerExchange for Hadoop
HDFS and Hive Adapters
• Support pushdown of source and target connections to ensure maximum performance and scale
• Native HDFS and Hive Source/Target Support
• Integrated development environment with metadata and preview support
• Perform any pre-processing needed before ingestion
hStream with MapR – Continuous Ingestion
Transactions,
OLTP, OLAP
Social Media,
Web Logs
Documents,
Industry
Standards
Machine Device,
Scientific
Informatica Ultra Messaging
Streaming Data Continuously
Network File System (NFS)
Informatica Data Archive
Archiving to Hadoop
Production
Data
Optimized File Archive
Stored on Hadoop File System
• Archive data to optimized file format for storage reduction
• Compressed (up to 90%)
• Immutable
• Accessible (SQL, ODBC, JDBC)
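The bullets above say the optimized file archive stays queryable through standard SQL over ODBC/JDBC even though the files themselves are immutable. A hedged sketch of that access pattern, using Python's sqlite3 purely as a stand-in driver (the real Informatica archive driver and its table names are not shown here):

```python
import sqlite3

# Stand-in for connecting to the archive via an ODBC/JDBC-style driver.
# Table and column names are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archived_orders (order_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO archived_orders VALUES (?, ?)",
    [(1, "CLOSED"), (2, "CLOSED"), (3, "OPEN")],
)

# Consumers only ever read: the archive itself is immutable, but ordinary
# SQL (filters, aggregates) still works against it.
closed = conn.execute(
    "SELECT count(*) FROM archived_orders WHERE status = 'CLOSED'"
).fetchone()[0]
```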
Informatica Data Archive
Archiving from Hadoop
File Archive
Parse and Prepare Data on Hadoop
hParser
Informatica HParser
Tackling Diversity of Big Data
Device/sensor
scientific
Flat Files & Documents, Interaction data, Industry Standards, XML
The broadest coverage for Big Data
^/>Delimited<\^
Positional
Name = Value
social
Device/sensor
scientific
Productivity
• Visual parsing environment
• Predefined translations
Any DI/BI architecture
PIG EDW
MDM
Parse and Prepare Data on Hadoop
How does it work?

hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt

1. Define parser in HParser visual studio
2. Deploy the parser on Hadoop Distributed File System (HDFS)
3. Run HParser to extract data and produce tabular format in Hadoop
SWIFT MT
SWIFT MX
NACHA
FIX
Telekurs
FpML
BAI – V2.0\Lockbox
CREST DEX
IFX
TWIST
UNIFI (ISO 20022)
SEPA
FIXML
MISMO
B2B Standards
UN\EDIFACT
EDI-X12
EDI ARR
EDI UCS+WINS
EDI VICS
RosettaNet
OAGI
Financial
Healthcare
HL7
HL7 V3
HIPAA
NCPDP
CDISC
Insurance
DTCC-NSCC
ACORD-AL3
ACORD XML
IATA-PADIS
PLMXML
NIEM
Other
Easy example-based visual enhancements and edits
Enhanced Validations
Informatica HParser
Productivity: Data Transformation Studio
Out of the box transformations for all messages in all versions
Updates and new versions delivered from Informatica
Why Hadoop?
• Extremely large data sets
• Often information is split across multiple files
• Not sure what we are looking for
An hParser Example
Proprietary web logs
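As an illustration of the parsing problem, here is a hedged sketch of turning one proprietary web-log line into tabular fields with a regular expression. The log format, field names, and regex are invented for the example; a real HParser script would be built in the visual studio instead:

```python
import re

# Hypothetical proprietary log layout: ip [timestamp] "METHOD path" status
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d+)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line is malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

row = parse_log_line('10.0.0.1 [01/Jan/2012:10:00:00] "GET /index.html" 200')
```

On Hadoop, the same per-line extraction runs inside mappers, so a file split across many blocks is parsed in parallel.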
Profiling and Discovering Data
Informatica Profiling for Hadoop
Discovery of Hadoop Issues/Anomalies
Repository
Informatica
MapReduce
Hadoop
Create/Run profile to discover Hadoop data attributes
Profile auto-converted to Hadoop queries/code (Hive, MapReduce, etc.)
Executed natively on Hadoop
Import metadata via native connectivity to Hadoop (Hive, HDFS, Hbase, etc.)
Review and share results via browser or Eclipse clients
• Single table/data object
• Cross table/data object
• Data Domain Discovery
HIVE
HDFS
HBase
beta
Hadoop Data Profiling Results
CUSTOMER_ID example
COUNTRY CODE example
1. Profiling Stats: Min/Max Values, NULLs, Inferred Data Types, etc.
2. Value & Pattern Analysis of Hadoop Data
3. Drilldown Analysis (into Hadoop Data)
ZIP CODE example
Drill down into actual data values to inspect results across the entire data set, including potential duplicates
Value and Pattern Frequency to isolate inconsistent/dirty data or unexpected patterns
Hadoop Data Profiling results – exposed to anyone in the enterprise via browser
Stats to identify outliers and anomalies in data
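The stats above (min/max, NULL counts, value and pattern frequency) can be sketched in a few lines of Python. This runs single-process for illustration, whereas the product pushes the same computations down to Hive/MapReduce; function names and the digit/letter pattern notation are our own:

```python
from collections import Counter

def char_pattern(value):
    """Abstract a value into a pattern: digits -> 9, letters -> X."""
    return "".join(
        "9" if c.isdigit() else "X" if c.isalpha() else c for c in value
    )

def profile_column(values):
    """Basic column profile: NULL count, min/max, pattern frequency."""
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "patterns": Counter(char_pattern(v) for v in present),
    }

# A ZIP-code-like column: the '9999X' pattern immediately exposes the
# dirty value, as in the ZIP CODE example above.
stats = profile_column(["94102", "10013", None, "9410A"])
```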
beta
Hadoop Data Domain Discovery
Finding functional meaning of Hadoop Data
1. Leverage INFA rules/mapplets to identify functional meaning of Hadoop data
• Sensitive data (e.g. SSN, Credit Card number, etc.)
• Liability and Compliance risk?
PHI: Protected Health Information
PII: Personally Identifiable Information
Scalable to look for/discover ANY domain type
2. View/share report of data domains/sensitive data contained in Hadoop. Ability to drill down to see suspect data values.
beta
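Rule-based domain discovery can be sketched as matching column values against per-domain patterns and flagging domains that exceed a match threshold. The two regexes below are simplified illustrations, not Informatica's shipped rules/mapplets:

```python
import re

# Simplified illustrative rules for two sensitive-data domains.
DOMAIN_RULES = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^\d{4}(-?\d{4}){3}$"),
}

def discover_domains(values, threshold=0.8):
    """Return domains whose rule matches at least `threshold` of values."""
    hits = []
    for name, rule in DOMAIN_RULES.items():
        matched = sum(bool(rule.match(v)) for v in values)
        if values and matched / len(values) >= threshold:
            hits.append(name)
    return hits

domains = discover_domains(["123-45-6789", "987-65-4321"])
```

Adding a new domain type is just another entry in the rule table, which is the sense in which the approach scales to "ANY domain type".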
Transforming and Cleansing Data
PowerCenter for Hadoop
Informatica Data Quality for Hadoop
Data Integration and Data Quality
Hadoop MapReduce Processing
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Hive HQL
Informatica Developer
1. Informatica mapping translated to optimized Hive HQL
2. HQL invokes custom UDF within Informatica
DTM for certain specialized data transformations
3. Optimized HQL translated to MapReduce
4. MapReduce and UDF executed on Hadoop
Data Nodes
UDF MapReduce
Informatica
Data Transformation Library
beta
Import existing PC artifacts into Hadoop development environment
Validate import logic before the actual import process to ensure compatibility
beta
Reuse and Import PC Metadata for Hadoop
Design integration and quality logic for Hadoop in a graphical and metadata driven environment
Configure where the integration logic should run – Hadoop or Native
beta
Design Mappings as Usual…
View complete generated and pushed down Hive or MR code from Hadoop mappings
beta
View Generated HiveQL
Orchestrating and Monitoring Hadoop
Informatica Workflow & Administration for Hadoop
Mixed Workflow Orchestration
One workflow running tasks on Hadoop and local environments
beta
Monitoring – Hive Query Plan Details
beta
The same Hive query is available in the Developer tool.
Monitoring – Hive Query Drilldown to M/R
beta
Traceability to individual M/R jobs. Link to Job Tracker URLs.
View Hive Query Details
Summary of Job Tracker status
Product Roadmap

Available Now:
• PowerExchange for Hadoop (HDFS and PC)
• HParser (including JSON Parsing)

1H, 2012:
• Hadoop Beta (9.5 Release)
• Native HDFS and Hive connectivity
• Integrated parsing on Hadoop
• Data Integration & Data Quality push down execution on Hadoop
• Data Discovery on Hadoop
• Mixed workload orchestration and administration

2H, 2012:
• Hadoop GA (9.5.1 Release)
• Native HDFS and Hive connectivity
• Integrated parsing on Hadoop
• Data Integration & Data Quality push down execution on Hadoop
• Data Discovery on Hadoop
• Mixed workload orchestration and administration

1H, 2013:
• Support for parallel processing of large file parsing
• Support for parsing of archived files
• Managed file transfer
• Metadata Manager & Lineage Integration
• Translation to PIG support
• Profiling API on Hadoop (call from Java or M/R)
• Persistence of profiling stats on Hadoop
• Additional DI & DQ transformations running on Hadoop
• Hadoop Planned Release
• Beta/Early Access: August – Oct, 2012
• GA: 9.5.1 Release, December 2012
• PowerCenter Big Data Edition – Q3 2012 (Tentative)
• PowerCenter Standard Edition
• Enterprise Grid Option for PowerCenter
• PowerExchange for Hadoop
• PowerExchange for Social Media
• PowerExchange for Data Warehouse Appliance
• hParser
• PowerCenter on Hadoop (Available Dec 2012)
When Is It Available?