Wei Zheng
Senior Director, Product Management
Informatica
Unleash the Power of Big Data with Informatica for Hadoop
Agenda
Big Data Overview
What is Hadoop?
Informatica for Hadoop
Getting Data In and Out
Parsing and Preparing Data
Profiling and Discovering Data
Transforming and Cleansing Data
Orchestrating and Monitoring Hadoop
Roadmap
Big Data Overview
What’s happening?
Explosive Growth of Data – Volume, Variety, Velocity
[Chart: business value vs. latency, from years down to sub-second, against exploding data volume across the three Vs – Volume, Velocity, Variety. Source: IDC]
Confluence of Big Transaction, Big Interaction & Big Data Processing
Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP) & DW Appliances
Social Media Data
Device/Sensor Data
Scientific, genomic
Machine/Device
BIG TRANSACTION DATA BIG INTERACTION DATA
BIG DATA PROCESSING
Call detail records, image, click stream data
BIG DATA INTEGRATION
Cloud
Salesforce.com
Concur
Google App Engine
Amazon
…
What is Hadoop?
Distribution Example: Cloudera (CDH 3.0)
Hadoop Distributed File System (HDFS) – File Sharing & Data Protection Across Physical Servers
MapReduce – Distributed Computing Across Physical Servers
Hadoop is a big data platform for data storage and processing that is…
Scalable
Fault tolerant
Open source
CORE HADOOP COMPONENTS
• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow / Scheduling: Apache Oozie
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: Hue
• SDK: Hue SDK

1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Shall Move to Data
4. Simple Core, Modular and Extensible
Hadoop Design Axioms
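The MapReduce model described above (distributed computing across physical servers) can be illustrated with a minimal single-process sketch. This is plain Python simulating the map, shuffle, and reduce phases on a word count, not Hadoop API code; all function names are our own:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big platform", "data storage and processing"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

Hadoop runs many mapper and reducer instances of exactly this shape in parallel, moving the compute to where the data blocks live (axiom 3 above).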
Hadoop Distributions
What can Hadoop Help You With?
Improve
Decisions
Modernize
Business
Improve
Efficiency
& Reduce
Costs
Mergers
Acquisitions
&
Divestitures
Acquire &
Retain
Customers
Outsource
Non-core
Functions
Governance
Risk
Compliance
Increase
Partner
Network
Efficiency
Increase
Business
Agility
Increase Value of Big Data: Relevant, Actionable, Timely, Holistic, Trustworthy, Accessible, Authoritative, Secure
Lower Cost of Big Data: Business Costs, Labor Costs, Software Costs, Hardware Costs, Storage Costs
On-Premise, Transactions, Desktops, Mobile, Cloud, Interactions
Predictive Analytics (Recommendations, Outcomes, MRO)
Customer Analytics (Customer Sentiment & Satisfaction)
Pattern Recognition (Fraud Detection)
Risk & Portfolio Analysis
Optimization (Pricing, Supply Chain)
Informatica for Hadoop
Unleash the Power of Hadoop With Informatica
9.5.1
Available Now
Sales & Marketing
Data mart
Customer Service
Portal
Product & Service Offerings, Customer Profile, Social Media, Account Transactions, Customer Service Logs & Surveys, Marketing Campaigns
1. Ingest Data into Hadoop
2. Discover Hadoop data for anomalies, relationships and domain types
3. Parse & Prepare Data in Hadoop (MapReduce)
4. Transform & Cleanse/Standardize Data in Hadoop (MapReduce)
5. Invoke Custom Business Analytics on Hadoop
6. Extract Data from Hadoop

Profile Data
Orchestrate Workflows (Hadoop or non-Hadoop jobs/processes)
Monitor & Manage (Hadoop or non-Hadoop jobs/processes)
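The six numbered steps can be sketched end to end as a simple pipeline. The functions below are illustrative stand-ins for the Informatica/Hadoop stages (all names invented for the sketch), not product APIs:

```python
def ingest(sources):
    """1. Ingest data into Hadoop: flatten rows from several sources."""
    return [row for src in sources for row in src]

def discover(rows):
    """2. Profile/discover: basic anomaly stats before processing."""
    return {"row_count": len(rows), "nulls": sum(r is None for r in rows)}

def parse(rows):
    """3. Parse & prepare: split delimited records, drop empty ones."""
    return [r.split(",") for r in rows if r]

def transform(records):
    """4. Transform & cleanse/standardize: trim and normalize case."""
    return [[field.strip().upper() for field in rec] for rec in records]

def extract(records):
    """6. Extract data from Hadoop for downstream consumers."""
    return records

sources = [["acct1, open", None], ["acct2, closed"]]
rows = ingest(sources)
stats = discover(rows)                       # step 2 flags the NULL row
result = extract(transform(parse(rows)))     # steps 3, 4, 6 in sequence
```

Step 5 (custom business analytics) would plug in between transform and extract; orchestration and monitoring wrap the whole chain.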
• Repeatability
• Predictable, repeatable deployments and methodology
• Isolation from rapid Hadoop changes
• Frequent new versions and projects
• Avoiding placing bets on the wrong technology
• Reuse of existing assets
• Apply existing integration logic to load data to/from Hadoop
• Reuse existing data quality rules to validate Hadoop data
• Reuse of existing skills
• Enable ETL developers to leverage the power of Hadoop
• Governance
• Enforce and validate data security, data quality and
regulatory policies
Why Informatica? What are the Benefits?
Get Data Into and Out of Hadoop
PowerExchange for Hadoop
hStream with MapR
Data Archiving for Hadoop
Replication for Hadoop
Data Ingestion and Extraction
Moving tens of terabytes per hour of transaction, interaction
and streaming data
Data
Warehouse
MDM
Applications
Transactions,
OLTP, OLAP
Social Media,
Web Logs
Documents,
Industry
Standards
Machine Device,
Scientific
Replicate
Stream
Batch Load
Extract
Archive Extract
Low Cost Store
Unleash the Power of Hadoop With High Performance Universal Data Access
WebSphere MQ JMS MSMQ SAP NetWeaver XI
JD Edwards Lotus Notes Oracle E-Business PeopleSoft
Oracle DB2 UDB DB2/400 SQL Server Sybase
ADABAS Datacom DB2 IDMS IMS
Word, Excel PDF StarOffice WordPerfect Email (POP, IMAP) HTTP
Informix Teradata Netezza ODBC JDBC
VSAM C-ISAM Binary Flat Files Tape Formats…
Web Services TIBCO webMethods
SAP NetWeaver SAP NetWeaver BI SAS Siebel
Messaging, and Web Services
Relational and Flat Files
Mainframe and Midrange
Unstructured Data and Files
Flat files ASCII reports HTML RPG ANSI LDAP
EDI–X12
EDI-Fact
RosettaNet
HL7
HIPAA
ebXML
HL7 v3.0
ACORD (AL3, XML)
XML
LegalXML
IFX
cXML
AST
FIX
Cargo IMP
MVR
Salesforce CRM
Force.com
RightNow
NetSuite
ADP Hewitt SAP By Design Oracle OnDemand
Packaged Applications
Industry Standards
XML Standards
SaaS/BPO
Social Media
Facebook Twitter LinkedIn
EMC/Greenplum Vertica AsterData
MPP Appliances
PowerExchange for Hadoop
HDFS and Hive Adapters
• Support pushdown of source and target connections to ensure maximum performance and scale
• Native HDFS and Hive Source/Target Support
• Integrated development environment with metadata and preview support
• Perform any pre-processing needed before ingestion
hStream with MapR – Continuous Ingestion
Transactions,
OLTP, OLAP
Social Media,
Web Logs
Documents,
Industry
Standards
Machine Device,
Scientific
Informatica Ultra Messaging
Streaming Data Continuously
Network File System (NFS)
Informatica Data Archive
Archiving to Hadoop
Production
Data
Optimized File Archive
Stored on Hadoop File System
• Archive data to optimized file format for storage reduction
• Compressed (up to 90%)
• Immutable
• Accessible (SQL, ODBC, JDBC)
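The bullets above say the optimized file archive stays queryable through standard SQL over ODBC/JDBC even though the files themselves are immutable. A hedged sketch of that access pattern, using Python's sqlite3 purely as a stand-in driver (the real Informatica archive driver and its table names are not shown here):

```python
import sqlite3

# Stand-in for connecting to the archive via an ODBC/JDBC-style driver.
# Table and column names are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archived_orders (order_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO archived_orders VALUES (?, ?)",
    [(1, "CLOSED"), (2, "CLOSED"), (3, "OPEN")],
)

# Consumers only ever read: the archive itself is immutable, but ordinary
# SQL (filters, aggregates) still works against it.
closed = conn.execute(
    "SELECT count(*) FROM archived_orders WHERE status = 'CLOSED'"
).fetchone()[0]
```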
Informatica Data Archive
Archiving from Hadoop
File Archive
Parse and Prepare Data on Hadoop
hParser
Informatica HParser
Tackling Diversity of Big Data
Device/sensor
scientific
Flat Files & Documents, Interaction data, Industry Standards, XML
The broadest coverage for Big Data
^/>Delimited<\^
Positional
Name = Value
social
Device/sensor
scientific
Productivity
• Visual parsing environment
• Predefined translations
Any DI/BI architecture
PIG EDW
MDM
Parse and Prepare Data on Hadoop
How does it work?

hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt

1. Define parser in HParser visual studio
2. Deploy the parser on Hadoop Distributed File System (HDFS)
3. Run HParser to extract data and produce tabular format in Hadoop
SWIFT MT
SWIFT MX
NACHA
FIX
Telekurs
FpML
BAI – V2.0\Lockbox
CREST DEX
IFX
TWIST
UNIFI (ISO 20022)
SEPA
FIXML
MISMO
B2B Standards
UN\EDIFACT
EDI-X12
EDI ARR
EDI UCS+WINS
EDI VICS
RosettaNet
OAGI
Financial
Healthcare
HL7
HL7 V3
HIPAA
NCPDP
CDISC
Insurance
DTCC-NSCC
ACORD-AL3
ACORD XML
IATA-PADIS
PLMXML
NIEM
Other
Easy example-based visual enhancements and edits
Enhanced Validations
Informatica HParser
Productivity: Data Transformation Studio
Out of the box transformations for all messages in all versions
Updates and new versions delivered from Informatica
Why Hadoop?
• Extremely large data sets
• Often information is split across multiple files
• Not sure what we are looking for
An hParser Example
Proprietary web logs
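As an illustration of the parsing problem, here is a hedged sketch of turning one proprietary web-log line into tabular fields with a regular expression. The log format, field names, and regex are invented for the example; a real HParser script would be built in the visual studio instead:

```python
import re

# Hypothetical proprietary log layout: ip [timestamp] "METHOD path" status
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d+)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line is malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

row = parse_log_line('10.0.0.1 [01/Jan/2012:10:00:00] "GET /index.html" 200')
```

On Hadoop, the same per-line extraction runs inside mappers, so a file split across many blocks is parsed in parallel.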
Profiling and Discovering Data
Informatica Profiling for Hadoop
Discovery of Hadoop Issues/Anomalies
Repository
Informatica
MapReduce
Hadoop
Create/Run profile to discover Hadoop data attributes
Profile auto-converted to Hadoop queries/code (Hive, MapReduce, etc.)
Executed natively on Hadoop
Import metadata via native connectivity to Hadoop (Hive, HDFS, Hbase, etc.)
Review and share results via browser or Eclipse clients
• Single table/data object
• Cross table/data object
• Data Domain Discovery
HIVE
HDFS
HBase
beta
Hadoop Data Profiling Results
CUSTOMER_ID example
COUNTRY CODE example
1. Profiling Stats: Min/Max Values, NULLs, Inferred Data Types, etc.
2. Value & Pattern Analysis of Hadoop Data
3. Drilldown Analysis (into Hadoop Data)
ZIP CODE example
Drill down into actual data values to inspect results across the entire data set, including potential duplicates
Value and Pattern Frequency to isolate inconsistent/dirty data or unexpected patterns
Hadoop Data Profiling results – exposed to anyone in the enterprise via browser
Stats to identify outliers and anomalies in data
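The stats above (min/max, NULL counts, value and pattern frequency) can be sketched in a few lines of Python. This runs single-process for illustration, whereas the product pushes the same computations down to Hive/MapReduce; function names and the digit/letter pattern notation are our own:

```python
from collections import Counter

def char_pattern(value):
    """Abstract a value into a pattern: digits -> 9, letters -> X."""
    return "".join(
        "9" if c.isdigit() else "X" if c.isalpha() else c for c in value
    )

def profile_column(values):
    """Basic column profile: NULL count, min/max, pattern frequency."""
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "patterns": Counter(char_pattern(v) for v in present),
    }

# A ZIP-code-like column: the '9999X' pattern immediately exposes the
# dirty value, as in the ZIP CODE example above.
stats = profile_column(["94102", "10013", None, "9410A"])
```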
beta
Hadoop Data Domain Discovery
Finding functional meaning of Hadoop Data
1. Leverage INFA rules/mapplets to identify functional meaning of Hadoop data
• Sensitive data (e.g. SSN, Credit Card number, etc.)
• Liability and Compliance risk?
PHI: Protected Health Information
PII: Personally Identifiable Information
Scalable to look for/discover ANY domain type
2. View/share report of data domains/sensitive data contained in Hadoop. Ability to drill down to see suspect data values.
beta
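Rule-based domain discovery can be sketched as matching column values against per-domain patterns and flagging domains that exceed a match threshold. The two regexes below are simplified illustrations, not Informatica's shipped rules/mapplets:

```python
import re

# Simplified illustrative rules for two sensitive-data domains.
DOMAIN_RULES = {
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^\d{4}(-?\d{4}){3}$"),
}

def discover_domains(values, threshold=0.8):
    """Return domains whose rule matches at least `threshold` of values."""
    hits = []
    for name, rule in DOMAIN_RULES.items():
        matched = sum(bool(rule.match(v)) for v in values)
        if values and matched / len(values) >= threshold:
            hits.append(name)
    return hits

domains = discover_domains(["123-45-6789", "987-65-4321"])
```

Adding a new domain type is just another entry in the rule table, which is the sense in which the approach scales to "ANY domain type".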
Transforming and Cleansing Data
PowerCenter for Hadoop
Informatica Data Quality for Hadoop
Data Integration and Data Quality
Hadoop MapReduce Processing
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Hive HQL
Informatica Developer
1. Informatica mapping translated to optimized Hive HQL
2. HQL invokes custom UDF within Informatica
DTM for certain specialized data transformations
3. Optimized HQL translated to MapReduce
4. MapReduce and UDF executed on Hadoop
Data Nodes
UDF MapReduce
Informatica
Data Transformation Library
beta
Import existing PC artifacts into Hadoop development environment
Validate import logic before the actual import process to ensure compatibility
beta
Reuse and Import PC Metadata for Hadoop
Design integration and quality logic for Hadoop in a graphical and metadata driven environment
Configure where the integration logic should run – Hadoop or Native
beta
Design Mappings as Usual…
View complete generated and pushed down Hive or MR code from Hadoop mappings
beta
View Generated HiveQL
Orchestrating and Monitoring Hadoop
Informatica Workflow & Administration for Hadoop
Mixed Workflow Orchestration
One workflow running tasks on Hadoop and local environments
beta
Monitoring – Hive Query Plan Details
beta
The same Hive query is available in the Developer tool.
Monitoring – Hive Query Drilldown to M/R
beta
Traceability to individual M/R jobs. Link to Job Tracker URLs.
View Hive Query Details
Summary of Job Tracker status
Product Roadmap

Available Now:
• PowerExchange for Hadoop (HDFS and PC)
• HParser (including JSON Parsing)

1H, 2012:
• Hadoop Beta (9.5 Release)
• Native HDFS and Hive connectivity
• Integrated parsing on Hadoop
• Data Integration & Data Quality push down execution on Hadoop
• Data Discovery on Hadoop
• Mixed workload orchestration and administration

2H, 2012:
• Hadoop GA (9.5.1 Release)
• Native HDFS and Hive connectivity
• Integrated parsing on Hadoop
• Data Integration & Data Quality push down execution on Hadoop
• Data Discovery on Hadoop
• Mixed workload orchestration and administration

1H, 2013:
• Support for parallel processing of large file parsing
• Support for parsing of archived files
• Managed file transfer
• Metadata Manager & Lineage Integration
• Translation to PIG support
• Profiling API on Hadoop (call from Java or M/R)
• Persistence of profiling stats on Hadoop
• Additional DI & DQ transformations running on Hadoop
• Hadoop Planned Release
• Beta/Early Access: August – Oct, 2012
• GA: 9.5.1 Release, December 2012
• PowerCenter Big Data Edition – Q3 2012 (Tentative)
• PowerCenter Standard Edition
• Enterprise Grid Option for PowerCenter
• PowerExchange for Hadoop
• PowerExchange for Social Media
• PowerExchange for Data Warehouse Appliance
• hParser
• PowerCenter on Hadoop (Available Dec 2012)
When Is It Available?