© Copyright 2012 EMC Corporation. All rights reserved.


THE ROAD TO BIG DATA ANALYTICS

Introduction to Greenplum Database and HD (Hadoop)

First There Was The Data Warehouse

• A new architecture to host data from multiple sources to support decision-making

• Why the Data warehouse exists:

– Centralization of high value data

– Tools to process data into information

– Highly regulated environment

Legacy EDW

Then The MPP Database Was Introduced

A new approach to databases was required to handle the new analytics environment

Why the MPP Database exists:

– Data got larger

– Queries got uglier

– Performance became critical

– R/SAS/Statistical languages need to run in-database

Now There Is Hadoop

Traditional systems weren't built to handle the storage/processing needs of Web 2.0

Why Hadoop exists:

– Data volumes moved to the PB range

– Raw (unstructured) forms of data needed to be processed

– Cost needed to be low

– Processing must scale with storage

Value Of Data Co-Processing With Hadoop

Hadoop And MPP Represent A Paradigm Shift

• Requires a different approach to how you leverage data

• Removes limitations around what data is worth storing or analyzing

• Augments analysis capabilities to create competitive advantages

Initially Used For Web Logs But Now…

• Healthcare – EMR/Claims data

• Financials – Ticker/Social media data

• Retail – Transaction/Customer sentiment data

• Insurance/Automobile – Telemetry data

Different Tools Have Different Strengths

STRUCTURED UNSTRUCTURED

SQL

RDBMS

Tables and Schemas

GP MapReduce

Indexing

Partitioning

BI Tools

Different Tools Have Different Strengths

STRUCTURED UNSTRUCTURED

Hive

MapReduce

Pig

XML, JSON, …

Flat files

Schema on load

Directories

No ETL

Java

SequenceFile

Big Data Analytics Requires Both

STRUCTURED UNSTRUCTURED

SQL

RDBMS

Tables and Schemas

GP MapReduce

Indexing

Partitioning

BI Tools

Hive

MapReduce

Pig

XML, JSON, …

Flat files

Schema on load

Directories

No ETL

Java

SequenceFile

Delivered in a Unified Platform

• One system for Multi-structured analysis

• MPP Performance for data load and query

• Massive Scale

• Unified Collaboration, Management & Monitoring

GREENPLUM DATABASE

Industry-Leading Massively Parallel Processing (MPP) Performance

Greenplum Database

Extreme Performance for Analytics

• Optimized for BI and analytics

– Deep integration with statistical packages

– High performance parallel implementations

• Simple and automatic

– Just load and query like any database

– Tables are automatically distributed across nodes

• Extremely scalable

– MPP shared-nothing architecture

– All nodes can scan and process in parallel

– Linear scalability by adding nodes
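
The automatic distribution of tables across nodes can be pictured as hashing a distribution key to choose a segment. The following is a minimal Python sketch under stated assumptions: the MD5-based hash, the `customer_id` key, and the four-segment cluster are all hypothetical choices for illustration and are not Greenplum's actual hashing or API.

```python
import hashlib


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(rows, key_field, num_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = {i: [] for i in range(num_segments)}
    for row in rows:
        target = segment_for(str(row[key_field]), num_segments)
        segments[target].append(row)
    return segments


# Hypothetical table of 1,000 rows spread over a hypothetical 4-segment cluster.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
placement = distribute(rows, "customer_id", 4)

for seg_id, seg_rows in placement.items():
    print(f"segment {seg_id}: {len(seg_rows)} rows")
```

Because every row with the same key lands on the same segment, each segment can scan and process its own slice independently, which is the property the shared-nothing bullets describe.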

A Mature Enterprise Platform

PRODUCT FEATURES

CLIENT ACCESS & TOOLS

Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)

Shared-Nothing MPP

Parallel Query Optimizer

Polymorphic Data Storage™

CLIENT ACCESS

ODBC, JDBC, OLEDB,

MapReduce, etc.

CORE MPP ARCHITECTURE

Parallel Dataflow Engine

gNet™ Software Interconnect

Scatter/Gather Streaming™ Data Loading

Online System Expansion

Workload Management

GREENPLUM DATABASE ADAPTIVE SERVICES

LOADING & EXT. ACCESS

Petabyte-Scale Loading

Trickle Micro-Batching

Anywhere Data Access

STORAGE & DATA ACCESS

Hybrid Storage & Execution (Row- & Column-Oriented)

In-Database Compression

Multi-Level Partitioning

Indexes – Btree, Bitmap, etc.

External Table Support

LANGUAGE SUPPORT

Comprehensive SQL

Native MapReduce

SQL 2003 OLAP Extensions

Programmable Analytics

Analytics Extensions

3rd PARTY TOOLS

BI Tools, ETL Tools

Data Mining, etc

ADMIN TOOLS

Greenplum Command Center

Greenplum Package Manager

Greenplum Database

Performance Through Parallelism

• Scale-out architecture on standard commodity hardware

• Automatic parallelization

– Load and query like any database

– Automatically distributed tables across all nodes

– No need for manual partitioning or tuning

• Extremely scalable MPP shared-nothing architecture

– All nodes can scan and process in parallel

– Linear scalability by adding nodes

– On-line expansion when adding nodes

Loading

Interconnect

Greenplum Database

Most Powerful Data Loading Capabilities

Industry leading performance at 10+TB per-hour per-rack

Scatter-Gather Streaming™ provides true linear scaling

Support for both large-batch and continuous real-time loading strategies

Enable complex data transformations “in-flight”

Transparent interfaces to loading via support files, application, and services

Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks deliver more than 20 TB per hour.
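
The linear-scaling claim above amounts to simple arithmetic. The sketch below assumes the slide's 10 TB per hour per rack figure as a lower bound and assumes perfectly linear scaling; the function names are hypothetical and the numbers are projections, not measurements.

```python
def projected_load_rate(racks: int, per_rack_tb_per_hour: float = 10.0) -> float:
    """Projected aggregate load rate in TB/hour, assuming perfectly linear scaling."""
    if racks < 0:
        raise ValueError("racks must be non-negative")
    return racks * per_rack_tb_per_hour


def hours_to_load(data_tb: float, racks: int, per_rack_tb_per_hour: float = 10.0) -> float:
    """Projected hours needed to load data_tb terabytes on the given number of racks."""
    rate = projected_load_rate(racks, per_rack_tb_per_hour)
    if rate == 0:
        raise ValueError("at least one rack is required")
    return data_tb / rate


print(projected_load_rate(2))   # two racks at the assumed 10 TB/hour each
print(hours_to_load(100, 2))    # hypothetical 100 TB data set on two racks
```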

Greenplum Database

Polymorphic Table Storage™

• Enable Information Lifecycle Management (ILM)

• Storage types can be mixed within a table or database

– Four table types: heap, row-oriented AO, column-oriented, external

– Block compression: Gzip (levels 1-9), QuickLZ

• Provide the choice of processing model for any table or partition

TABLE 'CUSTOMER'

Partitions: Mar '11, Apr '11, May '11, Jun '11, Jul '11, Aug '11, Sept '11, Oct '11, Nov '11

Row-oriented for HOT DATA

Column-oriented for COLD DATA
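
The idea of mixing storage types across the monthly partitions of one table can be sketched as a policy that picks an orientation from partition age. This is only an illustration: the slide does not say which months count as hot, so the three-month hot window, the reference date, and the function names below are all assumptions.

```python
from datetime import date


def months_between(older: date, newer: date) -> int:
    """Whole calendar months from older to newer."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def storage_for_partition(partition_month: date, as_of: date, hot_months: int = 3) -> str:
    """Pick a storage orientation by partition age (hot window is an assumed policy)."""
    age = months_between(partition_month, as_of)
    return "row-oriented" if age < hot_months else "column-oriented"


# Monthly partitions Mar '11 through Nov '11, evaluated as of Nov '11 (assumed date).
partitions = [date(2011, month, 1) for month in range(3, 12)]
layout = {
    p.isoformat()[:7]: storage_for_partition(p, as_of=date(2011, 11, 1))
    for p in partitions
}

for month, orientation in layout.items():
    print(month, orientation)
```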

Greenplum Database

Parallel Query Optimizer

PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE

Gather Motion 4:1(Slice 3)

Sort

HashAggregate

HashJoin

Redistribute Motion 4:4(Slice 1)

HashJoin

Hash Hash

HashJoin

Hash

Broadcast Motion 4:4(Slice 2)

Seq Scan on motion

Seq Scan on customer

Seq Scan on lineitem

Seq Scan on orders

• Cost-based optimization looks for the most efficient plan

• Physical plan contains scans, joins, sorts, aggregations, etc.

• Global planning avoids sub-optimal 'SQL pushing' to segments

• Directly inserts 'motion' nodes for inter-segment communication
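
The three motion types named in the plan (Redistribute, Broadcast, Gather) can be illustrated with plain Python lists standing in for segments. This is a conceptual sketch only: the data, the four-segment layout, and the helper names are hypothetical, and it does not reflect how Greenplum implements motions internally.

```python
def redistribute(segments, key):
    """Redistribute Motion N:N: re-hash rows so equal keys land on the same segment."""
    n = len(segments)
    out = [[] for _ in range(n)]
    for seg in segments:
        for row in seg:
            out[hash(row[key]) % n].append(row)
    return out


def broadcast(segments):
    """Broadcast Motion N:N: every segment receives a full copy of all rows."""
    all_rows = [row for seg in segments for row in seg]
    return [list(all_rows) for _ in segments]


def gather(segments):
    """Gather Motion N:1: a single node collects the rows from all segments."""
    return [row for seg in segments for row in seg]


def local_hash_join(left, right, key):
    """Hash join executed independently on one segment."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical data: orders start out placed arbitrarily across 4 segments.
orders = [{"order_id": i, "cust_id": i % 5} for i in range(20)]
customers = [{"cust_id": c, "name": f"c{c}"} for c in range(5)]
orders_by_seg = [orders[i::4] for i in range(4)]
customers_by_seg = redistribute([customers[i::4] for i in range(4)], "cust_id")

# Redistribute orders on the join key, join locally per segment, then gather.
orders_on_key = redistribute(orders_by_seg, "cust_id")
joined = gather([
    local_hash_join(o, c, "cust_id")
    for o, c in zip(orders_on_key, customers_by_seg)
])
print(len(joined), "joined rows")
```

A broadcast would be the alternative for a small table: copying all five customers to every segment avoids moving the larger orders table at all, which is the kind of trade-off a cost-based planner weighs.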

gNet Software Interconnect

A supercomputing-based “soft-switch” responsible for

– Efficiently pumping streams of data between motion nodes during query-plan execution

– Delivering messages, moving data, collecting results, and coordinating work among the segments in the system

High Performance gNet for Hadoop

– Parallel query access

– Parallel data exchange

Greenplum Database

High Availability

Master Server Data Protection

– Replicated transaction logs for server failure

– Optional RAID protection for drive failures

– Upon server failure: standby server activated, administrator alerted, orchestrated failover

Segment Server Data Protection

– Mirrored segments for server failures

– Optional RAID protection for drive failures

– Upon server failure: mirrored segments take over with no loss of service

– Fast online differential recovery

Master

Segment Segment Segment Segment

Master

Simple To Manage

Greenplum Command Center

– Complete platform management and control

Greenplum Package Manager

– Automates install, uninstall, update, and query for analytics extensions

– Support package migration during upgrade, segment recovery, expansion, and standby initialization

In-Database Analytics

Bringing the power of parallelism to commonly-used modeling and analytics functions

In-database analytics

– SAS – HPA, Access, and Scoring Accelerator

– MADlib – An open-source library of advanced analytics functions

– Analytics extensions supported, including

▪ PostGIS - Geospatial support, PL/R - Statistical Computing, PL/Java, PL/Perl, etc.

SAS and Greenplum Partnership Delivers High-Performance Computing and MAD Analytics

Access relational data-sets for agile analysis

– SAS/ACCESS provides fast, transparent and secure access to Greenplum data.

Leverage database scalability for rapid model deployment

– SAS Scoring Accelerator publishes models for execution in parallel across the Greenplum cluster.

Build complex models at massive scales

– SAS HPA Appliance combines SAS In-Memory Analytics with Greenplum parallelism to produce record-breaking scalability and performance.

GREENPLUM HD

Hadoop For The Enterprise

Greenplum HD

People And Skills Challenges

Establish a strategic vision

– Roadmap for Hadoop and unified analytics

Hadoop Architecture Services

– POC planning and deployment

– Installation and best practices

GPHD Training & Education

– Business, Developer, Data Scientist, Administration

Access to Analytics Workbench

Greenplum HD Platform Delivery

Simple, efficient and scalable

Proven at scale in 1,000 node test environment (AWB) with worldwide EMC support

Purpose-built Hadoop infrastructure

Pluggable storage layer

Management & monitoring at scale

Greenplum HD Platform Delivery

GREENPLUM COMMAND CENTER

Pluggable Storage Layer (HDFS API)

MapReduce Layer

Hadoop Tools (Pig, Hive, HBase, Zookeeper, Mahout, etc…)

Apache HDFS

Greenplum Chorus

Isilon OneFS

Greenplum HD Platform Delivery

• Spring Hadoop: Integrates the Spring and Hadoop frameworks

• Mahout: Scalable machine learning libraries

• HBase: Database for random, real-time read/write access

• Hive: System for SQL-like querying of data on top of HDFS

• Pig: Procedural language that abstracts MapReduce

• Zookeeper: Highly reliable distributed coordination

• MapReduce: Framework for writing scalable data applications

• HDFS: Hadoop Distributed File System
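
To make the MapReduce entry above concrete, here is a minimal word-count sketch of the map, shuffle, and reduce phases in plain Python. It runs in a single process and does not use the Hadoop API; the function names and sample input are hypothetical and chosen only to show the programming model.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data analytics", "big data platform"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on many nodes, with the framework handling the shuffle between them; tools such as Pig and Hive generate jobs of this shape from higher-level scripts and queries.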

Productivity with Hadoop

Establish Chorus Connection to GPHD Cluster

Browse HDFS files

Leverage gNet integration to parse HDFS using SQL interface

– Determine inherent data structure

Collaboration with business, analytics and infrastructure

Integration with Existing Technologies

Greenplum gNet

GREENPLUM HD GREENPLUM DATABASE

Java/Perl/Python Command Line PigLatin HQL ODBC JDBC

PARALLEL QUERY INTEGRATION

PARALLEL IMPORT/EXPORT

SQL HDFS

Data Access & Query Layer

Create end-to-end workflows

Leverage existing skills

Big Data Analytics Requires Both

STRUCTURED UNSTRUCTURED

Greenplum Delivers

Big Data in a Unified Analytics Platform
