How to Build Modern Data Architectures Both On Premises and in the Cloud Jacque Istok @jstok Pivotal Confidential–Internal Use Only

How to Build Modern Data Architectures Both On Premises and in the Cloud

Page 1: How to Build Modern Data Architectures Both On Premises and in the Cloud

How to Build Modern Data Architectures

Both On Premises and in the Cloud Jacque Istok @jstok

Page 2: How to Build Modern Data Architectures Both On Premises and in the Cloud

© Copyright 2017 Dell Inc. 2

The New Normal

DATA DEVICES

Law Enforcement

Media

Banks

Delivery Services

Marketers

Government

Private Investigators/Lawyers

Individuals Employers

Data Users/Buyers

Analytic Services

Advertising

Catalog Co-ops

List Brokers

Websites

Information Brokers

Credit Bureaus Media

Archives

Data Aggregators

FINANCIAL

GOVERNMENT

PHONE/ TV

INTERNET MEDICAL

RETAIL

Page 3: How to Build Modern Data Architectures Both On Premises and in the Cloud

3 © 2017 Pivotal Software, Inc. All rights reserved.

Great organizations leverage software, analytics, and insights to take better actions and fundamentally change and pioneer entirely new operational business models

Page 4: How to Build Modern Data Architectures Both On Premises and in the Cloud

Open Source Innovation

Parallel Processing

Cloud Native

Continuous Delivery

Loosely-coupled Microservices

Data Science and Machine Learning

Our View on Modern Analytics

Page 5: How to Build Modern Data Architectures Both On Premises and in the Cloud

Pipeline of a Modern Data Driven App

Data Ecosystem Business Levers

Apps

MLlib PL/X

Model Building

Model Tuning

Continuous Model Improvement

Data Feeds

Ingest Filter Enrich Route

Page 6: How to Build Modern Data Architectures Both On Premises and in the Cloud

Needs of a Modern Data Architecture

Apps / Microservices

Messaging / Integration

Stream / Event Processing

Data Science / ML Libraries

Data Lake / Deep Storage

Distributed MPP Analytics

•  MySQL •  Redis •  PostgreSQL •  Cassandra •  MongoDB

•  Kafka •  Spark Streaming •  Storm •  Samza

•  R libraries •  Python libraries •  Spark MLlib •  SAS

•  HDFS •  AWS S3 •  Azure ADLS •  Compatible Hardware Implementations

•  Amazon EMR •  Hive •  Impala •  Apache HAWQ •  Redshift

Spring Cloud Data Flow

Page 7: How to Build Modern Data Architectures Both On Premises and in the Cloud

What Does It Take To Build Modern Analytics?

Page 8: How to Build Modern Data Architectures Both On Premises and in the Cloud

Users

Page 9: How to Build Modern Data Architectures Both On Premises and in the Cloud

User Centered Design

“A design approach that supports the entire development process with user-centered activities, in order to create a product that is easy to use and of added value to the intended users.”

www.usabilitynet.org

Page 10: How to Build Modern Data Architectures Both On Premises and in the Cloud

Is It Useful?

usage = value

rarely used = waste

Page 11: How to Build Modern Data Architectures Both On Premises and in the Cloud

Users Different Users Want Different Things

IT
●  Tasked with legacy system integration
●  Controls security access to comply with policy and laws
●  Operationalization
●  Enterprise Architecture

Developers
●  Build applications to interoperate
●  Develop reports and dashboards
●  Extract and Transform data

Business Analysts
●  Subject Matter Experts
●  Primary consumer of analytical models
●  SQL or BI expert

Data Scientists
●  Mathematically astute
●  Intellectual curiosity, analytical exploration
●  Domain Knowledge
●  Communication in the form of visualization
●  SQL and analytical libraries expert

Page 12: How to Build Modern Data Architectures Both On Premises and in the Cloud

Analytical Applications

Page 13: How to Build Modern Data Architectures Both On Premises and in the Cloud

Analytical Applications A Healthy Mix of Old and New

SQL Custom Apps BI/Reporting Machine Learning AI

Page 14: How to Build Modern Data Architectures Both On Premises and in the Cloud

Native Interfaces

Page 15: How to Build Modern Data Architectures Both On Premises and in the Cloud

Native Interfaces ANSI SQL

●  The Industry Standard to be clear, less error-prone, and direct

●  Interoperability and consistency

●  It’s everywhere

Page 16: How to Build Modern Data Architectures Both On Premises and in the Cloud

Native Interfaces Proprietary SQL

●  Industry Non Standard

●  PostgreSQL PL/PGSQL

●  Teradata SQL

●  Oracle PL/SQL

Page 17: How to Build Modern Data Architectures Both On Premises and in the Cloud

Linear Systems •  Sparse and Dense Solvers •  Linear Algebra

Matrix Factorization •  Singular Value Decomposition (SVD) •  Low Rank

Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Ordinal Regression •  Cox Proportional Hazards Regression •  Elastic Net Regularization •  Robust Variance (Huber-White), Clustered Variance, Marginal Effects

Other Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Apriori) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Random Forest •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation •  Naïve Bayes •  Support Vector Machines (SVM) •  Prediction Metrics •  K-Nearest Neighbors

Descriptive Statistics Sketch-Based Estimators •  CountMin (Cormode-Muth.) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation and Covariance Summary

Utility Modules Array and Matrix Operations Sparse Vectors Random Sampling Probability Functions Data Preparation PMML Export Conjugate Gradient Stemming Sessionization Pivot Path Functions Encoding Categorical Variables

Inferential Statistics Hypothesis Tests

Time Series •  ARIMA

May 2017

Graph •  PageRank •  Single Source Shortest Path

Native Interfaces Machine Learning, Statistical, Graph, Path Analytics
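
To make the "Sketch-Based Estimators" entry concrete, here is a minimal Python sketch of a CountMin structure. It is an illustration only and is not how the in-database library implements it: the class name, the `width`/`depth` defaults, and the SHA-256-based hashing are all assumptions made for this example. The idea is that a fixed-size table of counters yields frequency estimates that can overcount on hash collisions but never undercount.

```python
import hashlib


class CountMinSketch:
    """Illustrative CountMin sketch (hypothetical class, not a library API)."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Derive a per-row hash by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # The minimum across rows is the least collision-inflated counter.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Usage: call `add("some_value")` once per observed row, then `estimate("some_value")` returns an upper-bounded approximation of its frequency using memory that does not grow with the number of distinct values.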

Page 18: How to Build Modern Data Architectures Both On Premises and in the Cloud

Designed for very large graphs (billions of vertices/edges) No need to move data and transform for external graph engine Familiar SQL interface

Algorithms: •  All pairs shortest path* •  Breadth first traversal* •  Connected components* •  Multiple graph measures* •  PageRank •  Single source shortest path

Native Interfaces Graph Analytics
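
As a rough illustration of what "single source shortest path" computes, the sketch below runs Dijkstra's algorithm over an edge list shaped like rows of an edge table `(src, dest, weight)`. Dijkstra is swapped in here purely for clarity; the slide does not say which algorithm the platform uses, and the function name and row layout are assumptions. It requires non-negative weights.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, float]  # (src, dest, weight), like rows of an edge table


def single_source_shortest_path(edges: Iterable[Edge], source: str) -> Dict[str, float]:
    """Return the shortest distance from `source` to every reachable vertex."""
    adjacency: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for src, dest, weight in edges:
        if weight < 0:
            raise ValueError("Dijkstra's algorithm needs non-negative weights")
        adjacency[src].append((dest, weight))

    distance: Dict[str, float] = {source: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, source)]
    while heap:
        dist, vertex = heapq.heappop(heap)
        if dist > distance.get(vertex, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in adjacency[vertex]:
            candidate = dist + weight
            if candidate < distance.get(neighbor, float("inf")):
                distance[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distance
```

Vertices that cannot be reached from the source are simply absent from the result. Running this kind of computation where the edge table already lives is what avoids the export-and-transform step the slide mentions.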

Page 19: How to Build Modern Data Architectures Both On Premises and in the Cloud

Native Interfaces Programmatic

•  Current Computing Interfaces •  User Defined Types •  User Defined Functions •  User Defined Aggregates

•  Foundational work for containerized Python and R compute environments
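
A user-defined aggregate in a PostgreSQL-derived MPP database is typically described by a state transition function, an optional combine function that merges partial states, and an optional final function. The Python sketch below mimics that shape for a parallel mean. It is a conceptual model only: the function names and the "segments" list are hypothetical stand-ins, not database APIs.

```python
from functools import reduce
from typing import List, Optional, Sequence, Tuple

State = Tuple[float, int]  # (running sum, row count)
EMPTY: State = (0.0, 0)


def transition(state: State, value: float) -> State:
    """Fold one input row into the running state."""
    return (state[0] + value, state[1] + 1)


def combine(left: State, right: State) -> State:
    """Merge two partial states computed on different segments."""
    return (left[0] + right[0], left[1] + right[1])


def final(state: State) -> Optional[float]:
    """Turn the merged state into the aggregate's result."""
    return state[0] / state[1] if state[1] else None


def parallel_mean(segments: Sequence[List[float]]) -> Optional[float]:
    # Each inner list stands in for the rows held by one segment.
    partials = [reduce(transition, rows, EMPTY) for rows in segments]
    return final(reduce(combine, partials, EMPTY))
```

Because each segment only ships its small partial state, the aggregate scales with the number of segments rather than with the volume of rows moved.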

Page 20: How to Build Modern Data Architectures Both On Premises and in the Cloud

GPText: ANSI SQL + Text •  Leveraging Apache Solr and GPDB •  5 years commercial production experience •  Apache MADlib integration for machine learning on text data •  PL/Python and PL/Java integration for Natural Language Processing Use Cases •  Communications compliance and monitoring •  Customer Sentiment analysis •  Document Search and Query •  Social Media Processing, etc.

Native Interfaces Text Analytics

Page 21: How to Build Modern Data Architectures Both On Premises and in the Cloud

Round earth calculations

Current Key Features: •  Points, Lines, Polygons, Perimeter, Area, Intersection, Contains, Distance, Long/Lat

Spatial Indexes & Bounding Boxes

Raster Support

Native Interfaces GeoSpatial Analytics
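
For a sense of what a "round earth" distance between two Long/Lat points involves, here is a small haversine sketch. It treats the Earth as a sphere with an assumed mean radius, which is a simplification chosen for this example; it is not the geospatial extension's implementation, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius for a spherical model


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

A flat-plane distance on raw degrees would be badly wrong at these scales, which is why round-earth support matters for analytics over latitude and longitude columns.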

Page 22: How to Build Modern Data Architectures Both On Premises and in the Cloud

Multi Structured Data

Page 23: How to Build Modern Data Architectures Both On Premises and in the Cloud

Structured Data

Multi Structured Data

...

Unstructured / Semi-structured

Page 24: How to Build Modern Data Architectures Both On Premises and in the Cloud

Sources & Pipelines

Page 25: How to Build Modern Data Architectures Both On Premises and in the Cloud

Analyze, interact, and engage with diverse data sources, localities and temperatures

Real Separation of Compute and Data Source

Hadoop Data Lakes

Public Cloud Data Lakes Hybrid Local

Massively Parallel Analytics Environment

Page 26: How to Build Modern Data Architectures Both On Premises and in the Cloud

Spring Cloud Data Flow is a microservices toolkit for building data integration and real-time data processing pipelines. The Data Flow server provides interfaces to compose and deploy pipelines onto modern runtimes such as Cloud Foundry, Kubernetes, Apache Mesos, or Apache YARN.

Spring Cloud Data Flow (SCDF)

Ingest - Route - Filter - Enrich
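
The four stage names can be pictured as small composable steps. The Python sketch below is a conceptual stand-in, not Spring Cloud Data Flow code: the record shape (`device,reading` lines), the function names, and the site lookup are all hypothetical. The stages are chained here as ingest, filter, enrich, route, but because each one only consumes and produces records, they could be recomposed in another order.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def ingest(raw_lines: Iterable[str]) -> Iterator[Record]:
    """Parse 'device,reading' lines into records, skipping malformed input."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue
        try:
            yield {"device": parts[0], "reading": float(parts[1])}
        except ValueError:
            continue


def filter_stage(records: Iterable[Record], min_reading: float) -> Iterator[Record]:
    """Keep only records at or above a threshold."""
    return (r for r in records if r["reading"] >= min_reading)


def enrich(records: Iterable[Record], site_by_device: Dict[str, str]) -> Iterator[Record]:
    """Attach reference data (here, a site name) to each record."""
    for r in records:
        yield {**r, "site": site_by_device.get(str(r["device"]), "unknown")}


def route(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Send each record to a per-site sink (modeled as a dict of lists)."""
    sinks: Dict[str, List[Record]] = {}
    for r in records:
        sinks.setdefault(str(r["site"]), []).append(r)
    return sinks


def run_pipeline(raw_lines, min_reading, site_by_device):
    return route(enrich(filter_stage(ingest(raw_lines), min_reading), site_by_device))
```

In a real deployment each stage would run as its own microservice connected by a message broker; the generator chain above only mirrors the data flow between them.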

Page 27: How to Build Modern Data Architectures Both On Premises and in the Cloud

Apache Kafka and SCDF

Data Feeds

Integrated Data Ingest layer

SCDF (Cloud ETL 2.0)

Page 28: How to Build Modern Data Architectures Both On Premises and in the Cloud

Flexible Deployment

Page 29: How to Build Modern Data Architectures Both On Premises and in the Cloud

Run Your Analytics Anywhere On-Premises Private Cloud Public Cloud

•  Infrastructure Agnostic: A portable, 100% software solution •  Same platform, no switching/migration cost

Page 30: How to Build Modern Data Architectures Both On Premises and in the Cloud

ANALYTICAL APPLICATIONS

NATIVE INTERFACES

MULTI-STRUCTURED DATA

SOURCES & PIPELINES

Structured Data

JDBC, ODBC

SQL

ANSI SQL

USERS

FLEXIBLE DEPLOYMENT

Local Storage

Other RDBMSes Spark GemFire

Cloud Object Storage

HDFS

JSON, Apache AVRO, Apache Parquet, XML, & More

Teradata SQL

Other DB SQL

Apache MADlib

ML/Statistics/Graph

Python, R, Java, Perl, C

Programmatic

Apache SOLR

Text

PostGIS

GeoSpatial

Custom Apps BI / Reporting Machine Learning AI

IT Dev Business Analysts

Data Scientists

On-Premises Public Clouds

Private Clouds

Fully Managed Clouds

MODERN CLOUD ANALYTICS PLATFORM

Kafka ETL Spring Cloud Data Flow

Massively Parallel (MPP)

PostgreSQL Kernel

Petabyte Scale Loading

Query Optimizer (GPORCA)

Workload Manager

Polymorphic Storage

Command Center

SQL Compatibility (Hyper-Q)

Modern Cloud Analytics Platform

Page 32: How to Build Modern Data Architectures Both On Premises and in the Cloud

FRAUD MANAGEMENT RISK MANAGEMENT

CYBERSECURITY MANUFACTURING

PREDICTIVE MAINTENANCE

ELECTRICITY GRID

Pivotal Greenplum: Not just a Database An Analytics Solution for every challenge

Page 33: How to Build Modern Data Architectures Both On Premises and in the Cloud

Pivotal Greenplum: Learn More

Find out more about Pivotal Greenplum at

https://pivotal.io/pivotal-greenplum

OR learn more about the open source project at

http://greenplum.org/

OR give it a try yourself at

Amazon AWS or Microsoft Azure or via Download

Page 34: How to Build Modern Data Architectures Both On Premises and in the Cloud

Thank you! Jacque Istok @jstok
