How to Build Modern Data Architectures
Both On Premises and in the Cloud
Jacque Istok @jstok
Pivotal Confidential–Internal Use Only
© Copyright 2017 Dell Inc. 2
The New Normal
DATA DEVICES
Law Enforcement
Media
Banks
Delivery Services
Marketers
Government
Private Investigators/Lawyers
Individuals Employers
Data Users/Buyers
Analytic Services
Advertising
Catalog Co-ops
List Brokers
Websites
Information Brokers
Credit Bureaus Media
Archives
Data Aggregators
FINANCIAL
GOVERNMENT
PHONE/ TV
INTERNET MEDICAL
RETAIL
3 © 2017 Pivotal Software, Inc. All rights reserved.
Great organizations leverage software, analytics, and insights to take better actions and fundamentally change and pioneer entirely new operational business models
Open Source Innovation
Parallel Processing
Cloud Native Continuous
Delivery
Loosely-coupled Microservices
Data Science and Machine Learning
Our View on Modern Analytics
Pipeline of a Modern Data Driven App
Data Ecosystem Business Levers
Apps
MLlib PL/X
Model Building
Model Tuning
Continuous Model Improvement
Data Feeds
Ingest Filter Enrich Route
Needs of a Modern Data Architecture
Apps / Microservices
Messaging / Integration
Stream / Event Processing
Data Science / ML Libraries
Data Lake / Deep Storage
Distributed MPP Analytics
• MySQL • Redis • PostgreSQL • Cassandra • MongoDB
• Kafka • Spark Streaming • Storm • Samza
• R libraries • Python libraries • Spark MLlib • SAS
• HDFS • AWS S3 • Azure ADLS • Compatible Hardware Implementations
• Amazon EMR • Hive • Impala • Apache HAWQ • Redshift
Spring Cloud Data Flow
What Does It Take To Build Modern Analytics?
Users
User Centered Design
“A design approach that supports the entire development
process with user-centered activities, in order to create a
product that is easy to use and of added value to the
intended users.”
www.usabilitynet.org
Is It Useful?
usage = value
rarely used = waste
Users Different Users Want Different Things
IT
● Tasked with legacy system integration
● Controls security access to comply with policy and laws
● Operationalization
● Enterprise Architecture
Developers
● Build applications to interoperate
● Develop reports and dashboards
● Extract and Transform data
Business Analysts
● Subject Matter Experts
● Primary consumer of analytical models
● SQL or BI expert
Data Scientists
● Mathematically astute
● Intellectual curiosity, analytical exploration
● Domain Knowledge
● Communication in the form of visualization
● SQL and analytical libraries expert
Analytical Applications
Analytical Applications A Healthy Mix of Old and New
SQL Custom Apps BI/Reporting Machine Learning AI
Native Interfaces
Native Interfaces ANSI SQL
● The industry standard: clear, direct, and less error-prone
● Interoperability and consistency
● It’s everywhere
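To make the interoperability point concrete, here is a minimal sketch that runs a standard SQL aggregate through Python's built-in sqlite3 module. SQLite is only a stand-in engine, and the `sales` table, its columns, and the `sales_by_region` helper are hypothetical names chosen for illustration. The same SELECT ... GROUP BY statement should run largely unchanged on any engine that follows the standard.

```python
import sqlite3

def sales_by_region(rows):
    """Load (region, amount) rows into a scratch table and total them per region."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
    try:
        conn.execute("CREATE TABLE sales (region VARCHAR(32), amount INTEGER)")
        conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total "
            "FROM sales GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

Only the connection line is engine-specific; the SQL text is what carries over between systems.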
Native Interfaces Proprietary SQL
● Not an industry standard
● PostgreSQL PL/pgSQL
● Teradata SQL
● Oracle PL/SQL
Linear Systems • Sparse and Dense Solvers • Linear Algebra
Matrix Factorization • Singular Value Decomposition (SVD) • Low Rank
Generalized Linear Models • Linear Regression • Logistic Regression • Multinomial Logistic Regression • Ordinal Regression • Cox Proportional Hazards Regression • Elastic Net Regularization • Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms • Principal Component Analysis (PCA) • Association Rules (Apriori) • Topic Modeling (Parallel LDA) • Decision Trees • Random Forest • Conditional Random Field (CRF) • Clustering (K-means) • Cross Validation • Naïve Bayes • Support Vector Machines (SVM) • Prediction Metrics • K-Nearest Neighbors
Descriptive Statistics • Sketch-Based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values) • Correlation and Covariance • Summary
Utility Modules • Array and Matrix Operations • Sparse Vectors • Random Sampling • Probability Functions • Data Preparation • PMML Export • Conjugate Gradient • Stemming • Sessionization • Pivot • Path Functions • Encoding Categorical Variables
Inferential Statistics • Hypothesis Tests
Time Series • ARIMA
May 2017
Graph • PageRank • Single Source Shortest Path
Native Interfaces Machine Learning, Statistical, Graph, Path Analytics
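As one example of the sketch-based estimators listed above, the following is a minimal Count-Min sketch in plain Python. It is an illustration of the general technique only, not MADlib's in-database implementation; the class name, the default `width` and `depth`, and the use of SHA-256 for the row hashes are all assumptions made for this sketch.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter using `depth` hash rows of `width` counters."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

The estimate never undercounts: it is exact when no collisions occur and an overestimate otherwise, which is why memory stays fixed regardless of how many distinct values are seen.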
• Designed for very large graphs (billions of vertices/edges)
• No need to move and transform data for an external graph engine
• Familiar SQL interface
Algorithms: • All pairs shortest path* • Breadth first traversal* • Connected components* • Multiple graph measures* • PageRank • Single source shortest path
Native Interfaces Graph Analytics
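To show what a single source shortest path computation returns, here is a small sketch over an edge list shaped like an edge table of (source, destination, weight) rows. Dijkstra's algorithm is swapped in here for brevity and assumes non-negative weights; this is not a description of how the in-database version is implemented, and the function name and data layout are hypothetical.

```python
import heapq

def single_source_shortest_path(edges, source):
    """Return {vertex: distance} from `source` over (src, dest, weight) edges.

    Weights must be non-negative (Dijkstra's algorithm).
    Vertices that cannot be reached from `source` are omitted.
    """
    adjacency = {}
    for src, dest, weight in edges:
        adjacency.setdefault(src, []).append((dest, weight))
        adjacency.setdefault(dest, [])

    distances = {source: 0}
    frontier = [(0, source)]
    while frontier:
        dist, vertex = heapq.heappop(frontier)
        if dist > distances.get(vertex, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbor, weight in adjacency.get(vertex, []):
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(frontier, (candidate, neighbor))
    return distances
```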
Native Interfaces Programmatic
• Current Computing Interfaces • User Defined Types • User Defined Functions • User Defined Aggregates
• Foundational work for containerized Python and R compute environments
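A user defined aggregate in a parallel database is typically built from small functions: one that folds a row into a running state, one that merges partial states from different workers, and one that produces the final result. The sketch below simulates that shape in plain Python for an average. The function names and the list-of-lists stand-in for data segments are assumptions for illustration, not the platform's API.

```python
from functools import reduce

def avg_transition(state, value):
    """Fold one row into the running (sum, count) state."""
    total, count = state
    return (total + value, count + 1)

def avg_combine(left, right):
    """Merge two partial states computed on different segments."""
    return (left[0] + right[0], left[1] + right[1])

def avg_final(state):
    """Turn the merged state into the aggregate's result (None if no rows)."""
    total, count = state
    return total / count if count else None

def parallel_average(segments):
    """Aggregate each segment's rows independently, then combine and finalize."""
    partials = [reduce(avg_transition, rows, (0, 0)) for rows in segments]
    merged = reduce(avg_combine, partials, (0, 0))
    return avg_final(merged)
```

Keeping the state as (sum, count) rather than a running mean is what makes the merge step exact: partial averages cannot be combined without their counts.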
GPText: ANSI SQL + Text
• Leveraging Apache Solr and GPDB
• 5 years commercial production experience
• Apache MADlib integration for machine learning on text data
• PL/Python and PL/Java integration for Natural Language Processing
Use Cases
• Communications compliance and monitoring
• Customer Sentiment analysis
• Document Search and Query
• Social Media Processing, etc.
Native Interfaces Text Analytics
Current Key Features:
• Points, Lines, Polygons, Perimeter, Area, Intersection, Contains, Distance, Long/Lat
• Round earth calculations
• Spatial Indexes & Bounding Boxes
• Raster Support
Native Interfaces GeoSpatial Analytics
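For a sense of what a round earth distance calculation on Long/Lat points involves, here is a minimal haversine sketch. It treats the Earth as a sphere with an assumed mean radius, which is a simplification; a geospatial engine may use a more accurate spheroid model. The function and constant names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius for this sketch

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin outside its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

A flat (planar) distance on the same coordinates would ignore curvature and the convergence of meridians, which is why the two approaches diverge over long distances and at high latitudes.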
Multi Structured Data
Structured Data
Multi Structured Data
...
Unstructured / Semi-structured
Sources & Pipelines
Analyze, interact, and engage with diverse data sources, localities and temperatures
Real Separation of Compute and Data Source
Hadoop Data Lakes
Public Cloud Data Lakes Hybrid Local
Massively Parallel Analytics Environment
Spring Cloud Data Flow is a microservices toolkit for building data integration and real-time data processing pipelines. The Data Flow server provides interfaces to compose and deploy pipelines onto modern runtimes such as Cloud Foundry, Kubernetes, Apache Mesos, or Apache YARN.
Spring Cloud Data Flow (SCDF)
Ingest - Route - Filter - Enrich
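The four stages named above can be pictured as small composable steps. The sketch below chains them with Python generators, in the order ingest, filter, enrich, route. It is a conceptual stand-in only and does not use the Spring Cloud Data Flow API; the event fields (`user`, `region`), the lookup table, and all function names are hypothetical.

```python
import json
from collections import defaultdict

def ingest(lines):
    """Ingest: parse raw JSON lines into event dicts."""
    for line in lines:
        yield json.loads(line)

def keep_valid(events):
    """Filter: drop events that carry no user id."""
    for event in events:
        if event.get("user"):
            yield event

def enrich(events, regions):
    """Enrich: attach a region looked up from the user id."""
    for event in events:
        yield {**event, "region": regions.get(event["user"], "unknown")}

def route(events):
    """Route: collect events into one sink per region."""
    sinks = defaultdict(list)
    for event in events:
        sinks[event["region"]].append(event)
    return dict(sinks)

def run_pipeline(lines, regions):
    """Compose the stages; each one consumes the previous stage lazily."""
    return route(enrich(keep_valid(ingest(lines)), regions))
```

Because every stage only depends on the shape of the events it receives, stages can be reordered, replaced, or scaled independently, which is the property a microservices-based pipeline toolkit builds on.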
Apache Kafka and SCDF
Data Feeds
Integrated Data Ingest layer
SCDF (Cloud ETL 2.0)
Flexible Deployment
Run Your Analytics Anywhere On-Premises Private Cloud Public Cloud
• Infrastructure Agnostic: A portable, 100% software solution • Same platform, no switching/migration cost
ANALYTICAL APPLICATIONS
NATIVE INTERFACES
MULTI- STRUCTURED DATA
SOURCES & PIPELINES
Structured Data
JDBC, ODBC
SQL
ANSI SQL
USERS
FLEXIBLE DEPLOYMENT
Local Storage
Other RDBMSes Spark GemFire
Cloud Object Storage
HDFS
JSON, Apache AVRO, Apache Parquet, XML, & More
Teradata SQL
Other DB SQL
Apache MADlib
ML/Statistics/Graph
Python, R, Java, Perl, C
Programmatic
Apache SOLR
Text
PostGIS
GeoSpatial
Custom Apps BI / Reporting Machine Learning AI
IT Dev Business Analysts
Data Scientists
On-Premises Public Clouds
Private Clouds
Fully Managed Clouds
MODERN CLOUD ANALYTICS PLATFORM
Kafka ETL Spring Cloud Data Flow
Massively Parallel (MPP)
PostgreSQL Kernel
Petabyte Scale Loading
Query Optimizer (GPORCA)
Workload Manager
Polymorphic Storage
Command Center
SQL Compatibility (Hyper-Q)
Modern Cloud Analytics Platform
FRAUD MANAGEMENT RISK MANAGEMENT
CYBERSECURITY MANUFACTURING
PREDICTIVE MAINTENANCE
ELECTRICITY GRID
Pivotal Greenplum: Not Just a Database, an Analytics Solution for Every Challenge
Pivotal Greenplum: Learn More
Find out more about Pivotal Greenplum at
https://pivotal.io/pivotal-greenplum
OR learn more about the open source project at
http://greenplum.org/
OR give it a try yourself at
Amazon AWS or Microsoft Azure or via Download
Thank you! Jacque Istok @jstok