© Cloudera, Inc. All rights reserved.
Apache Kudu Webinar Series
Understanding and Unlocking the Value of Real-Time Data
Ryan Lippert | Cloudera
Michele Goetz | Forrester (Special Guest)
Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will discuss how to make it a reality.
Part 4: Technical Deep-Dive into Apache Kudu (New!)
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a PMC member.
Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic applications
• Designed for next-generation hardware for faster analytic performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
[Figure: Cloudera platform architecture. INTEGRATE: structured (Sqoop), unstructured (Kafka, Flume). PROCESS, ANALYZE, SERVE: batch (Spark, Hive, Pig, MapReduce), stream (Spark), SQL (Impala), search (Solr), other (Kite). UNIFIED SERVICES: resource management (YARN), security (Sentry, RecordService). STORE: filesystem (HDFS), relational (Kudu), NoSQL (HBase), other (object store).]
Ingest data of any type or volume
Process data as it arrives
Serve data to users and applications
Real-Time Data
Agenda
Drivers for agile, real-time data platforms
What are the key use cases driving businesses towards real-time platforms?
Data on adoption trends for real-time technologies
What is Forrester seeing in the market for real-time technologies?
Deploying a real-time OSS architecture to grow your business
How can you build a scalable, cost-effective platform to grow your business?
© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Michele Goetz
Special Guest Speaker
Principal Analyst Serving Enterprise Architecture Professionals
Superior CX depends on data and insights
Fraud and risk management requires real-time data
IoT heat map shows where data matters most, now
Data bottlenecks are catalysts for transition
Create a road map for a real-time, agile data platform
Leaders are focused on the technologies that allow data and insights to be consumed across the organization
What are your firm's plans for the following data driven initiatives?
Initiative | Expanding/Implemented | Planning to implement within the next 12 months
Creating an organizational center of excellence for business intelligence | 51% | 22%
Combine content management and data management programs into a unified information management program | 51% | 22%
Changing our processes to promote data stewardship and sharing | 51% | 22%
Investing in platforms to and share out data content | 51% | 22%
Creating a business led data stewardship or governance program | 51% | 22%
Changing management incentives to promote data sharing | 49% | 24%
Implementing analytics insights in software systems to aid customers or support employee decisions | 52% | 22%
Investing more in business friendly, self-service visualization and analytics | 52% | 23%
Engaging external services providers or strategic business consultants for data and analytics or insights services | 54% | 22%
Providing data preparation tools for self-service data management | 54% | 23%
Investing in distributed real time insight delivery technology | 58% | 22%
Base: 3005 global data and analytics decision-makers. Source: Business Technographics® Global Data & Analytics Survey, 2016
54% of data and analytics technology decision-makers increased their budgets for data and analytics from 2015 to 2016
Which of the following describes your [TDM=“IT budget data and analytics technology or services”; BDM=“business budget for data and analytics technology or services”] from 2015 to 2016?
Response | Share of respondents
Decrease by 5% to 10% | 4%
Don’t know | 5%
Decrease by 1-4% | 6%
Increase by more than 10% | 6%
Increase by 5% to 10% | 22%
Increase by 1-4% | 26%
Stay about the same | 30%
Total increasing | 54%
Base: 325 global data and analytics technology decision-makers. “Don’t know” not shown. Source: Business Technographics® Global Data & Analytics Survey, 2016
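The 54% headline can be checked against the chart values. The short Python sketch below assumes the bars pair with the response labels in the order they appear on the slide (an inference from the extracted text, not something the slide states); the dictionary name is illustrative.

```python
# Percentages as read from the slide's bar chart (label pairing is assumed).
budget_change = {
    "Decrease by 5% to 10%": 4,
    "Don't know": 5,
    "Decrease by 1-4%": 6,
    "Increase by more than 10%": 6,
    "Increase by 5% to 10%": 22,
    "Increase by 1-4%": 26,
    "Stay about the same": 30,
}

# Share of respondents whose budget grew by any amount.
increased = sum(
    pct for label, pct in budget_change.items() if label.startswith("Increase")
)
print(f"Increased budgets: {increased}%")
```

Under that pairing, the three "Increase" categories add up to the headline figure.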
Companies of all sizes are spending millions for data & analytics
Please estimate, in millions, how much your data and analytics budget is for 2016? (Note: Number is in US Dollars)
Budget | SMB (20-999 employees)* | Enterprise (1,000 or more employees)
Less than $1 million | 55% | 32%
$1 million to under $10 million | 22% | 30%
$10 million to under $100 million | 9% | 13%
$100 million to under $500 million | 1% | 4%
$500 million to under $1 billion | 1% | 2%
$1 billion to under $5 billion | 0% | 2%
$5 billion or more | 0% | 1%
Note: Don’t know excluded. Base: 765*, 1,288 global data and analytics decision makers. Source: Business Technographics® Global Data & Analytics Survey, 2016
Among the DM technologies Forrester tracks, interest for stream processing tools has grown the most YoY
What are your firm's plans to use the following data management technologies?
% with commitment (expanding, implemented, or planning to implement in the next 12 months)
Technology | 2015 | 2016 | Change
Stream processing tools | 59% | 64% | +5 p.p.
Inverted index database | 61% | 64% | +3 p.p.
Distributed NoSQL databases | 63% | 61% | -2 p.p.
Hadoop | 63% | 62% | -1 p.p.
Associative index databases | 60% | 58% | -2 p.p.
RDF, triple store | 59% | 56% | -3 p.p.
% with interest, but no immediate plans (two series of downward bars, in the technology order above): -20%, -19%, -19%, -20%, -19%, -19%; -13%, -13%, -16%, -14%, -14%, -13%
Base: 2094 and *1805 global data and analytics technology decision-makers. Source: Business Technographics® Global Data & Analytics Survey, 2016
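The percentage-point changes on this slide can be reproduced from the commitment figures. The sketch below assumes the twelve commitment values read as six 2015 figures followed by six 2016 figures, in the technology order listed; that pairing is an inference, chosen because it reproduces the slide's "+5 p.p." to "-3 p.p." labels exactly.

```python
technologies = [
    "Stream processing tools",
    "Inverted index database",
    "Distributed NoSQL databases",
    "Hadoop",
    "Associative index databases",
    "RDF, triple store",
]
# Assumed pairing of the extracted chart values (percent with commitment).
commitment_2015 = [59, 61, 63, 63, 60, 59]
commitment_2016 = [64, 64, 61, 62, 58, 56]

# Year-over-year change in percentage points for each technology.
changes = {
    tech: after - before
    for tech, before, after in zip(technologies, commitment_2015, commitment_2016)
}
for tech, delta in changes.items():
    print(f"{tech}: {delta:+d} p.p.")

# Technology with the largest gain in commitment.
fastest = max(changes, key=changes.get)
```

Stream processing tools come out with the largest gain, which matches the slide's headline.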
Streaming analytics high in the list of big data plans
Which of the following are included in your plans for big data?
Plan | Share of respondents
NoSQL other than Hadoop | 16%
A MPP (massively parallel processing) data warehouse | 18%
Semantic technologies (ontology building, search, auto curation, graph, etc.) | 22%
Hadoop (including HBase or Accumulo) | 23%
Data anonymization or de-identification | 23%
Creating or building out a data lake | 26%
Marketing or digital data management platforms and service providers that brand their offerings as big data | 26%
Packaged analytics technologies that brand themselves as big data | 27%
Unstructured data mining / analytics | 28%
Distributed in memory databases, grids, analytics tools | 30%
Streaming analytics / computing | 33%
Large scale predictive modeling, data mining or other advanced analytics | 36%
Public cloud big data services | 40%
Base: Total: 2094. Source: Business Technographics® Global Data & Analytics Survey, 2016
Trend Towards Real-Time Data Platforms is Clear
Drivers for Real-Time Platforms
• Enhancing customer experiences
• Risk Management
• Advancement of IoT and broader instrumentation
Adoption is Accelerating
• Top data-driven initiative by investment: distributed delivery of real-time data
• DM technology with highest momentum: stream processing
• Top big data plans: streaming analytics is top 3
• Broad, large investments: 90% of decision makers are either continuing or increasing their investments in data and analytics; millions/billions being spent
The Underlying Driver
What drives a use case to real-time?
Use cases, ordered from “Real Time” to “Some Latency Acceptable”: High Frequency Trading, APT Detection, Fraud Detection, Predictive Maintenance, Next Best Offer, Inventory Management, Shipping/Logistic Systems, CRM Systems, Employee Management, Strategic Planning
Real-time data management use cases are defined by a common set of characteristics.
• Narrow time window in which to make a decision (automated or manual)
• Opportunity for the data points to change the decision path
• Decreasing value of data over time
Not all use cases have a pressing need for real-time data.
• Broader strategic decisions, for example, do not require real-time data input
• Over time, decreases in HW costs and increases in availability of real-time systems will lead most use cases to be conducted in real-time
Moving to Real-Time and Leveraging Analytics
What do we have to gain?
Traditional Architectures: “Monitoring System”
Sensors are automatically monitored and programmed to deliver warnings when readings are delivered outside of an “optimal zone”. Basic models developed over small subsets of data.
Real-Time Analytic Capabilities: “Predictive System”
Ingestion and processing of all sensor data into an unlimited data store with analytic capabilities enables machine learning, which can provide automated optimization and predictive maintenance.
“Only 1 percent of data from an oil rig with 30,000 sensors is examined. The data that are used today are mostly for anomaly detection and control, not optimization and prediction, which provide the greatest value.” - McKinsey & Company
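The difference between a “Monitoring System” and a “Predictive System” can be illustrated with a toy Python sketch. Everything here is hypothetical: the zone bounds, the readings, and the use of a simple least-squares trend line as a stand-in for the machine learning the slide refers to.

```python
OPTIMAL_ZONE = (40.0, 80.0)  # hypothetical bounds for one sensor


def monitoring_alert(reading, zone=OPTIMAL_ZONE):
    """'Monitoring system': warn only once a reading is already out of range."""
    low, high = zone
    return not (low <= reading <= high)


def predicted_steps_to_breach(history, zone=OPTIMAL_ZONE):
    """'Predictive system' (toy): fit a least-squares line to evenly spaced
    readings and estimate how many more steps until the upper bound is crossed.
    Only the upper bound is considered; returns None if there are fewer than
    two readings or the trend is flat or falling."""
    n = len(history)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope_den = sum((x - mean_x) ** 2 for x in range(n))
    slope = slope_num / slope_den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_step = (zone[1] - intercept) / slope
    return max(0.0, crossing_step - (n - 1))


readings = [60.0, 62.0, 64.0, 66.0, 68.0]  # hypothetical, steadily rising
alert_now = monitoring_alert(readings[-1])
steps_left = predicted_steps_to_breach(readings)
print(alert_now, steps_left)
```

The threshold check stays silent because every reading is still inside the zone, while the trend estimate already indicates when the bound is likely to be crossed. A real predictive system would use far more data and richer models than a single trend line.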
Ingestion in Real-Time
Stream Ingestion is a Must for Many Use Cases
Ingestion isn’t just about internal business data anymore.
• Traditional ingestion was internally focused, and often a matter of moving data from one silo or system to another
• Today, businesses aim to take in data from a variety of external sources, IoT sensors, and machine-generated (user/network) data
Your data journey can’t start until the data arrives.
• Each step of the ingest/process/serve data pipeline must occur at real-time speed if decisions are to be made in time to affect the course of business
Visualization helps practitioners understand their data.
• Complex tasks can be made less complex via graphical representations; data ingestion is no different
Ingestion at Cloudera
• Apache Sqoop for data from relational databases
• Apache Flume for logs and event-based data
• Apache Kafka for fast, scalable, and fault-tolerant messaging
• Partners, such as StreamSets, provide rich visualization tools
Processing in Real-Time
Unlocking Value at Speed
For some use cases, batch just isn’t enough.
• Batch processing can lead to bottlenecks and delays in data transformations that cause missed opportunities.
Apache Spark is gaining momentum for a reason.
• Leveraging Apache Spark for stream processing enables real-time use cases with sub-second latency and best-in-class APIs.
Spark has a best-in-class ecosystem.
• Machine learning (via MLlib) is seamlessly integrated into Spark.
• Broadest set of vendors and contributors working on Spark among available processing engines, leading to rapid innovation.
Stream Processing at Cloudera
• Spark Streaming, the leading open-source framework for real-time use cases, is deployed in Cloudera’s real-time architectures.
• Cloudera has the broadest base of Hadoop-adjacent experience with Spark and integrating it with Apache components.
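To see why batch schedules delay decisions, the following Python sketch computes how long an event waits before the next scheduled run picks it up. It is a back-of-the-envelope model only: the arrival times and the one-hour and one-second intervals are hypothetical, processing time itself is ignored, and no Spark API is involved.

```python
import math


def wait_until_processed(arrival_s, interval_s):
    """Seconds an event waits if processing starts only at multiples of interval_s."""
    next_run = math.ceil(arrival_s / interval_s) * interval_s
    return next_run - arrival_s


# Hypothetical event arrival times, in seconds after the start of the hour.
arrivals = [5.25, 1800.5, 3599.75]

hourly_batch = [wait_until_processed(t, 3600) for t in arrivals]
micro_batch = [wait_until_processed(t, 1) for t in arrivals]

print("worst wait, hourly batch:", max(hourly_batch), "s")
print("worst wait, 1 s micro-batch:", max(micro_batch), "s")
```

With an hourly schedule the wait depends almost entirely on when the event happens to arrive, whereas with a short interval the wait is bounded by the interval itself.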
Serving in Real-Time
Inject Data into Real-Time Decisions
You need options that suit your use case.
• Platform proliferation hurts IT departments as skillsets are divided; fewer platforms with broad capabilities help.
Apache Kudu changes the game for open source software.
• Until Kudu, combining real-time serving with analytic scans through a relational database required a complex lambda architecture
• Together, simplification and affordability should drive more use cases to real-time automated processes, in turn driving increased revenue, decreased risk, and better service for companies deploying Kudu
Data Serving at Cloudera
• Apache Kudu provides batch analysis and real-time serving within the same storage layer
• Apache HBase yields the best read/write performance
• Cloudera Search enables SQL-like faceted search in natural language
• Apache Kafka can be used to serve data to applications and users
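The simplification claimed here can be sketched with two toy in-memory stores in Python. Neither class uses the Kudu client API or any real lambda framework; the class and method names are hypothetical and exist only to contrast "two layers reconciled at query time" with "one storage layer that accepts updates and serves scans".

```python
class LambdaStore:
    """Toy lambda architecture: a batch view plus a speed layer."""

    def __init__(self):
        self.batch_view = {}   # rebuilt periodically by a batch job
        self.speed_layer = {}  # recent writes not yet in the batch view

    def write(self, key, value):
        self.speed_layer[key] = value

    def run_batch_job(self):
        # Fold recent writes into the batch view, then clear the speed layer.
        self.batch_view.update(self.speed_layer)
        self.speed_layer.clear()

    def scan(self):
        # Every query has to reconcile both layers.
        merged = dict(self.batch_view)
        merged.update(self.speed_layer)
        return merged


class UpdatableStore:
    """Toy single storage layer that takes upserts and serves scans directly."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        self.rows[key] = value

    def scan(self):
        return dict(self.rows)


lam = LambdaStore()
lam.write("cust-1", 10)
lam.run_batch_job()
lam.write("cust-1", 12)  # update arrives after the batch job
lam.write("cust-2", 7)

single = UpdatableStore()
for key, value in [("cust-1", 10), ("cust-1", 12), ("cust-2", 7)]:
    single.upsert(key, value)

print(lam.scan() == single.scan())
```

Both stores return the same answer, but the lambda version needs a batch job, a speed layer, and merge logic to get there, which is the operational overhead the slide argues a single updatable analytic store removes.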
Apache Kudu: Filling the Analytic Gap
[Figure: storage engines plotted by Pace of Data (Unchanging, Append-Only, Fast Changing/Frequent Updates) against Pace of Analysis (up to Real-Time). Labels: HDFS, fast scans, analytics and processing of stored data; arbitrary storage (active archive); HBase, fast on-line updates & data serving; Kudu, fast analytics (on fast-changing or frequently-updated data), placed in the "Analytic Gap".]
Kudu fills the Gap
Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS
Real-Time Data Analysis at Work
Customer 360 “Next Best Offer 2.0”
[Figure: pipeline from Application and Data Sources through Kafka and Spark Streaming to Kudu, with Spark MLlib (Individual Session) and Spark (Full Model/Learning). Chart note: Illustrative, models will likely have >2 dimensions.]
• Customer Interaction
• Data Request Sent For Stream Processing
• Data Cleaned/Ordered/Processed, Then Delivered to Kudu for Modelling
• User’s navigation returns the results they are looking for, in addition to offers and suggestions hyper-customized for them.
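The flow on this slide (ingest, clean and order, store, model, serve an offer) can be walked through with a small pure-Python stand-in. No Kafka, Spark, or Kudu API is used; the event fields, function names, and the simple "customers who share an item" scoring rule are all hypothetical placeholders for the real components.

```python
from collections import Counter, deque

# Stand-in for a Kafka topic: raw events, possibly out of order or malformed.
topic = deque([
    {"customer": "c1", "ts": 2, "item": "dock"},
    {"customer": "c1", "ts": 1, "item": "laptop"},
    {"customer": "c2", "ts": 3, "item": "laptop"},
    {"customer": "c2", "ts": 4, "item": "dock"},
    {"customer": "c2", "ts": 5, "item": "router"},
    {"customer": None, "ts": 6, "item": "dock"},  # dropped during cleaning
    {"customer": "c3", "ts": 7, "item": "laptop"},
])


def stream_process(events):
    """Stand-in for stream processing: drop invalid events, order by timestamp."""
    valid = [e for e in events if e["customer"] and e["item"]]
    return sorted(valid, key=lambda e: e["ts"])


def store(events, table):
    """Stand-in for the storage layer: upsert each event keyed by (customer, ts)."""
    for e in events:
        table[(e["customer"], e["ts"])] = e["item"]


def next_best_offer(table, customer):
    """Stand-in for the model: suggest the item most often held by other
    customers who share at least one item with this customer."""
    baskets = {}
    for (cust, _ts), item in table.items():
        baskets.setdefault(cust, set()).add(item)
    mine = baskets.get(customer, set())
    votes = Counter()
    for cust, items in baskets.items():
        if cust != customer and items & mine:
            votes.update(items - mine)
    return votes.most_common(1)[0][0] if votes else None


table = {}
store(stream_process(topic), table)
offer = next_best_offer(table, "c3")
print(offer)
```

In the architecture on the slide, each stand-in would be replaced by the corresponding component, and the model would be trained on many more dimensions than this two-field example.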
Machine Learning
Kudu opens the door to machine learning
Kudu provides the ability to leverage real-time updates and analytic scans together, which is critical for many machine learning applications.
Source: GHOSTS IN THE MACHINE: Artificial intelligence, risks and regulation in financial markets
The Time for Real-Time Data and Analytics is Now.
And the platform for it is Cloudera Enterprise.