© 2013 IBM Corporation June 25, 2013
InfoSphere Streams
Agenda
Value Proposition to the Business
Technical Overview
– Product Components
– Installation Requirements
– Streams Application Development Overview
• Streams Processing Language (SPL)
• Toolkits
• Streams Studio
– Runtime Architecture
– Monitoring and Managing
– Publishing SPL applications
Alignment with Big Data Strategy
Skill Set Required for this type of application
Use Cases
What can Streams do for you?
Analyze and react to events as they are happening
Take advantage of more sources of data in “true” real time
Build models on your most up-to-the-second information that will help predict what happens next
Streams is middleware and a language for building and running analytic applications operating on data in motion
– Scale – easily handles a few events per second through multiple millions of events per second
– Reaction time – possible to get actionable results in much less than a second (< 20 microseconds possible)
Enables TRUE situational awareness
New Era of Computing Requires
Information from Everywhere, Radical Flexibility, Extreme Scalability
Volume: 12 terabytes of Tweets created daily
Velocity: 5 million trade events per second
Variety: 100s of video feeds from surveillance cameras
InfoSphere Streams – In-motion Analytics on Big Data
InfoSphere Streams covers:
– Volume
• When scaled, has the ability to handle terabytes per second
– Variety
• Ability to handle all kinds of formats, invoking analytics on this incoming data
– Velocity
• Microsecond latency
[Figure: traditional and non-traditional data sources feed InfoSphere Streams (millions of events per second, microsecond latency), which delivers powerful analytics in real time. Example applications: algo trading, telco churn prediction, smart grid, cyber security, government / law enforcement, ICU monitoring, environment monitoring.]
When results are required in less than seconds, not hours
IBM InfoSphere Streams v3.1
Scale-out Architecture
• Clustered runtime for near-limitless capacity
• Large scale deployment
• RHEL v5.3 and above
• CentOS v6.0 and above
• X86 & Power multicore HW
• SUSE Linux Enterprise Server 11.2 and above
• InfiniBand support
• Ethernet support
Comprehensive Development Tools
• Eclipse IDE
• Web console
• Drag & drop editor
• Instance graph
• Streams visualization
• Streams debugger
• HA app development
• Java improvements
• Mapped operators
Sophisticated Analytics with Toolkits & Accelerators
• CEP, Database, Data Explorer, DataStage, Finance, SPSS, R Geospatial, Internet, Mining, Messaging with JMS adapter, Standard, Text, Time Series Toolkits
• Telco, Machine & Social Data Accelerators
Front Office 3.0
InfoSphere Streams – Product and License Information
Current major version of InfoSphere Streams is v3.1
– Released May 2013
Available in 3 editions:
– InfoSphere Streams for production environments, licensed via Resource Value Units (RVUs) based on activated processor cores
– InfoSphere Streams for Non-Production Environment – For development purposes, licensed via RVUs
– InfoSphere Streams Developer Edition – For development, licensed via Authorized Users (AU)
Stream Computing – Analyze Data in Motion
Traditional Computing: historical fact finding
– Find and analyze information stored on disk
– Batch paradigm, pull model
– Query-driven: submits queries to static data
– Query → Data → Results
Stream Computing: current fact finding
– Analyze data in motion, before it is stored
– Low latency paradigm, push model
– Data driven: bring the data to the query
– Data → Query → Results
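The difference between the pull model and the push model can be sketched in a few lines of Python. This is an illustration only, not Streams or SPL code; `pull_query`, `StandingQuery`, and the sample readings are hypothetical names and values chosen for the example.

```python
from typing import Callable, Iterable, List


def pull_query(stored: Iterable[float], threshold: float) -> List[float]:
    """Traditional model: data is at rest and a query is run against it on demand."""
    return [x for x in stored if x > threshold]


class StandingQuery:
    """Stream model: the query stays in place and each arriving value is pushed through it."""

    def __init__(self, threshold: float, on_match: Callable[[float], None]) -> None:
        self.threshold = threshold
        self.on_match = on_match

    def push(self, value: float) -> None:
        # Evaluated once per arriving value, before anything is stored.
        if value > self.threshold:
            self.on_match(value)


# Pull: query -> data -> results
stored_readings = [12.0, 48.5, 7.2, 91.3]
batch_results = pull_query(stored_readings, threshold=40.0)

# Push: data -> query -> results
stream_results: List[float] = []
query = StandingQuery(threshold=40.0, on_match=stream_results.append)
for reading in [12.0, 48.5, 7.2, 91.3]:
    query.push(reading)
```

Both paths select the same values here; what differs is when the work happens. The standing query reacts as each reading arrives instead of waiting for the data to be stored and queried.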
InfoSphere Streams – How it works
Achieve scale:
– By partitioning applications into software components
– By distributing across stream-connected hardware hosts
Infrastructure provides services for
– Scheduling analytics across hardware hosts
– Establishing streaming connectivity
Where appropriate:
– Elements can be fused together for lower communication latency
[Figure: continuous ingestion feeding continuous analysis through components labeled Transform, Filter / Sample, Classify, Correlate, and Annotate]
Scalable Stream Processing
InfoSphere Streams provides:
– a programming model for defining data flow graphs consisting of data sources (inputs), operators, and sinks (outputs)
– controls for fusing operators into processing elements (PEs)
– infrastructure to support the composition of scalable stream processing applications from these components
– deployment and operation of these applications across distributed processing nodes
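A data flow graph of source, operators, and sink can be approximated with Python generators. This is a minimal sketch of the programming model only, not SPL; the function names (`source`, `filter_op`, `functor_op`, `sink`) and the meter readings are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def source(records: Iterable[Record]) -> Iterator[Record]:
    """Source: emits tuples into the graph."""
    yield from records


def filter_op(stream: Iterable[Record], predicate: Callable[[Record], bool]) -> Iterator[Record]:
    """Operator: passes through only the tuples that satisfy the predicate."""
    for t in stream:
        if predicate(t):
            yield t


def functor_op(stream: Iterable[Record], transform: Callable[[Record], Record]) -> Iterator[Record]:
    """Operator: emits a new tuple for every input tuple."""
    for t in stream:
        yield transform(t)


def sink(stream: Iterable[Record], out: List[Record]) -> None:
    """Sink: drains the stream into an output list."""
    for t in stream:
        out.append(t)


# Wire the graph: source -> filter -> functor -> sink
readings = [
    {"meter": "m1", "kwh": 1.5},
    {"meter": "m2", "kwh": 0.0},
    {"meter": "m3", "kwh": 4.0},
]
nonzero = filter_op(source(readings), lambda t: t["kwh"] > 0)
in_wh = functor_op(nonzero, lambda t: {"meter": t["meter"], "wh": t["kwh"] * 1000})
results: List[Record] = []
sink(in_wh, results)
```

Each stage only knows about the stream it consumes and the stream it produces, which is what lets a runtime place the stages on different hosts.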
Installation Requirements
Hardware:
– x86 (64-bit) or IBM POWER7 (64-bit)
– Minimum 1GB of RAM
– Disk space 4GB
Software:
– Operating System:
• Red Hat Enterprise Linux 5.3 or later
• Red Hat Enterprise Linux 6.1 or later
• CentOS 5.3 or later, 6.1 or later (64-bit)
• SUSE Linux Enterprise Server (SLES) V11.2 or later
– Java version 6
• IBM Java SE version 6 SDK is included in Streams
Minimum Streams configuration: one or three nodes
– One node: entry-level performance, no redundancy
– Three nodes: Redundancy, fail-over for high availability
– Add additional nodes, one at a time
• Increase performance and availability
Getting Started is Quick and Easy
First Steps
– Guides user through post install setup steps
– Verify installation and configuration
– Create and manage Streams Instances
– Go from install to running instance in a few clicks
– Install and configure Streams Studio
– Access to all links to InfoSphere Streams web sites
Streams Studio
– Eclipse-ready tool that enables you to create, edit, visualize, test, debug and run Streams applications
– All required packages included in install
Creating Streams Applications - Streams Processing Language (SPL)
Designed for stream computing
– Define a streaming-data flow graph
– Rich set of data types to define tuple attributes
Declarative
– Operator invocations name the input and output streams
– Referring to streams by name
Procedural support
– Full-featured C++/Java-like language
– Custom logic in operator invocations
– Expressions in attribute assignments and parameter definitions
Extensible
– User-defined data types
– Custom functions written in SPL or a native language (C++ or Java)
– User-defined operators written in C++ or Java
InfoSphere Streams Objects – Development View
Operator: Fundamental concept of SPL
– Processes data from other streams and can produce new streams
Stream: data flow between any two operators
– Tuple: data record in the stream, with a fixed set of attributes
– Attribute: variable
– Schema: describes attributes in a tuple
– An operator reads in a stream but outputs a different stream
– Streams do not survive past an operator boundary; the data may, but in a new stream
[Figure: a Streams application with labeled operator, stream, tuple, and attribute. One stream carries tuples with attributes directory and filename ("/img" "farm"; "/img" "bird"; "/opt" "java"; "/img" "cat"); another carries tuples with attributes height, width, and data (640 × 480; 1280 × 1024; 640 × 480).]
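The relationship between schema, tuple, attribute, and operator can be sketched in Python with typed records. This is not SPL. It assumes, based on the slide's figure, that an operator turns file-reference tuples into image tuples and drops references outside "/img"; the names `FileRef`, `Image`, and `load_images` are hypothetical, and the image size and payload are placeholders.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FileRef(NamedTuple):
    """Schema of the input stream: two string attributes."""
    directory: str
    filename: str


class Image(NamedTuple):
    """Schema of the output stream: a different set of attributes."""
    height: int
    width: int
    data: bytes


def load_images(refs: Iterable[FileRef]) -> Iterator[Image]:
    """Operator: consumes FileRef tuples and emits new Image tuples.

    Placeholder logic: a real operator would read and decode the file.
    """
    for ref in refs:
        if ref.directory == "/img":  # assumed filter, inferred from the figure
            payload = f"{ref.directory}/{ref.filename}".encode()
            yield Image(height=640, width=480, data=payload)


refs = [
    FileRef("/img", "farm"),
    FileRef("/img", "bird"),
    FileRef("/opt", "java"),
    FileRef("/img", "cat"),
]
images: List[Image] = list(load_images(refs))
```

The input stream ends at the operator. What leaves it is a new stream with its own schema, even though the data in it derives from the input.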
Toolkits
Standard Toolkit
– Relational Operators: Filter, Sort, Functor, Join, Punctor, Aggregate
– Adapter Operators: FileSource, UDPSource, FileSink, UDPSink, DirectoryScan, Export, TCPSource, Import, TCPSink, MetricsSink
– Utility Operators: Custom, Split, Beacon, DeDuplicate, Throttle, Union, Delay, ThreadedSplit, Barrier, DynamicFilter, Pair, Gate, JavaOp
Database Toolkit
– ODBCAppend, ODBCEnrich, ODBCSource, SolidDBEnrich, DB2SplitDB, DB2PartitionedAppend
– Supports: DB2 LUW, IDS, solidDB, Oracle, SQL Server, MySQL, PureData for Analytics
Internet Toolkit
– InetSource: HTTP, FTP, HTTPS, FTPS, RSS, file
Financial Toolkit
Data Mining Toolkit
Big Data toolkit
Text Toolkit
TimeSeries Toolkit
R Geospatial Toolkit
Complex Event Processing Toolkit
Messaging Toolkit with JMS adapter
User-Defined Toolkits
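To give a feel for what a relational operator such as Aggregate does, here is a rough Python sketch of a count-based tumbling window average. It illustrates the idea only and does not reproduce the Standard Toolkit operator's parameters or semantics; `tumbling_average` is a hypothetical name.

```python
from typing import Iterable, Iterator, List


def tumbling_average(values: Iterable[float], size: int) -> Iterator[float]:
    """Emit the mean of every `size` consecutive values, then start a fresh window."""
    window: List[float] = []
    for v in values:
        window.append(v)
        if len(window) == size:
            yield sum(window) / size
            window = []  # tumbling: the window is emptied after each result
    # A trailing partial window is discarded in this sketch.


averages = list(tumbling_average([1, 2, 3, 4, 5, 6, 7], size=3))
```

Because the window is emptied after each result, every input value contributes to at most one output.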
SPL Development with InfoSphere Streams Studio
InfoSphere Streams Studio is an Eclipse-based tool that enables developers to create, edit, visualize, test, debug, and run SPL and SPL mixed-mode applications.
Streams Studio consists of the following major features:
– SPL Project and SPL Application Set Project support
– SPL editor
– Toolkit model, operator model, and function model editors
– Streams Explorer
– Visualizer
– Launchers for standalone and distributed applications
– Project Explorer
– Graph view
– Metrics view
– Log viewing support
– BigData Task Launcher
– Debugger
Application graph view
Streams Studio – New Features
Team support
– Supports Eclipse team APIs
– SPL projects can now be shared
– Can now check SPL resources in and out of source code control systems
– Validated with Rational Team Concert and Subversion
Project Explorer
– Refactoring: clone, delete, rename
– New wizards available for XML parse and DataStage
SPL Editor
– Improved syntax highlighting
– XML Support
– Find references support
Publish
– Select a build configuration and publish
– Wizard lets you specify location and optionally an instance
Streams Studio – Drag & Drop Graphical Editor
Build applications by dragging & dropping operators
Palette of operators provided by toolkits
Define Streams by connecting operator ports
Graphical & SPL source code views automatically synchronized
Application Graph
A compiled view of your application topology
Available in Streams Studio and Streams Console
Consistent with editor
Customize the application graph views:
– PEs, Byte, Tuple
InfoSphere Streams Objects – Runtime View
Instance
– Runtime instantiation of Streams Engine that runs across 1 or more hosts (nodes)
– A collection of components and services
Processing Element (PE)
– Fundamental execution unit
– PEs can have one operator or many operators fused together by the Streams compiler
Job
– A deployed Streams application executing in an instance
– Consists of one or more PEs
[Figure: an instance containing a job whose PEs run on two nodes; operators inside the PEs are connected by Streams 1 through 5]
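Why fusing operators into one PE lowers communication latency can be illustrated with a small Python sketch. It is an analogy, not the Streams runtime: direct function calls stand in for fused operators, and two threads joined by a queue stand in for separate PEs connected by a transport. All names are hypothetical.

```python
import queue
import threading
from typing import List


def double(x: int) -> int:
    return x * 2


def add_one(x: int) -> int:
    return x + 1


def run_fused(inputs: List[int]) -> List[int]:
    """Both operators in one PE: tuples pass by direct function call."""
    return [add_one(double(x)) for x in inputs]


def run_unfused(inputs: List[int]) -> List[int]:
    """One operator per PE: every tuple crosses a channel between the two."""
    channel: "queue.Queue[object]" = queue.Queue()
    end = object()  # sentinel marking the end of the stream
    results: List[int] = []

    def pe1() -> None:
        for x in inputs:
            channel.put(double(x))
        channel.put(end)

    def pe2() -> None:
        while True:
            item = channel.get()
            if item is end:
                break
            results.append(add_one(item))

    threads = [threading.Thread(target=pe1), threading.Thread(target=pe2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


fused = run_fused([1, 2, 3])
unfused = run_unfused([1, 2, 3])
```

The results are identical. The unfused version pays for a hand-off on every tuple, while the fused version gives up the ability to place the two operators on different hosts.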
InfoSphere Streams – Runtime
Optimizing scheduler assigns jobs to hosts, and continually manages resource allocation
Commodity hardware – laptop, blades or high performance clusters
[Figure: an application graph spread across four x86 hosts, with operators labeled Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Season Adjust, Daily Adjust, and Temp Action]
InfoSphere Streams – Runtime
Optimizing scheduler assigns PEs to hosts, and continually manages resource allocation
Commodity hardware – laptop, blades or high performance clusters
Dynamically add hosts and jobs
New jobs work with existing jobs
[Figure: the application graph now spread across five x86 hosts. Existing operators: Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Season Adjust, Daily Adjust, Temp Action. Added operators: Degree History, Compare History, Store History, plus further Meters, Season Adjust, Daily Adjust, and Text Extract operators.]
Streams Runtime Includes High Availability
PEs on failing nodes can be moved automatically, with communications re-routed
PEs on busy nodes can be moved manually by the Streams administrator
[Figure: five x86 nodes, each hosting a Processing Element Container]
Streams Runtime Node Pools facilitate High Availability
HA application design pattern
• Source job exports stream, enriched with tuple ID
• Jobs 1 & 2 process in parallel, and export final streams
• Sink job imports streams, discards duplicates, alerts on missing tuples
[Figure: five x86 nodes, each hosting a Processing Element Container, grouped into node pools 1 to 4; the Source job, the replicated Job 1 and Job 2, and the Sink job are placed across the pools]
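The sink side of this HA pattern (import redundant streams, discard duplicates, report missing tuples) can be sketched in Python. This is a simplified batch illustration, not Streams code: a real sink would process interleaved tuples continuously and would need a timeout policy to decide when a tuple counts as missing. `merge_redundant` and the sample data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

TaggedTuple = Tuple[int, str]  # (tuple ID assigned by the source job, payload)


def merge_redundant(
    streams: Iterable[Iterable[TaggedTuple]],
    expected_ids: Iterable[int],
) -> Tuple[List[TaggedTuple], List[int]]:
    """Merge redundant streams by tuple ID.

    Returns the de-duplicated tuples in arrival order and the sorted list of
    expected IDs that no stream delivered.
    """
    seen: Set[int] = set()
    merged: List[TaggedTuple] = []
    for stream in streams:
        for tuple_id, payload in stream:
            if tuple_id in seen:
                continue  # duplicate from the redundant job: discard
            seen.add(tuple_id)
            merged.append((tuple_id, payload))
    missing = sorted(set(expected_ids) - seen)
    return merged, missing


# Job 1 lost tuple 3, Job 2 lost tuple 2, and neither delivered tuple 5.
job1_out = [(1, "a"), (2, "b"), (4, "d")]
job2_out = [(1, "a"), (3, "c"), (4, "d")]
merged, missing = merge_redundant([job1_out, job2_out], expected_ids=range(1, 6))
```

As long as at least one of the parallel jobs delivers a given tuple ID, the sink sees it exactly once; only IDs lost by every job are reported as missing.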
Streams Console – Application Repository
Publish Streams apps to the Console
Launch Streams Apps from the Streams console
Monitor & manage launched Streams apps
Visualize stream data for launched Streams apps
Monitoring and Managing InfoSphere Streams
Use Streams Studio
– The Streams Explorer view enables you to manage your Streams environment.
• View available hosts
• View and update the location of the InfoSphere Streams installation
• Manage Streams instances and jobs
• View and edit toolkit locations
• Auto refresh
• Enhanced properties view (metrics, data)
Monitoring and Managing InfoSphere Streams
Use Streams Studio (continued)
– “Instance Graph” available in Streams Studio and the Web Console
– Visual monitoring of application health and metrics
– Quickly identify issues using customizable views
• Job, PE, Operator and Host containment views
• Configurable metric based coloring schemes
– Support for filtering, layout, search, hover
– Record a session
Monitoring and Managing InfoSphere Streams
Use Streams Console – web-based graphical user interface that is provided by the Streams Web Service (SWS)
Overall instance status
Running jobs status
Streams Visualization
Easily visualize streams data
Dynamically add new views to running applications
Views exposed to 3rd party charting
Integration with BigInsights & Data Explorer
Charts provided out of the box
– Line graph, bar chart, table
Views can filter & buffer data
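What it means for a view to filter and buffer stream data can be sketched in Python with a bounded buffer. This is an illustration of the concept only and says nothing about how Streams views are implemented or configured; the `View` class and its methods are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Record = Dict[str, float]


class View:
    """Keeps the most recent `capacity` tuples that satisfy a predicate."""

    def __init__(self, predicate: Callable[[Record], bool], capacity: int) -> None:
        self.predicate = predicate
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.buffer: Deque[Record] = deque(maxlen=capacity)

    def offer(self, record: Record) -> None:
        if self.predicate(record):
            self.buffer.append(record)

    def snapshot(self) -> List[Record]:
        """What a chart would draw at this moment."""
        return list(self.buffer)


view = View(predicate=lambda r: r["temp"] > 30.0, capacity=2)
for temp in [25.0, 31.0, 35.0, 40.0]:
    view.offer({"temp": temp})
snapshot = view.snapshot()
```

Bounding the buffer keeps memory use constant regardless of how long the application runs.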
Publish an SPL application
Select a build configuration and publish
Wizard lets you specify location and optionally an instance
– Zip file containing resources required to launch created at location
– If instance specified, application registered with instance
Can now launch application from Streams Console
Skills to Run a Streams Application
Experience with the following is suggested for running a Streams application:
– Linux Operating System (preferably RHEL or CentOS)
– Eclipse IDE (Streams Studio)
– Streams Processing Language (SPL)
– Other IBM big data products in case of integration
Streams 3.1 Enhancements: Performance
Improved Java Performance
– Reduced data copying
Performance improvements for operations using maps and lists
– 2x to 10x improvement for bounded maps and lists
Streams 3.1 Enhancements: Administration
RPM Package Manager file for easier installation
– Simplicity, consistency, automation
– Simpler install for large-scale deployment
Support for SUSE Linux Enterprise Server (SLES)
– Additional O/S
Simplified instance recovery
– Existing recovery command now automatically restarts services in the right order
Data transport subnetwork
– Separate runtime management traffic from data traffic
– Keeps system management controls responsive under high load
Streams 3.1 Enhancements: Administration (cont’d)
Support for Apache ZooKeeper
– Highly resilient name service (optional)
– No shared file system requirement
– Improves availability
– Also used by BigInsights
Simplified restart of application components
– Improves placement decisions, host load balancing
– Restart collocated PEs automatically
REST API support
– Get state and metrics information
– For selected popular services
Streams Console visual enhancements
– OneUI, IDX widgets
– Consistent with other IBM offerings
Streams 3.1 Enhancements: Development & Analytics
SPLDOC comments in SPL for documentation
– Analogous to Javadoc
– Generate documentation for SPL toolkits and apps
New Time Series Toolkit operators
– Incremental interpolation: replace missing values
– Distribution: compute quartiles, median
– CrossCorrelate
– LPC: Linear Predictive Coding
Support for R analytics
– Statistics, mining, modeling
– Apply native R model-based scoring
[Figure: SPLDOC comment excerpt, truncated in the source]
* @input InputStream The name of the stream to c…
* @param minuteSink String identifying the file nam…
* @param totalSink String identifying the file name…
Streams 3.1 Enhancements: Integration
Teradata and Aster support
– Additional data stores
JMS support
– Java Message Service
– Supports WebSphere MQ, Apache ActiveMQ
For Big Data, InfoSphere Streams is the Clear Choice
[Figure: IBM Big Data Platform. Analytic applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) sit on top of the platform, which comprises Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, and Information Integration & Governance.]
Big Data in Real-Time
– Ingest, analyze and act on massive volumes of streaming data
– Faster AND more cost-effective for specific use cases
Fit for purpose Streaming Analytics
– Analyzes a variety of data types, in their native format – text, geospatial, time series, video, audio & more
Enterprise Class
– Reliable, scalable, secure
– Ease of use with admin and development UIs
Integration
– Fits into existing information management architecture
– Pre-integrated analytic applications
Use Case Example
Telecommunications:
– Simultaneous processing, filtering, and analysis of Call Detail Records (CDRs) in real time
University of Ontario Institute of Technology (UOIT) Detects Neonatal Patient Symptoms Sooner
• Performing real-time analytics using physiological data from neonatal babies
• Continuously correlates data from medical monitors to detect subtle changes and alert hospital staff sooner
• Early warning gives caregivers the ability to proactively deal with complications
Significant benefits:
• Helps detect life threatening conditions up to 24 hours sooner
• Lower morbidity and improved patient care
Capabilities Utilized:
Stream Computing
“Helps detect life threatening conditions up to 24 hours sooner”
TerraEchos Turns to IBM Big Data for Low Latency Surveillance Data Analysis
• Deployed security surveillance system to detect, classify, locate, and track potential threats at highly sensitive national lab
• Stream computing collects and analyzes acoustic data from fiber-optic sensor arrays
• Analyzed acoustic data fed into TerraEchos intelligence platform for threat detection, classification, prediction & communication
Significant benefits:
• Enables TerraEchos solution to analyze and classify streaming acoustic data in real-time
• Provides lab & security staff with holistic view of potential threats & non-issues
• Enables a faster and more intelligent response to any threat
Capabilities Utilized:
Stream Computing
“Identifies and classifies potential security threats – miles away”
Marine Institute of Ireland Monitors Buoy Sensor Data to Detect Floods Sooner
• Large amounts of sensor data are collected from buoys in local bay
• Continuously monitors environmental, pollution, and local marine life
• Information streamed to institute’s central monitoring and analysis system
• Users can access, aggregate, analyze and set up automated alerts using web portal
Significant benefits:
• Prove adherence to regulations protecting marine mammals
• Faster and more accurate flood prediction
• Pollution and location-based debris tracking to increase public safety
Capabilities Utilized:
Stream Computing
“Monitor and protect marine mammal life in real-time”
Questions?
E-mail: [email protected]
Subject: Big data bootcamp