www.pervasivebigdata.com Pervasive Partner Presentation KNIME + DataRush Mike Hoskins, GM - Pervasive Big Data KNIME Conf, Zurich Technopark, 1 Feb 2012


www.pervasivebigdata.com

Pervasive Partner Presentation

KNIME + DataRush Mike Hoskins, GM - Pervasive Big Data

KNIME Conf, Zurich Technopark, 1 Feb 2012


Big Data Pipeline

Participants shown: Data Scientists, Data Analysts, Business Analysts, Decision Makers, Operational Intelligence, Data Integrators, App Developers

• Collect: monitor, log, ingest, event capture, decrypt

• Prepare: profile, match, cleanse, aggregate, audit

• Analyze: sample, model, discover, visualize, predict

• Consume: report, chart, dashboard, alert, closed loop


Big Data Challenges

Volume

• Collect: monitor, log, ingest, event capture, decrypt

• Prepare: profile, match, cleanse, aggregate, audit

• Analyze: sample, model, discover, visualize, predict

• Consume: report, chart, dashboard, alert, closed loop


www.pervasivebigdata.com

Pervasive DataRush


Full Core and Memory Utilization

Legacy Applications

• Single Threaded

• In-Memory

DataRush

• Dynamic Scaling Multi-Threaded

• Full Resource Utilization

• Data Flow

• Overcome Memory Heap Sizes


© Copyright 2011 Pervasive Software. All rights reserved

Auto-Scaling

[Chart: run time in minutes by core count]

• 2 cores: 370.0 minutes

• 4 cores: 192.4 minutes (3.2 hours)

• 8 cores: 90.3 minutes (1.5 hours)

• 16 cores: 51.6 minutes (under 1 hour)

• 32 cores: 31.5 minutes
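The run times on the Auto-Scaling chart can be turned into speedup and parallel-efficiency figures. The sketch below is only arithmetic on the five published data points. The 2-core run is taken as the baseline, which is an assumption because the slide shows no single-core time, and the function name is illustrative.

```python
# Run times in minutes as shown on the Auto-Scaling chart (cores -> minutes).
run_minutes = {2: 370.0, 4: 192.4, 8: 90.3, 16: 51.6, 32: 31.5}


def scaling_table(times, baseline_cores=2):
    """Return (cores, minutes, speedup, efficiency) rows relative to the baseline run."""
    base = times[baseline_cores]
    rows = []
    for cores in sorted(times):
        speedup = base / times[cores]
        # Efficiency: achieved speedup divided by the ideal (linear) speedup.
        efficiency = speedup / (cores / baseline_cores)
        rows.append((cores, times[cores], speedup, efficiency))
    return rows


for cores, minutes, speedup, efficiency in scaling_table(run_minutes):
    print(f"{cores:>2} cores: {minutes:6.1f} min, "
          f"speedup {speedup:5.2f}x, efficiency {efficiency:.0%}")
```

Relative to the 2-core run, the 32-core run is roughly 11.7 times faster against an ideal of 16 times, so the chart shows close to linear scaling that tapers at the highest core counts.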


Full-Featured Data Preparation Functions


Analytics Functions For Deep Insights


www.pervasivebigdata.com

DataRush & Hadoop


Malstone Benchmark – Logfile Processing

• Web site logs

• 10 billion rows (nearly 1 terabyte)

• Aggregates site intrusion information

[Chart: Run Time vs. Total Cost of Ownership (TCO)]

• 20-node cluster, 4 cores per node: 14 hours

• Single machine, 32 cores: 31.5 minutes

• 26X difference!

*www.opencloudconsortium.org/benchmarks


Malstone Benchmark – Price/Performance


www.pervasivebigdata.com

DataRush & Hadoop & KNIME


Pervasive DataRush Plug-in for KNIME

[Screenshot callouts]

• DataRush Plug-Ins

• Drag and Drop to call DataRush for KNIME

• Retrospective Analytics


What’s new since 2011 KNIME Conference

• Major Additions:

– New “DeriveFields” Operator

– Two new Join types from our Hive (SQL in Hadoop) work

• Semi-Join and Anti-Join

– Range Partitioning

• New Functions:

– Many Data Preparation functions

• Hadoop & Big Data Operators:

– Extreme high-performance HBase read/write

– Other Hadoop reader/writers

• Avro, Syslog, Netflow, Flume HBase sink

– KNIME nodes for HBase and HDFS read/write
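For readers unfamiliar with the two join types named above, here is a minimal sketch of standard relational semi-join and anti-join semantics. It is plain Python over lists of dictionaries and does not represent the DataRush operator API. The function names and sample rows are invented for illustration.

```python
def semi_join(left, right, key):
    """Keep rows of `left` whose key occurs in `right`; no right-side columns are added."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left, right, key):
    """Keep rows of `left` whose key does NOT occur in `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data: site visits and a list of flagged sites.
visits = [
    {"site": "a.example", "hits": 10},
    {"site": "b.example", "hits": 3},
    {"site": "c.example", "hits": 7},
]
flagged = [{"site": "b.example"}, {"site": "b.example"}, {"site": "d.example"}]

print(semi_join(visits, flagged, "site"))  # visits to flagged sites
print(anti_join(visits, flagged, "site"))  # visits to sites that are not flagged
```

Unlike an inner join, the semi-join returns each left row at most once even when the right side holds duplicate keys, which is why `b.example` appears only once in the first result.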


What’s new since 2011 KNIME Conference (2)

• DataRush v6 (releasing later in 2012)

– Unified API/Composition model for scale-up SMP or scale-out Clusters

– Full Integration with NextGen MapReduce (DataRush as embedded dataflow computational alternative to coarse-grained MapReduce programming)

• DataRush for KNIME integration

– Continue the Krunner work (high-speed execution of contiguous DataRush nodes in a KNIME flow); make it work for DDR6 (Distributed DataRush v6, summer 2012)

– Standalone server or cluster execution of KNIME flows that contain only DataRush nodes


www.pervasivebigdata.com

Pervasive Big Data Stack


Moving from SDK to Consumable Products

[Diagram: stack with layers labeled Platform, Tools, Products, Solutions]

• Platform: Pervasive DataRush, Big Data Integration and Analytics Platform

• Tools and products shown: Pervasive Big Data Profiler, Pervasive Big Data Matcher, Pervasive Big Miner, Pervasive Big ETL, Pervasive BigOLAP, Pervasive Big BI, Pervasive Big Viz, Telecom Analyzer

• Solution areas shown: SCADA manufacturing; Cyber security; Marketing/advertising; Time series, event, analytics

• Add-ons: Hadoop add-ons (TurboRush); Eco system add-ons

• Other labels: Azure, BigTable…

Hardware

• Single server or cluster

• On-premises or in cloud

Data Sources

• Flat files

• Relational databases

• NoSQL databases

• Hadoop


Big Data (NoSQL) Tools

• TurboRush for HBase

• Big Tooling w/GUI

– BigIntegrator (aka PDI)

– BigETL (aka KNIME)

– BigBI

• Report, Chart, OLAP, Query

– BigMiner (aka KNIME)


Pervasive Data Integrator™ v10

• All Service Oriented / ESB

• Browser-based UI

• Deploy On-premises or Cloud

• Extensible and Embeddable

• New management capabilities

WEB INTERFACE

• Drag and drop palette

• Flexible workflow

• Auto or drag and map


Predictive Analytics in DataRush for KNIME


Big Data Capture and Analysis for Telecom

[Diagram: data sources feeding a Collect, Prepare, Analyze pipeline]

• Use cases: Customer Churn, Network Performance, Fraud detection, Revenue Assurance, Customer Experience, Least-Cost Routing, Vendor Performance

• Data sources: SaaS apps, Server/Web/App logs, In-house apps, Sensors/Switches/Routers, Partner data

• Collection tools: Flume, Snort, Esper

• Pipeline stages: Collect, Prepare, Analyze

• Collect steps: Monitor; Decrypt; Add timestamps; Log receipt; Store CSV, XLS; Store HDFS, HBase; Event ingest


What does it mean? Where is the fit good?

• KNIME is ready for Big Data! Just add DataRush

– Extreme scaling on modern commodity hardware: scale-up on Servers/Appliances, and scale-out on Clusters

– Native support for Hadoop and NoSQL

• Use cases already worked with DataRush for KNIME

– Telecomms CDR (Call Detail Records)

– Cybersecurity (Network and Weblog analytics)

– Life Sciences (Gene alignment and assembly)

– Financial Services and Healthcare

– General Data Mining (Clustering, Linear Regression, Decision Tree)

– Almost no limit to the use cases

• Well suited for:

– Machine generated “event” data (aka: log events)

– Long-running Analytic workloads (including Matching)

– Heavy “Data Prep” pre-processing

• Lacking Operators (today) for text, multimedia


www.pervasivebigdata.com

Thanks! Q&A


Big Data Benchmarks on Hadoop

Log file processing – Malstone benchmark

NOT FOR PUBLICATION

• Developed by the Open Cloud Consortium

• Benchmark related to web site visits and cyber infection status

• 10 billion row dataset with 100 bytes/row for a total of 1 Terabyte

| Configuration | Rows/sec | Rows/watt | Rows/$ |
|---|---|---|---|
| 20-nodes x 4 cores - Open Cloud Consortium cluster | | | |
| Grossman (Hadoop + Java MapReduce) [1] | 187,266 | 62,422 | 46,816,479 |
| Single server: 48-core, 64-disk "Hadoop Appliance" | | | |
| Pervasive 1 - Hadoop + Java MapReduce [2] | 75,597 | 88,938 | 110,630,075 |
| Pervasive 2 - Flat file + DataRush [3] | 3,267,974 | 3,844,675 | 4,782,400,765 |
| Pervasive 3 - HDFS/HBase + DataRush [3] | 6,024,096 | 7,087,172 | 8,815,750,808 |
| Performance ratio - Pervasive 3 vs Hadoop/MR cluster | 32x | 114x | 188x |
| Read-only performance - HDFS/HBase + DataRush [3] | 12,800,000 | 15,058,824 | 18,731,707,317 |

1. The MalStone Benchmark, TeraSort and Clouds For Data Intensive Computing – Robert Grossman, http://rgrossman.com/2009/05/25/malstone-benchmark. Java code probably not optimized.

2. Subject to further review and potential optimization

3. Early test results – all subject to further optimization
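The "Performance ratio" row can be reproduced from the figures in the table. The short sketch below divides the Pervasive 3 results by the Open Cloud Consortium cluster results for each metric. The dictionary keys and function name are illustrative, and the numbers are copied from the slide rather than measured independently.

```python
# Figures copied from the benchmark table above.
hadoop_cluster = {
    "rows_per_sec": 187_266,
    "rows_per_watt": 62_422,
    "rows_per_dollar": 46_816_479,
}
pervasive_3 = {
    "rows_per_sec": 6_024_096,
    "rows_per_watt": 7_087_172,
    "rows_per_dollar": 8_815_750_808,
}


def performance_ratios(candidate, baseline):
    """Divide each candidate metric by the matching baseline metric."""
    return {metric: candidate[metric] / baseline[metric] for metric in baseline}


ratios = performance_ratios(pervasive_3, hadoop_cluster)
for metric, value in ratios.items():
    print(f"{metric}: {value:.0f}x")
```

Rounded to whole numbers, the three ratios match the 32x, 114x, and 188x values reported on the slide.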


[Diagram: Big Data Platform architecture]

• Sources: Structured Data (ERP, CRM, APPs); Events (Devices, Syslog)

• Event Collection Framework: Collectors, HBase Sinks

• Big Data Platform: Hadoop (HDFS, HBase)

• Processing labels: ETL, Integration, Data Prep, Aggregates (RDBMS), OLAP Engine

• Interface labels: SQL/MED, JDBC, XMLA, KNIME Wrapper, Query

• End User Tools: Real-time Visualization, Reporting, OLAP, Data Mining, ETL


www.pervasivebigdata.com

Big Data Solutions


Telecom Provider Challenges

[Diagram]

• Data sources: Switches / Network Elements, Off-net Usage, OSS/BSS Data

• Departments: Corporate, Sales/Marketing, Network OPS, Customer Care, Information Technology

• Problem areas: Vendor Performance, Pricing optimization, Product/Service Offers, Operational Performance, Profitability Analysis, Customer Experience, Capacity Optimization, Network Performance, Churn, Segment Insights, Usage Trends

• Flow labels: Continuously Integrate, Problem Solving


Pervasive DataRush™

DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications

• Scalable

• High Throughput

• Cost Efficient

• Easy to Implement

• Extensible


Business Issues

• Time to decision is critical

– Missed opportunities; wasted resources

– Customer issue reaction is too slow

• Deeper granularity of data is critical

– Understanding of trends is needed

– Pricing optimization

– Vendor performance

• Decision time - from days to minutes

– Deeper understanding of operational issues

– Which situations are problematic (or not)


Pervasive DataRush and Hadoop

• DataRush embedded within Hadoop

– Reduce complexities of MapReduce experience

– Increased efficiencies = significantly faster run times

– Cloudera Certification

[Diagram: Hadoop Mappers and Reducers over the Hadoop Distributed File System, with a DataRush instance shown for each Mapper and Reducer]

Malstone B, 0.5 TB:

• DataRush in Hadoop: 33 mins

• Hadoop: 135 mins


Pervasive DataRush™

DataRush is a parallel dataflow platform that eliminates performance bottlenecks in your data-intensive applications

• Scalable: Performance dynamically scales with increased core/server counts. No change to the code.

• High Throughput: Patented parallel dataflow technology enables fast, deep analysis of large data sets with no limit on input data size.

• Cost Efficient: Fully exploit commodity multicore servers – save significant capital and energy costs via efficient node utilization.

• Easy to Implement: DataRush takes care of complex parallel processing issues at design time: hides threading complexity; no deadlocks; runs on any platform, including Hadoop; etc.

• Extensible: DataRush is a component-based platform with an open API so you can easily extend it for your own needs.


DataRush Release Timeline

Timeline axis: CQ1 2011, CQ2 2011, CQ3 2011, CQ4 2011, CQ1 2012, CQ2 2012

DataRush 5.0 (January 2011)

• Distributed DR

• KNIME

• Performance

DataRush 5.0.1 (March 2011, ongoing …)

• Bug fixes

• Targeted features

DataRush 5.1 (December 2011)

• Hadoop and Hive integration

• I-Labs connectivity

• KNIME 2.4.1

• Bug fixes

DataRush 6 (TBD)

• Fully distributed composition and library

• Distributed execution in KNIME

• Next Gen MapReduce (?)

TurboRush for Hive 0.9

• Hive accelerator

• Limited release


www.pervasivebigdata.com

DataRush & KNIME


KNIME Introduction

• Open source workflow for data mining

• Desktop designer

– Eclipse based (RCP app and plug-in)

– Node based architecture

• Nodes provide connectivity, transformations, algorithms, …

• Extensible model: user developed nodes supported

– Drag and drop, graphical editing of projects

– Project execution from GUI

– Workflow model – each node executes completely before next node is invoked


Predictive Analytics in DR-KNIME


Profiling in DR-KNIME


www.pervasivebigdata.com

NextGen Sequencing and Genomic Pipelines


NGS data explosion


Convert/filter FastA/FastQ files


Align/order/assemble


Report/visualize matching/coverage


www.pervasivebigdata.com

Q & A


www.pervasivebigdata.com

Big Data Products


Pervasive Big Data (NoSQL) Tools

• TurboRush for HBase

• Big Tooling w/GUI

– BigIntegrator

– BigBI

• Report, Chart, OLAP, Query

– BigMiner

– BigSearch


BigIntegrator: HBase as Source or Target


BigIntegrator: Visual Mapping to/from HBase


BigBI (aka BigQuery)


www.pervasivebigdata.com

DataRush & KNIME


DataRush + KNIME – what is it?

• Plug-in of DataRush v5.1 to KNIME v3.2?

• Adds extreme high-performance data preparation and analytic functions

• Adds support for Hadoop data sources (both HDFS and HBase)

• Adds special dataflow “k-runner” mode that recognizes adjacent DataRush nodes and executes entirely in memory by “flowing” data from node to node

• KNIME functionality can be further extended with the DataRush SDK and Scripting
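The difference between KNIME's node-at-a-time workflow model and the "k-runner" dataflow mode can be illustrated with a small sketch. It uses plain Python generators as a stand-in for dataflow operators. It is single-threaded, all names are hypothetical, and it is not the k-runner implementation. It only contrasts materializing every intermediate table with letting rows flow from step to step.

```python
def read_rows(n):
    """Source step: produce n sample rows one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


def keep_even_ids(rows):
    """Filter step: pass through rows with an even id."""
    for row in rows:
        if row["id"] % 2 == 0:
            yield row


def add_running_total(rows):
    """Transform step: attach a running total of `value` to each row."""
    total = 0
    for row in rows:
        total += row["value"]
        yield {**row, "running_total": total}


def run_node_at_a_time(n):
    """Workflow style: each step finishes and stores its full output before the next starts."""
    table = list(read_rows(n))
    table = list(keep_even_ids(table))
    return list(add_running_total(table))


def run_streaming(n):
    """Dataflow style: steps are chained and each row flows through all of them in turn."""
    return add_running_total(keep_even_ids(read_rows(n)))


print(run_node_at_a_time(6))
print(list(run_streaming(6)))
```

Both functions produce the same rows. The streaming version never holds a complete intermediate table, which is the property the slide attributes to executing adjacent DataRush nodes in memory.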


Pervasive RushMiner

Visual Environment for Big Data Analytics and Preparation

• Quickly cleanse, profile and aggregate big data

• Use data mining, predictive analytics, and machine learning to uncover actionable intelligence

• Works with flat files, relational databases, NoSQL databases, and the Hadoop filesystem (HDFS)

• High performance, scales up to terabytes of data

• Design on your desktop using a simple drag-and-drop interface

• Execute on desktop, remote server, or clusters, including Hadoop clusters


Event Processing with DataRush

• Capture ALL data

• Discover previously unavailable patterns, correlations, etc.

• Scalable to meet growing needs

Processed 100 Million Syslog events in 58 seconds on a 48 core system. A sustained run rate of 14 Tb per day
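The quoted throughput can be sanity-checked with a few lines of arithmetic. The sketch below converts "100 million events in 58 seconds" into events per second and per day, then derives the average event size implied by 14 Tb per day. Reading "Tb" as 10^12 bytes is an assumption. If terabits were meant, the implied size would be one eighth as large.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(events, seconds):
    """Return (events per second, events per day) for a sustained run."""
    per_second = events / seconds
    return per_second, per_second * SECONDS_PER_DAY


per_second, per_day = sustained_rate(100_000_000, 58)

# Assumption: "14 Tb per day" means 14 * 10**12 bytes per day.
implied_bytes_per_event = 14e12 / per_day

print(f"{per_second:,.0f} events/sec")
print(f"{per_day:,.0f} events/day")
print(f"{implied_bytes_per_event:.0f} bytes/event implied")
```

Under that assumption the implied size is a little under 100 bytes per event, the same order as the 100 bytes/row used in the Malstone dataset.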
