R and KNIME: The Best of Two Worlds

Michael R. Berthold
University of Konstanz, Germany
KNIME.com AG, Switzerland

The Berkeley R Language Beginner Study Group
Nov 19, 2013

Source: files.meetup.com/3182622/2013_11_19_BerkeleyR_Meetup.pdf

Agenda

• KNIME Overview
• Demo / Intro
• Interactive R Nodes
• A few Examples
• Q&A

A Brief History of KNIME

2004: KNIME development commences
2006: KNIME v1 released
2006: Spin-off in Konstanz, Germany
2006-2007: First commercial partners
2008: KNIME moves to Zurich
2010: Enterprise products released
2011: KNIME.com AG founded
2013: KNIME comes to the West Coast…

+3000 Organizations Using KNIME
~30% Life Science
~70% Business Intelligence, Analytics
+50 Very Active Community Developers

“KNIME saved my life in a world of scripts that I do not want to learn!” (2012)

Who’s Using KNIME?

• >17,000 Individuals
• ~3,000 Organizations worldwide
• ~300 KNIME.com Customers

The KNIME Platform

KNIME loads and integrates data from diverse data sources:
• Different databases
• Various file formats (CSV, XML, SDF, etc.)

KNIME provides a huge repository of modules for easy-to-use, modular
• Data preprocessing
• Data fusion
• Data transformation

In addition to standard data mining techniques, KNIME adds cutting-edge data analysis algorithms (…thanks to its academic roots).

Interactive views provide data overviews and insights into the learned models. Interactive linking and brushing techniques allow for powerful exploration of models and data.

KNIME

Due to its open API and “node-in-a-sandbox” approach, additional (also external) tools are easily integrated, e.g.
• Access to the statistics tool R
• Complete integration of the machine learning library WEKA
• Application-area-specific integration, e.g. CDK (Chemistry Development Kit), RDKit, ImageJ, …

KNIME is Eclipse-based: integrating other Eclipse projects such as BIRT, DTP, etc. provides even more functionality.

KNIME Selected Node Highlights

Over 1000 native and embedded nodes included:

• Statistics, Data Mining, Time Series, Image Processing, Neighborgrams, Web Analytics, Text Mining, Network Analysis, Social Media Analysis, WEKA, R
• Database Support, ETL, Text Processing, Data Generation, XML Read/Write, PMML Read/Write, Social Media Analysis, Business Intelligence, Community Nodes, 3rd Party Nodes
• Advanced Visualization

Community Contributors

[Diagram levels: Technology Partners, Distribution & Consulting Partners, Community Contributors, Community User Base]

Academic Institutions:
• Universität Tübingen (BALL, OpenMS)
• Freie Universität Berlin (SeqAn)
• MPI Dresden (ImgLib)
• Universität Dresden (Palladin)
• ETH Zürich (OpenBIS)
• Dublin University (OMERO)
• University of Wisconsin (ImageJ2)
• …

Commercial Contributors:
• Dymatrix Consulting Group (Uplift Nodes)
• Eli Lilly (ChemInf suite)
• Novartis (RDKit, Indigo)
• Vernalis (Proteomics)
• Cenix (SOAP Nodes)
• Böhringer-Ingelheim (various sponsored nodes)
• …

Community User Base

[Diagram levels: Technology Partners, Distribution & Consulting Partners, Community Contributors, Community User Base]

[Chart: Annual User Group Meeting Attendees, y-axis 0 to 200, x-axis Oct-06 to Aug-13]

Dr. Rosaria Silipo (consultant)
Simon Richards (Eli Lilly)
Mike Mazanetz (Evotec)

What can I do with KNIME?

Standardization

Data Integration

Tool Integration – Version A

Tool Integration – Version B

Big Data: Clustering Meter IDs

• 30 clusters with k-Means on average daily, monthly, hourly, ... kW values
• Average hourly time series cluster by cluster
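The clustering step on this slide can be illustrated with a minimal k-means sketch (Lloyd's algorithm) in plain Java. This is not the KNIME k-Means node and not the workflow from the talk: the class name `LoadProfileKMeans`, the four tiny "meter profiles", and the naive initialization (the first k profiles become the starting centroids) are all assumptions made to keep the example short and deterministic. The slide uses 30 clusters; the demo uses 2.

```java
import java.util.Arrays;

/** Minimal k-means (Lloyd's algorithm) on fixed-length load profiles. Illustrative only. */
final class LoadProfileKMeans {

    /** Squared Euclidean distance between two profiles of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given profile. */
    static int nearest(double[] profile, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(profile, centroids[c]) < squaredDistance(profile, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Clusters the profiles into k groups and returns one cluster index per profile.
     * Assumption: the first k profiles serve as initial centroids (naive but deterministic).
     */
    static int[] cluster(double[][] profiles, int k, int maxIterations) {
        if (k < 1 || k > profiles.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of profiles");
        }
        int dims = profiles[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = profiles[c].clone();
        }
        int[] assignment = new int[profiles.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every profile to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < profiles.length; i++) {
                int c = nearest(profiles[i], centroids);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no profile switched cluster
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < profiles.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[i]][d] += profiles[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical data: four meters, each with three average kW values.
        double[][] profiles = {
            {1.0, 1.2, 0.9},
            {5.0, 5.5, 5.2},
            {1.1, 1.0, 1.0},
            {4.8, 5.1, 5.4}
        };
        System.out.println(Arrays.toString(cluster(profiles, 2, 100)));
    }
}
```

In practice each row would hold one meter's averaged daily, monthly, or hourly kW values, and a better initialization would normally replace the first-k shortcut used here.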

KNIME and Big Data

• Big ETL
• Big Analytics
• Big Data(bases)

What else is KNIME used for?

And more…:
• Next Best Offer
• Survey Analysis
• (Big) Time Series Data
• …

Commercial KNIME (Attention – sales pitch!)

Tools for Collaboration:
• KNIME TeamSpace
• KNIME Server
• Training, Consulting, and Custom Development.

Standardization: KNIME TeamSpace at Work

Standardization: The KNIME Server in its element

Resources

http://www.knime.org/learning-hub
• Links to Guides, White Papers, Documentation, and the KNIME YouTube Channel
• Tons of example workflows!

http://www.knime.org/knimepress
• Books for Beginners, Advanced KNIME Users, and SAS Users.

Free Beginner’s Guide – use Code “meetupsf13”

The R in KNIME Webinar: http://www.youtube.com/watch?v=wCvnO96d8h4

Demo.

Why use KNIME and R?

• Powerful statistics
• Leading edge algorithms
• Powerful/flexible graphics
• Widely accepted language
• Powerful user interface
• Designed for big data
• Integrates com and org tools
• Enterprise grade solutions
• Open source analytics
• Cross platform
• Vibrant communities

[Diagram labels: R, KNIME]

R in KNIME: 3 ways to play…

• Community (Rserve Integration)
• Core (soon to be deprecated)
• R Interactive (Today's topic)

Overview of R (Interactive)

• Different input and output options
• Grey ports enable workspace branching

The Interactive Editor

[Screenshot labels: Columns, Variables, Code Editor, Workspace Overview, Console]

Templates

[Screenshot labels: Preview, List, Summary]

Node: R Source

• Get data from an R data frame

• Assign output to knime.out

• Use with foreign, RCurl, or ...

Node: R Snippet

• Generic data manipulation

• Derive knime.out from knime.in

• Use with grep(), plyr, or ...

Nodes: R Mining

• Use R models in KNIME

• Learner & Predictor motif

• PMML support for portability
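The Learner & Predictor motif separates fitting a model from applying it, with the model passed between the two as its own object. A minimal sketch of that split in plain Java follows, assuming a one-variable least-squares fit. The names `LinearModel`, `LinearLearner`, and `LinearPredictor` are hypothetical; this is neither the KNIME node API nor the R code the nodes would run.

```java
/** The model object handed from the learner to the predictor. */
final class LinearModel {
    final double slope;
    final double intercept;

    LinearModel(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }
}

/** "Learner" step: fits y = slope * x + intercept by ordinary least squares. */
final class LinearLearner {
    static LinearModel learn(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = sxy / sxx;
        return new LinearModel(slope, meanY - slope * meanX);
    }
}

/** "Predictor" step: applies a previously learned model to new inputs. */
final class LinearPredictor {
    static double[] predict(LinearModel model, double[] x) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = model.slope * x[i] + model.intercept;
        }
        return out;
    }
}
```

Because the predictor depends only on the model object, the model could in principle be serialized and applied elsewhere, which is the portability role the slide assigns to PMML.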

Nodes: R View

• Generic R plots

• plot(knime.in)

• Use with many packages including ggplot2

Metanodes and R: Quickforms

Metanodes and R: Deployment

• Abstract: Configure w/ simple dialog

• Share (TeamSpace/Server)

• Deploy (KNIME Webportal)

Embedding plots in BIRT

• Generate plots in R
• Send to BIRT

EQPOL Data with Bioconductor I

• External Quality Assurance Program Oversight Laboratory
• NIH, NIAID, DAIDS program for QA of HIV/AIDS research
• Can machine learning automate some manual analysis?
• Problem: Lots of real data (~100,000,000 rows)
• Bioconductor provides flowCore to make this easier

EQPOL Data with Bioconductor II

(Node) Development

KNIME Platform: Technology Overview

[Architecture diagram labels:
• KNIME Workflow Manager & User Interface
• KNIME Data Management and Execution Layer: Execution Control, Meta Data Handling, Data Management
• Node groups, each with its own Node Interface and Data Mgmt & Execution Ctrl: KNIME I/O, KNIME Native Algorithms, Open Source Integrations (R, BIRT, …), Partner Extensions, Community Extensions
• Platform capabilities: Cluster Execution, Multi Core Execution, Distributed Data Storage, Distributed Execution, In Memory Data Handling, Automatic Data Caching, Data Type Extensions]

Node Architecture

• KNIME interacts only with a Node
• Node takes care of embedding the node in the infrastructure
• New nodes implement Model/View/Dialog

[Class diagram: class Node (final), class NodeDialogPane (abstract), class NodeView (abstract), class NodeModel (abstract), class NodeFactory (abstract)]
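The division of labor on this slide (a final wrapper that the platform talks to, abstract Model/View/Dialog classes that node authors extend, and a factory that creates them) can be sketched in self-contained Java. All class names below carry a `Sketch` prefix or are otherwise invented to make clear that this is a simplified illustration and not the actual KNIME API; real signatures, settings handling, and data types differ.

```java
import java.util.ArrayList;
import java.util.List;

/** The "Model": holds settings and does the actual work. */
abstract class SketchNodeModel {
    abstract List<String> execute(List<String> input);
}

/** The "View": presents the model's result. */
abstract class SketchNodeView {
    abstract String render(List<String> output);
}

/** The "Dialog": transfers user settings into the model. */
abstract class SketchNodeDialog {
    abstract void applyTo(SketchNodeModel model);
}

/** The factory creates the three parts for one kind of node. */
abstract class SketchNodeFactory {
    abstract SketchNodeModel createModel();
    abstract SketchNodeView createView();
    abstract SketchNodeDialog createDialog();
}

/** Final wrapper: the framework only ever talks to this class. */
final class SketchNode {
    private final SketchNodeModel model;
    private final SketchNodeView view;

    SketchNode(SketchNodeFactory factory) {
        this.model = factory.createModel();
        this.view = factory.createView();
        factory.createDialog().applyTo(model); // push the settings into the model
    }

    String run(List<String> input) {
        return view.render(model.execute(input));
    }
}

/** Hypothetical example node: prepends a configurable prefix to every row. */
final class PrefixModel extends SketchNodeModel {
    String prefix = "";

    @Override
    List<String> execute(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String row : input) {
            out.add(prefix + row);
        }
        return out;
    }
}

final class PrefixFactory extends SketchNodeFactory {
    @Override
    SketchNodeModel createModel() {
        return new PrefixModel();
    }

    @Override
    SketchNodeView createView() {
        return new SketchNodeView() {
            @Override
            String render(List<String> output) {
                return String.join(",", output);
            }
        };
    }

    @Override
    SketchNodeDialog createDialog() {
        return new SketchNodeDialog() {
            @Override
            void applyTo(SketchNodeModel model) {
                ((PrefixModel) model).prefix = "row-"; // stands in for a user-entered setting
            }
        };
    }
}
```

A node author in this sketch writes only `PrefixModel` and `PrefixFactory`; the wrapper wires the parts together, which mirrors the slide's point that the platform interacts only with the Node.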

Node Extension Wizard

• Included in the KNIME Developer Version

• Allows creation of plugin projects including functioning KNIME nodes (with sample code)

• Helpful to easily create all node classes
  – Generates all Java classes
  – Node is registered with the plugin project
  – Launch KNIME and enjoy the new node working!

Node Extension Wizard

Node Extension Wizard

• Specify all settings to create a new KNIME node
  – In a completely new plugin project, or
  – Into an existing project
• Node type: Sink, Source, Learner, Predictor, Manipulator, Visualizer, Meta, or Other
• Include sample code or not

Node Extension Wizard

• Contains all Java classes (including sample code)
• Node is registered in the plugin.xml
• NodeDialog and NodeView class are also created and registered to the NodeFactory