The Data Scientist's Workplace of the Future, IBM developerDays 2014, Vienna, by Romeo Kienzler



The Data Scientist's Workplace of the Future - Workshop SwissRE, 11.6.14

Romeo Kienzler

IBM Center of Excellence for Data Science, Cognitive Systems and BigData (a joint venture between IBM Research Zurich and the IBM Innovation Center DACH)

Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg


The Data Scientist's Workplace of the Future

* * C R E D I T S * *

Romeo Kienzler

IBM Innovation Center

● Parts of these slides have been copied from and/or revised by:
● Dr. Anand Ranganathan, IBM Watson Research Lab
● Dr. Stefan Mück, IBM BigData Leader Europe
● Dr. Berthold Rheinwald, IBM Almaden Research Lab
● Dr. Diego Kuonen, Statoo Consulting
● Dr. Abdel Labbi, IBM Zurich Research Lab
● Brandon MacKenzie, IBM Software Group


What is Data Science?

Source: Statoo.com http://slidesha.re/1kmNiX0


Data Science at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (single-node usage)
  ● Main memory
  ● CPU <-> main memory bandwidth
  ● CPU
  ● Storage <-> main memory bandwidth (either single node or SAN)


Data Science at present - Demo
● Assume a 1 TB file on the hard drive
● Split it into 16 chunks
  ● split -d -n 16 output.json
● Distribute the chunks across 4 nodes
  ● for i in $(seq -w 0 15); do scp x$i id@node$(( 10#$i % 4 + 1 )):~/; done
● Perform the calculation in parallel
  ● for n in $(seq 1 4); do ssh id@node$n 'cat x* | awk -F":" "{print \$6}" | grep -i samsung | grep breathtaking | wc -l'; done > result
● Merge the results
  ● awk '{s += $1} END {print s}' result

Source: http://sergeytihon.wordpress.com/2013/03/20/the-data-science-venn-diagram/


What is BIG data?


What is BIG data?

[Diagram: Big Data and Hadoop]

What is BIG data?

[Diagram: Business Intelligence and the Data Warehouse]

BigData == Hadoop?

[Diagram: Hadoop is only a part of Big Data]

What is beyond the “Data Warehouse”?

[Diagram: the Data Lake as the next step beyond the Data Warehouse]

First “BigData” use case?
● Google index
  ● 40 × 10^9 = 40.000.000.000 => 40 billion pages indexed
  ● Will break the 100 PB barrier soon
  ● Originally derived from MapReduce
  ● Now “Caffeine”, based on “Percolator”
    ● Incremental vs. batch
    ● In-memory vs. disk

Map-Reduce → Hadoop → BigInsights


BigData use cases
● CERN LHC
  ● 25 petabytes per year
● Facebook
  ● Hive data warehouse
  ● 300 PB, growing by 600 TB/day
  ● > 100 k servers
● Genomics
● Enterprises
  ● Data center analytics (logfiles, OS/NW monitors, ...)
  ● Predictive maintenance, cybersecurity
  ● Social media analytics
  ● DWH offload
  ● Call Detail Record (CDR) data preservation

http://www.balthasar-glaettli.ch/vorratsdaten/

Why is Big Data important?


BigData Analytics

Source: http://www.strategy-at-risk.com/2008/01/01/what-we-do/


BigData Analytics – Predictive Analytics

"sometimes it's not who has the best algorithm that wins; it's who has the most data."

(C) Google Inc.

The Unreasonable Effectiveness of Data¹

¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

No sampling => work with the full dataset => no more p-values/z-scores

We need Data Parallelism


Aggregated bandwidth between CPU, main memory and hard drive

1 TB (at 10 GByte/s per node):

- 1 node: 100 sec

- 10 nodes: 10 sec

- 100 nodes: 1 sec

- 1000 nodes: 100 msec
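These numbers are plain arithmetic; a quick back-of-the-envelope check in R, assuming perfectly linear scale-out:

# Time to scan 1 TB at 10 GByte/s aggregated bandwidth per node,
# assuming the work parallelizes perfectly across nodes.
bytes    <- 1e12                 # 1 TB
per_node <- 10e9                 # 10 GByte/s per node
nodes    <- c(1, 10, 100, 1000)
data.frame(nodes, seconds = bytes / (per_node * nodes))
# 100 s, 10 s, 1 s, 0.1 s (= 100 msec)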


Fault Tolerance / Commodity Hardware

AMD Turion II Neo N40L (2× 1.5 GHz / 2 MB / 15 W), 8 GB RAM,
3 TB SEAGATE Barracuda 7200.14
< CHF 500

CHF 100k => 200 × (2 cores, 8 GB RAM, 3 TB disk) => 400 cores, 1.6 TB RAM, 600 TB raw disk (~200 TB usable with 3× replication)

MTBF per node ~365 d => a node failure somewhere in the cluster every ~1.5–2 d

Source: http://www.cloudcomputingpatterns.org/Watchdog
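The MTBF figure follows from the same arithmetic; a minimal sketch in R, assuming independent node failures at a constant rate:

# With ~200 commodity nodes, each failing on average once per year, the
# expected time between failures somewhere in the cluster shrinks to
# mtbf_node / n -- hence watchdogs and replication in software.
mtbf_node <- 365       # days per node (slide's assumption)
n_nodes   <- 200
mtbf_node / n_nodes    # ~1.8 days between node failures cluster-wide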


NoSQL Databases

Column stores
– Hadoop / HBase
– Cassandra
– Amazon SimpleDB

JSON / document stores
– MongoDB
– CouchDB

Key/value stores
– Amazon DynamoDB
– Voldemort

Graph DBs
– DB2 SPARQL extension
– Neo4j

MPP RDBMS
– DB2 DPF, DB2 pureScale, PureData for Operational Analytics
– Oracle RAC
– Greenplum

http://nosql-database.org/ lists > 150 systems

CAP theorem (Brewer's theorem)¹: a distributed computer system cannot simultaneously guarantee all three of the following properties
– Consistency (all nodes see the same data at the same time)
– Availability (every request receives a response indicating whether it succeeded or failed)
– Partition tolerance (the system continues to operate despite failure of part of the system)

What about ACID?
– Atomicity
– Consistency
– Isolation
– Durability

BASE, the new ACID
– Basically Available
– Soft state
– Eventual consistency
  • Monotonic read consistency
  • Monotonic write consistency
  • Read your own writes
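To make eventual consistency concrete, a toy R sketch (an illustration only, not any real database protocol): a write lands on one replica first, and reads against the other replica stay stale until replication catches up.

# Two replicas of one value; replication is asynchronous.
replicas <- list(A = "v1", B = "v1")

replicas$A <- "v2"        # write goes to replica A only
replicas$B                # "v1": a stale read, no strong consistency

replicas$B <- replicas$A  # replication eventually catches up
replicas$B                # "v2": the replicas have converged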


What role is the cloud playing here?


“Elastic” Scale-Out

Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload


“Elastic” Scale-Out of

CPU cores, storage and memory

“Elastic” Scale-Out – linear

Source: http://www.cloudcomputingpatterns.org/Elastic_Platform


How do Databases Scale-Out?

Shared Disk Architectures


How do Databases Scale-Out?

Shared Nothing Architectures


Hadoop?

Shared Nothing Architecture?

Shared Disk Architecture?


Data Science on Hadoop

SQL (42%)

R (33%)

Python (26%)

Excel (25%)

Java, Ruby, C++ (17%)

SPSS, SAS (9%)

[Diagram: the Data Science toolset meets Hadoop]

Large-scale data ingestion
● Traditionally (sketched after this list)
  ● Crawl to the local file system (e.g. wget http://www.heise.de/newsticker/)
  ● Export RDBMS data to CSV (local file system)
  ● Batched FTP server uploads
  ● Then: copy to HDFS
● BigInsights
  ● Use one of the built-in importers
  ● Imports directly into HDFS
  ● Use the Eclipse tooling to deploy custom importers easily
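A minimal sketch of the traditional path in R (the URL is the slide's own example; the HDFS target path is made up for illustration):

# Crawl one page to the local file system, then copy it into HDFS
# via the Hadoop command line (requires a Hadoop client on the machine).
download.file("http://www.heise.de/newsticker/", destfile = "newsticker.html")
system("hadoop fs -put newsticker.html /user/biadmin/crawl/")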


Large-scale data ingestion (ETL on M/R)
● Modern ETL (Extract, Transform, Load) tools support Hadoop as
  ● Source and sink (HDFS)
  ● Engine (MapReduce)
● Example: InfoSphere DataStage

Real-time / in-memory data ingestion
● If volume can be reduced dramatically during the first processing steps (see the sketch after this list)
  ● Feature extraction of
    ● Video
    ● Audio
    ● Semi-structured text (e.g. logfiles)
    ● Structured text
  ● Filtering
  ● Compression
● Recommendation: use a streaming engine
  ● IBM InfoSphere Streams
  ● Twitter Storm (now an Apache incubator project)
  ● Apache Spark Streaming
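As a toy illustration of that volume reduction (plain chunk-wise R, not a streaming engine; the logfile name is hypothetical):

# Reduce raw log lines to one tiny feature record per window: only the
# error count per 10,000-line window is kept for downstream processing.
con <- file("access.log", open = "r")
repeat {
  lines <- readLines(con, n = 10000)
  if (length(lines) == 0) break
  errors <- sum(grepl("ERROR", lines))   # feature extraction + filtering
  cat(format(Sys.time()), errors, "\n")
}
close(con)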


SQL on Hadoop
● IBM BigSQL (ANSI SQL-92 compliant)
● Hive (SQL dialect)
● Cloudera Impala
● Lingual
● ...

BigSQL V3.0 – ANSI SQL-92 compliant

IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run all 99 TPC-DS queries and all 22 TPC-H queries without modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql

BigSQL V3.0 – Architecture

BigSQL V3.0 – Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (medium, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

BigSQL V3.0 – Demo (small)

CREATE EXTERNAL TABLE trace (
  hour integer,
  employeeid integer,
  departmentid integer,
  clientid integer,
  date string,
  timestamp string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';

select count(hour), hour from trace group by hour order by hour;

-- This command runs on 32 GB / ~650.000.000 rows in HDFS

BigSQL V3.0 – Demo (small)


R on Hadoop
● IBM BigR (based on SystemML, an IBM Almaden Research project)
● RHadoop
● RHIPE
● ...

BigR (based on SystemML)
Example: Gaussian Non-negative Matrix Factorization (GNMF)

package gnmf;

import java.io.IOException;import java.net.URISyntaxException;

import org.apache.hadoop.fs.FileSystem;import org.apache.hadoop.fs.Path;import org.apache.hadoop.mapred.JobConf;

public class MatrixGNMF{ public static void main(String[] args) throws IOException, URISyntaxException { if(args.length < 10) { System.out.println("missing parameters"); System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] " + "[k] [num mappers] [num reducers] [replication] [working directory] " + "[final directory of w] [final directory of h]"); System.exit(1); } String vDir = args[0]; String wDir = args[1]; String hDir = args[2]; int k = Integer.parseInt(args[3]); int numMappers = Integer.parseInt(args[4]); int numReducers = Integer.parseInt(args[5]); int replication = Integer.parseInt(args[6]); String outputDir = args[7]; String wFinalDir = args[8]; String hFinalDir = args[9]; JobConf mainJob = new JobConf(MatrixGNMF.class); String vDirectory; String wDirectory; String hDirectory; FileSystem.get(mainJob).delete(new Path(outputDir)); vDirectory = vDir; hDirectory = hDir; wDirectory = wDir; String workingDirectory; String resultDirectoryX; String resultDirectoryY; long start = System.currentTimeMillis(); System.gc(); System.out.println("starting calculation"); System.out.print("calculating X = WT * V... "); workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication, UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k); resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication, workingDirectory, outputDir); FileSystem.get(mainJob).delete(new Path(workingDirectory)); System.out.println("done"); System.out.print("calculating Y = WT * W * H... "); workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication, wDirectory, outputDir); resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory, UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir); FileSystem.get(mainJob).delete(new Path(workingDirectory)); System.out.println("done"); System.out.print("calculating H = H .* X ./ Y... "); workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication, hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k); System.out.println("done"); FileSystem.get(mainJob).delete(new Path(resultDirectoryX)); FileSystem.get(mainJob).delete(new Path(resultDirectoryY)); System.out.print("storing back H... "); FileSystem.get(mainJob).delete(new Path(hDirectory)); hDirectory = workingDirectory; System.out.println("done"); System.out.print("calculating X = V * HT... "); workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication, UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k); resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication, workingDirectory, outputDir); FileSystem.get(mainJob).delete(new Path(workingDirectory)); System.out.println("done"); System.out.print("calculating Y = W * H * HT... "); workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication, hDirectory, outputDir); resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory, UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir); FileSystem.get(mainJob).delete(new Path(workingDirectory)); System.out.println("done"); System.out.print("calculating W = W .* X ./ Y... "); workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication, wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k); System.out.println("done"); FileSystem.get(mainJob).delete(new Path(resultDirectoryX)); FileSystem.get(mainJob).delete(new Path(resultDirectoryY)); System.out.print("storing back W... 
"); FileSystem.get(mainJob).delete(new Path(wDirectory)); wDirectory = workingDirectory; System.out.println("done"); long requiredTime = System.currentTimeMillis() - start; long requiredTimeMilliseconds = requiredTime % 1000; requiredTime -= requiredTimeMilliseconds; requiredTime /= 1000; long requiredTimeSeconds = requiredTime % 60; requiredTime -= requiredTimeSeconds; requiredTime /= 60; long requiredTimeMinutes = requiredTime % 60; requiredTime -= requiredTimeMinutes; requiredTime /= 60; long requiredTimeHours = requiredTime;}}

package gnmf;

import gnmf.io.MatrixObject;import gnmf.io.MatrixVector;import gnmf.io.TaggedIndex;

import java.io.IOException;import java.util.Iterator;

import org.apache.hadoop.fs.Path;import org.apache.hadoop.mapred.FileInputFormat;import org.apache.hadoop.mapred.FileOutputFormat;import org.apache.hadoop.mapred.JobClient;import org.apache.hadoop.mapred.JobConf;import org.apache.hadoop.mapred.MapReduceBase;import org.apache.hadoop.mapred.Mapper;import org.apache.hadoop.mapred.OutputCollector;import org.apache.hadoop.mapred.Reducer;import org.apache.hadoop.mapred.Reporter;import org.apache.hadoop.mapred.SequenceFileInputFormat;import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep2{ static class UpdateWHStep2Mapper extends MapReduceBase implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector> { @Override public void map(TaggedIndex key, MatrixVector value, OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter) throws IOException { out.collect(key, value); } } static class UpdateWHStep2Reducer extends MapReduceBase implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject> { @Override public void reduce(TaggedIndex key, Iterator<MatrixVector> values, OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter) throws IOException { MatrixVector result = null; while(values.hasNext()) { MatrixVector current = values.next(); if(result == null) { result = current.getCopy(); } else { result.addVector(current); } } if(result != null) { out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X), new MatrixObject(result)); } } } public static String runJob(int numMappers, int numReducers, int replication, String inputDir, String outputDir) throws IOException { String workingDirectory = outputDir + System.currentTimeMillis() + "-UpdateWHStep2/";

JobConf job = new JobConf(UpdateWHStep2.class); job.setJobName("MatrixGNMFUpdateWHStep2"); job.setInputFormat(SequenceFileInputFormat.class); FileInputFormat.setInputPaths(job, new Path(inputDir)); job.setOutputFormat(SequenceFileOutputFormat.class); FileOutputFormat.setOutputPath(job, new Path(workingDirectory)); job.setNumMapTasks(numMappers); job.setMapperClass(UpdateWHStep2Mapper.class); job.setMapOutputKeyClass(TaggedIndex.class); job.setMapOutputValueClass(MatrixVector.class); job.setNumReduceTasks(numReducers); job.setReducerClass(UpdateWHStep2Reducer.class); job.setOutputKeyClass(TaggedIndex.class); job.setOutputValueClass(MatrixObject.class); JobClient.runJob(job); return workingDirectory;

}}

package gnmf;

import gnmf.io.MatrixCell;import gnmf.io.MatrixFormats;import gnmf.io.MatrixObject;import gnmf.io.MatrixVector;import gnmf.io.TaggedIndex;

import java.io.IOException;import java.util.Iterator;

import org.apache.hadoop.filecache.DistributedCache;import org.apache.hadoop.fs.FileSystem;import org.apache.hadoop.fs.Path;import org.apache.hadoop.mapred.FileInputFormat;import org.apache.hadoop.mapred.FileOutputFormat;import org.apache.hadoop.mapred.JobClient;import org.apache.hadoop.mapred.JobConf;import org.apache.hadoop.mapred.MapReduceBase;import org.apache.hadoop.mapred.Mapper;import org.apache.hadoop.mapred.OutputCollector;import org.apache.hadoop.mapred.Reducer;import org.apache.hadoop.mapred.Reporter;import org.apache.hadoop.mapred.SequenceFileInputFormat;import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class UpdateWHStep1{ public static final int UPDATE_TYPE_H = 0; public static final int UPDATE_TYPE_W = 1; static class UpdateWHStep1Mapper extends MapReduceBase implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject> { private int updateType; @Override public void map(TaggedIndex key, MatrixObject value, OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter) throws IOException { if(updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL) { MatrixCell current = (MatrixCell) value.getObject(); out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL), new MatrixObject(new MatrixCell(key.getIndex(), current.getValue()))); } else { out.collect(key, value); } } @Override public void configure(JobConf job) { updateType = job.getInt("gnmf.updateType", 0); } } static class UpdateWHStep1Reducer extends MapReduceBase implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector> { private double[] baseVector = null; private int vectorSizeK; @Override public void reduce(TaggedIndex key, Iterator<MatrixObject> values, OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter) throws IOException { if(key.getType() == TaggedIndex.TYPE_VECTOR) { if(!values.hasNext()) throw new RuntimeException("expected vector"); MatrixFormats current = values.next().getObject(); if(!(current instanceof MatrixVector)) throw new RuntimeException("expected vector"); baseVector = ((MatrixVector) current).getValues(); } else { while(values.hasNext()) { MatrixCell current = (MatrixCell) values.next().getObject(); if(baseVector == null) { out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR), new MatrixVector(vectorSizeK)); } else { if(baseVector.length == 0) throw new RuntimeException("base vector is corrupted"); MatrixVector resultingVector = new MatrixVector(baseVector); resultingVector.multiplyWithScalar(current.getValue()); if(resultingVector.getValues().length == 0) throw new RuntimeException("multiplying with scalar failed"); out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR), resultingVector); } } baseVector = null; } } @Override public void configure(JobConf job) { vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0); if(vectorSizeK == 0) throw new RuntimeException("invalid k specified"); } } public static String runJob(int numMappers, int numReducers, int replication, int updateType, String matrixInputDir, String whInputDir, String outputDir, int k) throws IOException {

Java Implementation

(>1500 lines of code)

Equivalent SystemML Implementation

(10 lines of code)

Experimenting with multiple variants!

W = W * max(V %*% t(H) - alphaW * JW, 0) / (W %*% H %*% t(H))
H = H * max(t(W) %*% V - alphaH * JH, 0) / (t(W) %*% W %*% H)

W = W * ((S * V) %*% t(H)) / ((S * (W %*% H)) %*% t(H))
H = H * (t(W) %*% (S * V)) / (t(W) %*% (S * (W %*% H)))

W = W * ((V / (W %*% H)) %*% t(H)) / (E %*% t(H))
H = H * (t(W) %*% (V / (W %*% H))) / (t(W) %*% E)
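For intuition, a minimal in-memory sketch of the first variant in plain R (not SystemML/BigR; the regularization terms alphaW/alphaH are dropped for brevity, so the max(., 0) is not needed):

# Gaussian NMF by multiplicative updates: factor V into W %*% H.
set.seed(1)
V <- matrix(runif(200), 20, 10)    # data matrix to factorize
k <- 3
W <- matrix(runif(20 * k), 20, k)
H <- matrix(runif(k * 10), k, 10)
for (i in 1:100) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H))
}
norm(V - W %*% H, "F")             # reconstruction error shrinks over iterations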


BigR (based on SystemML)
SystemML compiles hybrid runtime plans, ranging from in-memory single-machine (CP) to large-scale cluster (MR) computation.

● Challenge
  ● Guaranteed hard memory constraints (budget of the JVM size)
  ● for arbitrarily complex ML programs

● Key technical innovations
  ● CP & MR runtime: single-machine & MR operations, integrated runtime
  ● Caching: reuse and eviction of in-memory objects
  ● Cost model: accurate time and worst-case memory estimates
  ● Optimizer: cost-based runtime plan generation
  ● Dynamic recompiler: re-optimization for initial unknowns

[Chart: runtime vs. data size. Hybrid plans (CP → CP/MR → MR) gradually exploit MR parallelism: high-performance computing for small data sizes, scalable computing for large data sizes]

BigR Architecture

[Diagram: R clients connect through IBM R packages to the SystemML statistics engine and the data sources; (1) embedded R execution, (2) pull data (summaries) to the R client, or (3) push R functions right onto the data]

BigR Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (medium, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

BigR Demo (small)

library(bigr)

bigr.connect(host="bigdata", port=7052, database="default",
             user="biadmin", password="xxx")

is.bigr.connected()

tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric","numeric","numeric","numeric","character","character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=F, useMapReduce=T)

h <- bigr.histogram.stats(tbr$V1, nbins=24)

BigR Demo (small)

  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000
...

BigR Demo (small)


BigR Demo (small)

jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)  # this command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()

BigR Demo (small)

Sampling, resampling, bootstrapping
vs.
whole-dataset processing

What is your experience?
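As input for that discussion, a small R sketch contrasting the two approaches on simulated data: with the whole dataset the mean is simply computed, while a 1% sample only estimates it, with sampling error attached.

set.seed(42)
full <- rnorm(1e7, mean = 10)             # stand-in for the whole dataset
smp  <- sample(full, length(full) / 100)  # 1% sample
mean(full)                                # exact: no inference needed
mean(smp) + c(-1, 1) * 1.96 * sd(smp) / sqrt(length(smp))  # 95% CI from the sample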


Python on Hadoop


SPSS on Hadoop


BigSheets Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (medium, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

BigSheets Demo (small)


BigSheets Demo (small)

This command runs on 32 GB / ~650.000.000 rows in HDFS

BigSheets Demo (small)


Text Extraction (SystemT, AQL)


And if this is not enough? → The BigData AppStore

BigData AppStore, Eclipse Tooling
● Write your apps in
  ● Java (MapReduce)
  ● Pig Latin, Jaql
  ● BigSQL / Hive / BigR
● Deploy them to BigInsights via Eclipse
● Automatically schedule and update
  ● HDFS files
  ● BigSQL tables
  ● BigSheets collections

Questions?

http://www.ibm.com/software/data/bigdata/

Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps


DFT/Audio Analytics (as promised)

library(tuneR)

# White noise plus a sine tone: the squared DFT shows a clear peak
# at the sine frequency
a <- readWave("whitenoisesine.wav")
f <- fft(a@left)
jpeg('rplot_wnsine.jpg')
plot(Re(f)^2)
dev.off()

# Pure white noise: no dominant peak in the spectrum
a <- readWave("whitenoise.wav")
f <- fft(a@left)
jpeg('rplot_wn.jpg')
plot(Re(f)^2)
dev.off()

# Hand the samples over to BigR
a <- readWave("whitenoisesine.wav")
brv <- as.bigr.vector(a@left)
al <- as.list(a@left)

Backup Slides


Map-Reduce

Source: http://www.cloudcomputingpatterns.org/Map_Reduce
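A minimal word count in base R, mimicking the map, shuffle and reduce phases of the pattern (illustrative only; on Hadoop the same shape runs distributed, e.g. via Hadoop Streaming):

docs <- c("big data on hadoop", "data science on hadoop")

mapped   <- unlist(strsplit(docs, " "))           # map: emit one key per word
shuffled <- split(mapped, mapped)                 # shuffle: group values by key
reduced  <- vapply(shuffled, length, integer(1))  # reduce: count per key
reduced                                           # big=1 data=2 hadoop=2 on=2 science=1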
