
Analyse von großen Datensätzen in den Lebenswissenschaften (192217)

Christof Schütte & Tim Conrad

Session 6

Overview

• Recap: what happened so far
• Scale it up! Flink @ Compute Cluster
• Assignments:
  • You can resubmit one assignment to get full points
• Next week:
  • For each assignment: best solution
  • Start of project

Recap: Where are we?

What Is Big Data?
• There is no consensus on how to define big data

“Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2011

Characteristics of big data

Source: IBM methodology

Big data spans four dimensions: Volume, Velocity, Variety, and Veracity

Taming BIG DATA

o Divide & Conquer
o Partition large problem into smaller “independent” sub-problems
  – Can be handled by different workers:
    • Threads in a processor core
    • Cores in a multi/many-core processor
    • Multiple processors in a machine
    • Multiple machines in a cluster
    • Multiple clusters in a cloud
    • …
  (a small single-machine sketch of this pattern follows below)
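As a toy single-machine illustration of the same pattern (a minimal sketch, not from the slides; class and variable names are made up): split the input among worker threads, process the parts independently, combine the partial results.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
  public static void main(String[] args) throws Exception {
    long[] data = new long[10_000_000];
    Arrays.fill(data, 1L);                          // toy input

    int workers = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    List<Future<Long>> parts = new ArrayList<>();

    int chunk = data.length / workers;
    for (int w = 0; w < workers; w++) {
      final int from = w * chunk;
      final int to = (w == workers - 1) ? data.length : from + chunk;
      // each worker handles an "independent" sub-problem
      parts.add(pool.submit(() -> {
        long s = 0;
        for (int i = from; i < to; i++) s += data[i];
        return s;
      }));
    }

    long total = 0;
    for (Future<Long> p : parts) total += p.get();  // combine the partial results
    pool.shutdown();

    System.out.println("sum = " + total);
  }
}

The same divide/process/combine structure reappears at every level of the list above, from threads up to clusters in a cloud.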


Taming BIG DATA

o Divide & Conquer
o Partition large problem into smaller “independent” sub-problems
o How can distribution of work be done?

MapReduce

• MapReduce [OSDI’04] provides
  • Automatic parallelization, distribution
  • I/O scheduling
    • Load balancing
    • Network and data transfer optimization
  • Fault tolerance
    • Handling of machine failures
• Need more power: Scale out, not up!
  • Large number of commodity servers as opposed to some high-end specialized servers


MapReduce workflow

[MapReduce workflow diagram: the input data is split (Split 0/1/2); map workers read the splits and write intermediate results to local disk; reduce workers perform a remote read and sort, then write Output File 0 and Output File 1.]

Map: extract something you care about from each record.
Reduce: aggregate, summarize, filter, or transform.

Example: Word Count
http://kickstarthadoop.blogspot.ca/2011/04/word-count-hadoop-map-reduce-example.html
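The linked post walks through the classic Hadoop word count. A minimal sketch along the same lines (standard Hadoop MapReduce API; input and output paths are taken from the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all counts that arrive for the same word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}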

MapReduce

[Diagram: a Hadoop program forks a master and several workers; the master assigns map and reduce tasks; map workers read the input splits and write intermediate results to local disk; reduce workers perform a remote read and sort and write the output files.]

Problem: peta-scale data has to be transferred through the network.

Google File System (GFS) / Hadoop Distributed File System (HDFS)
• Split the data and store 3 replicas on commodity servers


MapReduce

[Diagram: the same workflow as before, but the input splits live in HDFS. The master asks the HDFS NameNode for the location of the chunks of input data and assigns map tasks accordingly, so each worker can read its split from local disk instead of over the network.]

Scalding jobs subclass Job
Yep, we’re counting words: (the slide shows a Scalding word-count job as an example)

Suitable for your task if
• You have a cluster
• You are working with a large dataset
• You are working with independent data (or it can be assumed independent)
• The problem can be cast into map and reduce


Pros & Cons

• MapReduce and its variants greatly simplified big data analytics by hiding scaling and faults

• However, these systems provide a restricted programming model

Acyclic data flow
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Diagram: Input → several Map stages → Reduce stages → Output]

Inefficiency of MapReduce

[Diagram: mappers emit <k1, v1> pairs, which are shuffled all-to-all to reducers that produce <k2, v2> and <k3, v3> outputs.]

Blocking: Reduce does not start until all Map tasks are completed

Other reasons?

• Intermediate results shipping: all-to-all
• Write to disk and read from disk in each step, although the data does not change in loops

Inefficiency of MapReduce

• Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
  • Iterative algorithms (machine learning, graphs)
  • Interactive data mining tools (R, Excel, Python)
• With current frameworks, apps reload data from stable storage on each query

Example: Logistic Regression
• Goal: find best line separating two sets of points

[Figure: two point sets, a random initial line, and the target separating line]

Example 1: Logistic Regression

// Load the points once and reuse them in every iteration
val data = textFile(...).map(readPoint)

// Random initial parameter vector
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  // Gradient of the logistic loss, computed in parallel over the data
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Logistic Regression Performance

• MapReduce: 127 s / iteration
• Iteration-optimized framework (Spark, Flink, …): first iteration 174 s, further iterations 6 s

Example 2: matrix completion

[Figure: a sparsely observed matrix with only a few known entries; the goal is to fill in the missing ones.]

Matrix Completion (MC)

Let L0 ∈ R^(n×m) be a low-rank matrix, r := rank(L0) << n, m

Alternating least squares
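A minimal sketch of the usual factorization formulation behind this (notation is mine, not from the slide): with Ω the set of observed entries of the data matrix M and target rank r, one solves

\min_{U \in \mathbb{R}^{n \times r},\; V \in \mathbb{R}^{m \times r}} \; \sum_{(i,j) \in \Omega} \big( M_{ij} - (U V^{\top})_{ij} \big)^2 \;+\; \lambda \big( \|U\|_F^2 + \|V\|_F^2 \big)

Alternating least squares keeps V fixed and solves a small regularized least-squares problem for each row of U, then keeps U fixed and solves for each row of V, and repeats until convergence. Each sub-problem is convex and the per-row solves are independent, which is why ALS fits the iterative, data-parallel frameworks (Spark, Flink) discussed above.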

Example 3: Node Importance in Social Networks

• General idea is that some nodes are more important than others in terms of the structure of the graph

• In a directed graph, “in-degree” may be a useful indicator of importance

• e.g., for a citation network among authors (or papers):
  • in-degree is the number of citations => “importance”
• However: “in-degree” is only a first-order measure in that it implicitly assumes that all edges are of equal importance

PageRank Algorithm

1. Crawl the Web to get nodes (pages) and links (hyperlinks) [highly non-trivial problem!]

2. Weights from each page = 1 / (# of outlinks)
3. Solve for the eigenvector r (for λ = 1) of the weight matrix

Computational problem:
• Solving an eigenvector equation scales as O(n^3)
• For the entire Web graph n > 10 billion (!!)
• So a direct solution is not feasible

Can use the power method (iterative):

r^(k+1) = W^T r^(k)   for k = 1, 2, …
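A minimal sketch of that iteration on a tiny hand-made three-page graph (illustrative only; full PageRank also adds a damping factor and treats dangling pages):

public class PowerIteration {
  public static void main(String[] args) {
    // W[i][j] = 1 / (# outlinks of page i) if page i links to page j, else 0
    double[][] W = {
        {0.0, 0.5, 0.5},   // page 0 links to pages 1 and 2
        {0.0, 0.0, 1.0},   // page 1 links to page 2
        {1.0, 0.0, 0.0}    // page 2 links to page 0
    };
    int n = W.length;

    double[] r = new double[n];
    java.util.Arrays.fill(r, 1.0 / n);            // start from the uniform vector

    for (int k = 0; k < 50; k++) {                // r^(k+1) = W^T r^(k)
      double[] next = new double[n];
      for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
          next[j] += W[i][j] * r[i];
      r = next;
    }

    System.out.println(java.util.Arrays.toString(r));  // converges to ~[0.4, 0.2, 0.4]
  }
}

Each step is just a matrix-vector product, which is exactly the kind of bulk operation that distributes well over an iteration-optimized framework.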

Frameworks? Flink

Native workload support


[Diagram: Flink at the center, surrounded by the workloads it targets: streaming topologies, long batch pipelines, machine learning at scale, and graph analysis.]

How can an engine natively support all these workloads? And what does "native" mean?

Apache Flink

Program compilation


// Transitive closure: iteratively extend paths by joining them with the edge set
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}

[Diagram: program compilation pipeline. Pre-flight (client): the type extraction stack and the optimizer turn the program into a dataflow graph plus metadata. Master: task scheduling, deploying operators, tracking intermediate results. Workers execute the dataflow, e.g. DataSource (orders.tbl) → Filter → Map and DataSource (lineitem.tbl), both hash-partitioned on field [0], feeding a hybrid hash Join (build HT / probe), followed by a sort-based GroupRed and a forward to the sink.]

Philosophy

• Flink “hides” its internal workings from the user
• This is good:
  • User does not worry about how jobs are executed
  • Internals can be changed without breaking changes
• … and bad:
  • Execution model is more complicated to explain compared to MapReduce or Spark RDDs


Recap: DataSet

[Diagram: Input → Operator X → First → Operator Y → Second]

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> input = env.readTextFile(inputPath);   // inputPath: path to the input file

DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));

DataSet<String> second = first.filter(str -> str.length() > 40);

second.print();
env.execute();

Common misconception

• Programs are not executed eagerly
• Instead, the system compiles the program to an execution plan and executes that plan (see the sketch below)
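One way to see this in practice (a minimal sketch; inputPath and outputPath are placeholder strings): getExecutionPlan() returns the compiled plan as JSON without processing any data, and only execute() actually runs the job.

String inputPath = "file:///path/to/input";    // placeholder paths
String outputPath = "file:///path/to/output";

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Nothing is executed here: these calls only build up the plan.
env.readTextFile(inputPath)
   .filter(str -> str.contains("Apache Flink"))
   .writeAsText(outputPath);

// Dump the compiled execution plan as JSON (still no data is processed).
System.out.println(env.getExecutionPlan());

// env.execute() would then actually run the job.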


Write once, run everywhere


> bin/flink run prg.jar

[Diagram: the same program JAR runs everywhere. A packaged program submitted with bin/flink run goes to the cluster master; RemoteEnvironment.execute() ships the program to a remote master via RPC and serialization; LocalEnvironment.execute() spawns an embedded multi-threaded environment inside the local JVM.]

Submitting a Flink job

• /bin/flink (command line)
• RemoteExecutionEnvironment (from a local or remote Java app; see the sketch below)
• Web Frontend (GUI)
• Scala Shell
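A minimal sketch of the remote-environment route (host, port, and JAR path below are placeholders for your own cluster setup):

// Submitting from a Java program: connect to the cluster's JobManager
// and ship the program JAR along with the job.
ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
    "jobmanager-host", 6123, "/path/to/prg.jar");          // placeholders

DataSet<String> input = env.readTextFile("hdfs:///path/to/input");  // placeholder path

// count() triggers execution on the remote cluster and returns the result
long n = input.filter(str -> str.contains("Apache Flink")).count();
System.out.println(n + " matching lines");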

Web Frontends – Web Job Client
• Select jobs and preview plan
• Understand optimizer choices

Web Frontends – Job Manager
• Overall system status
• Job execution details
• Task Manager resource utilization

Debugging on a cluster

• Good old "system out" debugging
• Get a logger:

  private static final Logger LOG = LoggerFactory.getLogger(YourJob.class);

• Start logging:

  LOG.info("elementCount = {}", elementCount);

• You can also use System.out.println().

Getting logs on a cluster

• The logs are located in each TaskManager’s log/ directory.

• ssh there and read the logs.

Flink Logs

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 0.9-SNAPSHOT, Rev:2e515fc, Date:27.05.2015 @ 11:24:23 CEST)

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: robert

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.75-b04

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 736 MiBytes

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: (not set)

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -XX:MaxPermSize=256m

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms768m

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx768m

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog.file=/home/robert/incubator-flink/build-target/bin/../log/flink-robert-jobmanager-robert-da.log

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/home/robert/incubator-flink/build-target/bin/../conf/log4j.properties

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/home/robert/incubator-flink/build-target/bin/../conf/logback.xml

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments:

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir

11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - /home/robert/incubator-flink/build-target/bin/../conf

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - local

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --streamingMode

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - batch

11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------

11:42:39,469 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /home/robert/incubator-flink/build-target/bin/../conf

11:42:39,525 INFO org.apache.flink.runtime.jobmanager.JobManager - Security is not enabled. Starting non-authenticated JobManager.

11:42:39,525 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager

11:42:39,527 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system at localhost:6123.

11:42:40,189 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started

11:42:40,316 INFO Remoting - Starting remoting

11:42:40,569 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@localhost:6123]

11:42:40,573 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor


(The log shows, in order: build information, JVM details, init messages.)

The Framework of Big Data

Table API


Table API Overview
• Makes analysis of structured data very easy
• Evaluates SQL-like expressions
• Code generation for execution
• Tight integration with DataSet API
  • Convert DataSet to Table and back

Table API Overview

• Basic data structure is a Table

• Table is structured data with named fields
  • Similar to a relational table

• Expressions evaluated on a Table yield a new Table


Table API Expressions

Table t = orig.as("author, title, pages");

// filter table
Table t2 = t.filter("pages > 100");

// project table
Table t3 = t.select("author, title");
Table t4 = t.select("pages * 2 as dPages");

// group table and compute aggregations
Table t5 = t.groupBy("author").select("pages.avg as avgPages");

// join two tables
Table t6 = t.join(t.select("author2, title2"))
    .where("author = author2")
    .select("title, title2");

DataSet to Table

Java DataSet API via TableEnvironment:

// data set
DataSet<Tuple3<String, Long, Double>> ds = …;

// get a TableEnvironment
TableEnvironment tEnv = new TableEnvironment();

// convert the data set to a Table and name its fields
Table t = tEnv.toTable(ds).as("name, count, price");

Table to DataSet
• Java DataSet API via TableEnvironment
• Convert to a custom POJO data set
  • POJO fields must map to Table fields

public static class Stock {
  public String name;
  public int count;
  public double price;
}

Table t = x.as("name, count, price");

TableEnvironment tEnv = new TableEnvironment();

DataSet<Stock> ds = tEnv.toSet(t, Stock.class);

Table to DataSet
• Java DataSet API via TableEnvironment
• Convert to DataSet<Row>
• Most valuable for printing

Table t = x.as("name, count, price");
TableEnvironment tEnv = new TableEnvironment();
tEnv.toSet(t, Row.class).print();