Analysis of Large Datasets in the Life Sciences (192217)
Christof Schütte & Tim Conrad
Session 6
Overview
• Recap: what happened so far
• Scale it up! Flink @ Compute Cluster
• Assignments:
  • You can resubmit one assignment to get full points
• Next week:
  • For each assignment: best solution
  • Start of project
What Is Big Data?
• There is no consensus on how to define big data:
“Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2011
Characteristics of big data
Big data spans four dimensions: Volume, Velocity, Variety, and Veracity.
Source: IBM methodology
Taming BIG DATA
o Divide & Conquer
o Partition a large problem into smaller “independent” sub-problems, which can be handled by different workers:
  • Threads in a processor core
  • Cores in a multi/many-core processor
  • Multiple processors in a machine
  • Multiple machines in a cluster
  • Multiple clusters in a cloud
  • …
[Diagram: an abstraction layer distributes the work over the data.]
o How can the distribution of work be done?
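As a minimal single-machine illustration of this partitioning idea (a sketch using nothing beyond the JDK; the numbers are arbitrary), Java's parallel streams split a range into independent chunks, process each chunk on a separate core, and combine the partial results:

import java.util.stream.LongStream;

public class ParallelSum {
    public static void main(String[] args) {
        long n = 100_000_000L;

        // Sequential baseline: one worker handles the whole problem.
        long sequential = LongStream.rangeClosed(1, n).sum();

        // Divide & conquer: the runtime partitions the range into
        // independent sub-ranges, sums each on a separate core, and
        // combines the partial sums.
        long parallel = LongStream.rangeClosed(1, n).parallel().sum();

        System.out.println(sequential == parallel); // true: same result
    }
}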
MapReduce
• MapReduce [OSDI’04] provides:
  • Automatic parallelization and distribution
  • I/O scheduling
  • Load balancing
  • Network and data transfer optimization
  • Fault tolerance
  • Handling of machine failures
• Need more power? Scale out, not up!
  • Large number of commodity servers, as opposed to a few high-end specialized servers
MapReduce workflow
[Diagram: MapReduce workflow — the input data is split (Split 0, 1, 2) and read by map workers, which write intermediate results to local disk; reduce workers fetch them via remote read, sort them, and write the output files (Output File 0, 1).]
Map: extract something you care about from each record.
Reduce: aggregate, summarize, filter, or transform.
Example: Word Count
Source: http://kickstarthadoop.blogspot.ca/2011/04/word-count-hadoop-map-reduce-example.html
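The linked post builds the classic Hadoop word-count job. As a rough sketch of the canonical implementation (class names follow Hadoop's own examples, not the course's code): the map step emits a (word, 1) pair for every token, and the reduce step sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input record
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}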
MapReduce
[Diagram: detailed MapReduce execution — the Hadoop program forks a master and workers; the master assigns map and reduce tasks. Map workers read the input splits (Split 0, 1, 2) and write intermediate results to local disk; reduce workers do a remote read and sort, then write the output files (Output File 0, 1). Note: this transfers peta-scale data through the network.]
Google File System (GFS) / Hadoop Distributed File System (HDFS)
• Split the data and store 3 replicas of each chunk on commodity servers
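For HDFS, the replication factor is an ordinary setting (dfs.replication); a minimal hdfs-site.xml fragment, assuming a stock Hadoop installation, pins it to the default of 3:

<!-- hdfs-site.xml: each block is stored on 3 different DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>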
MapReduce
[Diagram: MapReduce on top of HDFS — the master asks the HDFS NameNode where the chunks of the input data are located, and assigns map tasks to workers that already hold those chunks, so maps read from local disk. Reduce workers do a remote read and sort, then write the output files.]
Suitable for your task if you:
• Have a cluster
• Are working with a large dataset
• Are working with independent data (or data that can be assumed independent)
• Can cast the problem into map and reduce steps
Pros & Cons
• MapReduce and its variants greatly simplified big data analytics by hiding the details of scaling and fault handling
• However, these systems provide a restricted programming model
Acyclic data flow
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage:
[Diagram: Input → Map / Map / Map → Reduce / Reduce → Output]
Inefficiency of MapReduce
[Diagram: mappers emit <k1, v1> pairs that are shuffled all-to-all to the reducers, which emit <k2, v2> and <k3, v3> pairs.]
• Blocking: Reduce does not start until all Map tasks are completed
• Other reasons?
  • Intermediate results shipping: all to all
  • Write to disk and read from disk in each step, although the data does not change in loops
Inefficiency of MapReduce
• Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
  • Iterative algorithms (machine learning, graphs)
  • Interactive data mining tools (R, Excel, Python)
• With current frameworks, apps reload the data from stable storage on each query
Example: Logistic Regression
• Goal: find the best line separating two sets of points
[Plot: two point clouds, a random initial line, and the target separating line reached by iterating.]
Example 1: Logistic Regression
val data = textFile(...).map(readPoint)  // load the points
var w = Vector.random(D)                 // random initial separating plane

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance
• Hadoop MapReduce: 127 s / iteration
• Iteration-optimized framework (Spark, Flink, …): first iteration 174 s, further iterations 6 s
• The working set is kept in memory after the first pass, so later iterations avoid re-reading and re-writing the data on disk
Example 3: Node Importance in Social Networks
• The general idea is that some nodes are more important than others in terms of the structure of the graph
• In a directed graph, “in-degree” may be a useful indicator of importance
  • e.g., for a citation network among authors (or papers), the in-degree is the number of citations => “importance”
• However, “in-degree” is only a first-order measure, in that it implicitly assumes that all edges are of equal importance
PageRank Algorithm
1. Crawl the Web to get nodes (pages) and links (hyperlinks) [a highly non-trivial problem!]
2. Weights from each page = 1/(# of outlinks)
3. Solve for the eigenvector r (for λ = 1) of the weight matrix

Computational problem:
• Solving an eigenvector equation directly scales as O(n³)
• For the entire Web graph, n > 10 billion (!!)
• So a direct solution is not feasible

Instead, use the power method (iterative):

r^(k+1) = W^T r^(k)   for k = 1, 2, …
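To make step 3 concrete, here is a toy, dense-matrix sketch of the power method in Java (no damping factor, hypothetical names; real Web-scale versions use sparse matrices on a cluster):

import java.util.Arrays;

public class PowerMethod {

    // w[i][j] = 1/outdegree(i) if page i links to page j, else 0
    public static double[] pageRank(double[][] w, int iterations) {
        int n = w.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n); // start from the uniform distribution

        for (int k = 0; k < iterations; k++) {
            double[] next = new double[n]; // next = W^T * r
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[j] += w[i][j] * r[i];
                }
            }
            r = next;
        }
        return r; // approximates the eigenvector for λ = 1
    }
}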
Native workload support
How can an engine natively support all these workloads? And what does "native" mean?
[Diagram: Flink at the center, surrounded by the workloads it supports natively — streaming topologies, long batch pipelines, machine learning at scale, and graph analysis.]
Program compilation
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  // one iteration step: extend each known path by one edge
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}
[Diagram: program compilation pipeline — pre-flight (client): the program passes through the type extraction stack and the optimizer, producing a dataflow graph plus dataflow metadata; the master handles task scheduling, deploys the operators to the workers, and tracks intermediate results. Example physical plan: DataSource orders.tbl → Filter → Map and DataSource lineitem.tbl, both hash-partitioned [0], feed a hybrid hash join (build HT / probe), followed by a sort-based GroupRed and a forward to the sink.]
Philosophy
• Flink “hides” its internal workings from the user
• This is good:
  • The user does not worry about how jobs are executed
  • Internals can be changed without breaking changes
• … and bad:
  • The execution model is more complicated to explain compared to MapReduce or Spark RDDs
Recap: DataSet
[Diagram: Input → (Operator X) → First → (Operator Y) → Second]
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// inputPath: path to the input text file
DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));

DataSet<String> second = first.filter(str -> str.length() > 40);

second.print();
env.execute();
Common misconception
• Programs are not executed eagerly
• Instead, the system compiles the program to an execution plan and executes that plan
Write once, run everywhere
> bin/flink run prg.jar
[Diagram: the same packaged program JAR can be submitted to a cluster master (RemoteEnvironment.execute() talks to the master via RPC & serialization) or run via LocalEnvironment.execute(), which spawns an embedded multi-threaded environment in the local JVM.]
Submitting a Flink job
• /bin/flink (command line)
• RemoteExecutionEnvironment (from a local or remote Java app; see the sketch below)
• Web Frontend (GUI)
• Scala Shell
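A minimal sketch of the RemoteExecutionEnvironment route — the host, port, and paths below are placeholders for your own setup, and the JAR must contain the job's classes:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
        // Connect to a running JobManager; ship the job JAR along.
        ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host", 6123, "/path/to/your-job.jar");

        DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");

        lines.filter(s -> s.contains("Apache Flink"))
             .writeAsText("hdfs:///path/to/output");

        env.execute("remote filter job");
    }
}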
Web Frontends – Job Manager
• Overall system status
• Job execution details
• Task Manager resource utilization
Debugging on a cluster
• Good old system-out debugging
• Get a logger:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOG = LoggerFactory.getLogger(YourJob.class);

• Start logging:

LOG.info("elementCount = {}", elementCount);

• You can also use System.out.println().
Getting logs on a cluster
• The logs are located in each TaskManager’s log/ directory.
• ssh there and read the logs.
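For example (the host and the exact log file name depend on your installation and user):

> ssh <taskmanager-host>
> tail -n 100 log/flink-<user>-taskmanager-<host>.log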
Flink Logs
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 0.9-SNAPSHOT, Rev:2e515fc, Date:27.05.2015 @ 11:24:23 CEST)
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: robert
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.75-b04
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 736 MiBytes
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: (not set)
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -XX:MaxPermSize=256m
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms768m
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx768m
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog.file=/home/robert/incubator-flink/build-target/bin/../log/flink-robert-jobmanager-robert-da.log
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/home/robert/incubator-flink/build-target/bin/../conf/log4j.properties
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/home/robert/incubator-flink/build-target/bin/../conf/logback.xml
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments:
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir
11:42:39,233 INFO org.apache.flink.runtime.jobmanager.JobManager - /home/robert/incubator-flink/build-target/bin/../conf
11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode
11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - local
11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --streamingMode
11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - batch
11:42:39,234 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
11:42:39,469 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /home/robert/incubator-flink/build-target/bin/../conf
11:42:39,525 INFO org.apache.flink.runtime.jobmanager.JobManager - Security is not enabled. Starting non-authenticated JobManager.
11:42:39,525 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager
11:42:39,527 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system at localhost:6123.
11:42:40,189 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
11:42:40,316 INFO Remoting - Starting remoting
11:42:40,569 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://[email protected]:6123]
11:42:40,573 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
The log above shows the build information, the JVM details, and the init messages.
Table API Overview
• Makes analysis of structured data very easy
• Evaluates SQL-like expressions
• Code generation for execution
• Tight integration with the DataSet API
  • Convert a DataSet to a Table and back
Table API Overview
• The basic data structure is a Table
• A Table is structured data with named fields
  • Similar to a relational table
• Expressions evaluated on a Table yield a new Table
Table API Expressions

// rename fields
Table t = orig.as("author, title, pages");
// filter table
Table t2 = t.filter("pages > 100");
// project table
Table t3 = t.select("author, title");
Table t4 = t.select("pages*2 as dPages");
// group table and compute aggregations
Table t5 = t.groupBy("author").select("pages.avg as avgPages");
// join two tables (rename the fields to avoid ambiguity)
Table t6 = t.join(t.select("author as author2, title as title2"))
    .where("author = author2")
    .select("title, title2");
DataSet to Table
Java DataSet API via TableEnvironment:

// data set
DataSet<Tuple3<String, Long, Double>> ds = …;
// get a TableEnvironment
TableEnvironment tEnv = new TableEnvironment();
// convert the data set to a Table and give names to the fields
Table t = tEnv.toTable(ds).as("name, count, price");
Table to DataSet
• Java DataSet API via TableEnvironment
• Convert to a custom POJO data set
  • The POJO fields must map to the Table fields

public static class Stock {
    public String name;
    public int count;
    public double price;
}

Table t = x.as("name, count, price");

TableEnvironment tEnv = new TableEnvironment();

DataSet<Stock> ds = tEnv.toSet(t, Stock.class);