SALSA Group Research Activities
April 27, 2011
Research Overview
• MapReduce Runtime
  • Twister
  • Azure MapReduce
• Dryad and Parallel Applications
• NIH Projects
  • Bioinformatics Workflow
  • Data Visualization – GTM/MDS/PlotViz
• Education
Twister & Azure MapReduce
What is Twister?
Twister is an Iterative MapReduce Framework which supports
• Customized static input data partition
• Cacheable map/reduce tasks
• Combining operation to converge intermediate outputs to main program
• Fault recovery between iterations
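The features above can be sketched as a small driver loop. This is a minimal illustration in Python, not Twister's actual API: the function names (`run_iterative_mapreduce`, `map_task`, `reduce_task`, `combine`) and the 1-D K-means workload are assumptions chosen for the example.

```python
from collections import defaultdict


def map_task(points, centroids):
    """Map: emit (nearest centroid index, (point, 1)) for each cached point."""
    out = []
    for x in points:
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        out.append((nearest, (x, 1)))
    return out


def reduce_task(pairs):
    """Reduce: sum the points and counts per centroid index."""
    sums = defaultdict(lambda: [0.0, 0])
    for key, (x, n) in pairs:
        sums[key][0] += x
        sums[key][1] += n
    return dict(sums)


def combine(reduced, old_centroids):
    """Combine: gather reduce outputs in the main program as new centroids."""
    return [
        reduced[k][0] / reduced[k][1] if k in reduced else old_centroids[k]
        for k in range(len(old_centroids))
    ]


def run_iterative_mapreduce(partitions, centroids, max_iters=20, tol=1e-9):
    """Repeat map -> reduce -> combine until the centroids stop moving."""
    # Static input is partitioned once and stays cached across iterations.
    cached_partitions = [list(p) for p in partitions]
    iteration = 0
    for iteration in range(1, max_iters + 1):
        intermediate = []
        for part in cached_partitions:
            intermediate.extend(map_task(part, centroids))
        new_centroids = combine(reduce_task(intermediate), centroids)
        shift = max(abs(a - b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, iteration
```

Only the small, changing data (the centroids) is passed into each iteration; the partitions are loaded once, which is the point of cacheable tasks.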
Twister Programming Model
Twister Architecture
Applications and Performance
MapReduceRoles for Azure
MapReduce framework for Azure Cloud
• Built using highly-available and scalable Azure cloud services
  • Distributed, highly scalable & highly available services
  • Minimal management / maintenance overhead
  • Reduced footprint
• Hides the complexity of cloud & cloud services from the users
• Co-exists with eventual consistency & high latency of cloud services
• Decentralized control avoids single point of failure
MapReduceRoles for Azure
• Supports dynamic scaling up and down of the compute resources
• Fault tolerance
• Combiner step
• Web based monitoring console
• Easy testing and deployment
Twister for Azure
[Figure: Twister for Azure job flow. Job Start → Map / Combine tasks (with Data Cache) → Reduce → Merge → Add Iteration? Yes: hybrid scheduling of the new iteration. No: Job Finish.]
Iterative MapReduce Framework for Microsoft Azure Cloud.
• Merge Step
• In-Memory Caching of static data
• Cache aware hybrid scheduling using Queues as well as using a bulletin board
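One way to picture cache-aware hybrid scheduling is a worker that first looks on the bulletin board for a task whose input it already holds in memory, and only then falls back to the shared queue. The sketch below is an assumption-laden illustration in Python, not the Twister for Azure implementation: `Worker`, `pick_task` and the task dictionaries are hypothetical names.

```python
from collections import deque


class Worker:
    """Hypothetical worker with an in-memory cache of data partition ids."""

    def __init__(self, name):
        self.name = name
        self.cache = set()


def pick_task(worker, bulletin_board, queue):
    """Return (task, source) for the worker, or (None, None) if nothing fits."""
    # 1) Prefer bulletin-board tasks whose input is already cached locally.
    for task in bulletin_board:
        if task["data_id"] in worker.cache:
            bulletin_board.remove(task)
            return task, "bulletin-board"
    # 2) Otherwise take the next queued task and remember its input as cached.
    if queue:
        task = queue.popleft()
        worker.cache.add(task["data_id"])
        return task, "queue"
    return None, None
```

For example, a worker caching partition `p2` picks the `p2` task from the board before touching the queue; tasks for uncached partitions stay on the board for other workers.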
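[Figure: Twister for Azure worker role. A Worker Role hosts Map Workers (Map 1, Map 2, ... Map n), Reduce Workers (Red 1, Red 2, ... Red n), an In Memory Data Cache, Task Monitoring and Role Monitoring. Supporting services: Map Task Table and Job Bulletin Board (both shown with MapID ... Status columns) and a Scheduling Queue.]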
Kmeans Performance with/without data caching.
Performance Comparisons
[Figure: BLAST Sequence Search. Parallel efficiency (0% to 100%) vs. number of query files (128 to 728) for Twister4Azure, Hadoop-Blast and DryadLINQ-Blast.]
[Figure: Cap3 Sequence Assembly. Parallel efficiency (50% to 100%) vs. Num. of Cores * Num. of Files for Twister4Azure, Amazon EMR and Apache Hadoop.]
[Figure: Smith Waterman Sequence Alignment. Adjusted time (s) (0 to 3000) vs. Num. of Cores * Num. of Blocks for Twister4Azure, Amazon EMR and Apache Hadoop.]
[Figure: Kmeans scaling speedup and Kmeans increasing number of iterations. Relative parallel efficiency (0% to 160%) and time (s) (0 to 1600) vs. Num Instances X Num Data Points (8 X 16M, 16 X 32M, 32 X 64M, 48 X 96M, 64 X 128M).]
Dryad & Parallel Applications
DryadLINQ CTP Evaluation
• The beta version was released in Dec 2010
• Motivation:
  • Evaluate key features and interfaces in DryadLINQ
  • Study the parallel programming model in DryadLINQ
• Three applications
  • SW-G bioinformatics application
  • Matrix-Matrix Multiplication
  • PageRank
Parallel programming model
• DryadLINQ stores input data as DistributedQuery<T> objects
• It splits distributed objects into partitions with the following APIs:
  • AsDistributed()
  • RangePartition()
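The idea behind range partitioning can be shown with a short Python sketch. It is not the DryadLINQ API: `range_partition`, `boundaries` and `key` are hypothetical names, and the boundary semantics (a record equal to a boundary goes to the partition on its right) is an assumption made for the example.

```python
from bisect import bisect_right


def range_partition(records, boundaries, key=lambda r: r):
    """Split records into len(boundaries) + 1 partitions using sorted boundaries."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        # bisect_right finds which key range the record falls into.
        partitions[bisect_right(boundaries, key(record))].append(record)
    return partitions


# Keys below 10, keys from 10 up to (not including) 20, keys of 20 and above.
parts = range_partition([5, 12, 1, 20, 10], [10, 20])
```

Each resulting partition could then be handed to a separate vertex for processing.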
[Figure: DryadLINQ on a Windows HPC Server 2008 R2 Cluster. Workstation computer: DryadLINQ Provider, DSC Client Service, HPC Client Utilities. Head node: DSC Service, HPC Job Scheduler Service. Compute nodes: Dryad graph manager and Vertex 1, Vertex 2, ... Vertex n, each with its Data.]
Common LINQ providers
Provider           Base class
LINQ-to-objects    IEnumerable<T>
PLINQ              ParallelQuery<T>
LINQ-to-SQL        IQueryable<T>
LINQ-to-?          IQueryable<T>
DryadLINQ          DistributedQuery<T>
Matrix-Matrix Multiplication
• Parallel programming algorithms
  • Row split
  • Row Column split
  • 2 dimensional block decomposition in Fox algorithm
• Multi core technologies in .NET
  • TPL, PLINQ, Thread pool
• Hybrid parallel model
  • Port multi-core to Dryad task to improve performance
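The row split algorithm is the simplest of the three: the rows of A are divided into blocks, and each task multiplies its block by the whole of B. A minimal Python sketch follows; it uses a local thread pool in place of Dryad tasks, and the names `split_rows`, `multiply_block` and `row_split_multiply` are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(matrix, n_parts):
    """Cut the matrix into at most n_parts contiguous row blocks."""
    size = max(1, -(-len(matrix) // n_parts))  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def multiply_block(rows, b):
    """Multiply one row block of A by the full matrix B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in rows]


def row_split_multiply(a, b, n_parts=2):
    """Row split: each task gets a block of A's rows and all of B."""
    blocks = split_rows(a, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # map() keeps the block order, so the output rows stay in order.
        results = list(pool.map(multiply_block, blocks, [b] * len(blocks)))
    return [row for block in results for row in block]
```

Row split needs no communication between tasks but sends all of B to every task; the Fox algorithm trades that for block exchanges between steps.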
[Figure: Matrix multiplication results for Fox-DSC, RowColumn-DSC and RowSplit-DSC, comparing TPL, Thread, Task and PLINQ. Value axis 0 to 250; axis label not recoverable.]
PageRank
Grouped Aggregation
• A core primitive of many distributed programming models.
• Two stages:
  1) Partition the data into groups by some keys
  2) Perform an aggregation over each group
• DryadLINQ provides two types of grouped aggregation
  • GroupBy(), without partial aggregation optimization.
  • GroupAndAggregate(), with partial aggregation.
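The difference between the two approaches is how much data crosses partition boundaries. The Python sketch below illustrates the general idea with a sum aggregation; it is not DryadLINQ code, and both function names and the `records_moved` counter are hypothetical.

```python
from collections import defaultdict


def group_by_then_aggregate(partitions):
    """No partial aggregation: every record is shuffled to its group first."""
    shuffled = defaultdict(list)
    records_moved = 0
    for part in partitions:
        for key, value in part:
            shuffled[key].append(value)
            records_moved += 1
    return {k: sum(v) for k, v in shuffled.items()}, records_moved


def partial_then_final_aggregate(partitions):
    """Partial aggregation: pre-sum inside each partition, then merge subtotals."""
    totals = defaultdict(float)
    records_moved = 0
    for part in partitions:
        local = defaultdict(float)
        for key, value in part:
            local[key] += value
        records_moved += len(local)  # one subtotal per key per partition
        for key, subtotal in local.items():
            totals[key] += subtotal
    return dict(totals), records_moved
```

Both functions return the same totals; with partial aggregation the number of records moved is bounded by keys per partition rather than by the input size.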
[Figure: PageRank grouped aggregation. Time in seconds (0 to 3500) vs. number of am files (1280, 960, 640, 320) for GroupAndAggregate, TwoApplyPerPartition, OneApplyPerPartition, GroupBy and HierarchicalAggregation.]
NIH Projects
Sequence Clustering
[Figure: Sequence clustering pipeline. Gene Sequences → Pairwise Alignment & Distance Calculation → Distance Matrix → Pairwise Clustering → Cluster Indices, and Distance Matrix → Multi-Dimensional Scaling → Coordinates; Cluster Indices and Coordinates → Visualization (3D Plot).]
• Pairwise alignment: Smith-Waterman / Needleman-Wunsch with Kimura2 / Jukes-Cantor / Percent-Identity
• Three stages are labeled MPI.NET Implementation
• Chi-Square / Deterministic Annealing
• Visualization: C# Desktop Application based on VTK
* Note: The implementations of the Smith-Waterman and Needleman-Wunsch algorithms are from the Microsoft Biology Foundation library.
Scale-up Sequence Clustering with Twister
[Figure: Scale-up sequence clustering pipeline. Gene Sequences (N = 1 Million) → Select Reference → Reference Sequence Set (M = 100K) and N - M Sequence Set (900K). Reference set → Pairwise Alignment & Distance Calculation → Distance Matrix → Multi-Dimensional Scaling (MDS) → Reference Coordinates (x, y, z). N - M set → Interpolative MDS with Pairwise Distance Calculation → N - M Coordinates (x, y, z). Both coordinate sets → Visualization (3D Plot). Complexity labels on the figure: O(MxM), O(MxM), O(Mx(N-1)); e.g. 25 Million.]
Services and Support
• Web Portal and Metadata Management
• CGB work
// todo - Ryan
GTM vs. MDS
                       GTM                         MDS (SMACOF)
Purpose (both)         • Non-linear dimension reduction
                       • Find an optimal configuration in a lower dimension
                       • Iterative optimization method
Objective Function     Maximize Log-Likelihood     Minimize STRESS or SSTRESS
Complexity             O(KN) (K << N)              O(N^2)
Optimization Method    EM                          Iterative Majorization (EM-like)
Input                  Vector-based data           Non-vector (pairwise similarity matrix)
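To make the MDS objective concrete, the sketch below computes STRESS for a candidate low-dimensional configuration. The slides do not give the formula; this assumes the common unweighted raw form, the sum over pairs i < j of (d_ij - delta_ij)^2, where d_ij is the Euclidean distance between mapped points and delta_ij the input dissimilarity. The function name `raw_stress` is hypothetical.

```python
import math


def raw_stress(points, dissimilarities):
    """Unweighted raw STRESS of a configuration against a dissimilarity matrix."""
    n = len(points)
    total = 0.0
    # The double loop over all pairs is where the O(N^2) cost comes from.
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = math.dist(points[i], points[j])
            total += (d_ij - dissimilarities[i][j]) ** 2
    return total
```

A configuration that reproduces every input dissimilarity exactly has a STRESS of zero; an iterative method such as SMACOF moves the points to reduce this value.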
PlotViz
[Figure: PlotViz with Chem2Bio2RDF. Visualization Algorithms (parallel dimension reduction algorithms) and Chem2Bio2RDF (aggregated public databases: PubChem, CTD, DrugBank, QSAR) connect to PlotViz (light-weight client) through a 3-D Map File, SPARQL query and Meta data.]
Education
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
[Figure: Dynamic Cluster Architecture and Monitoring Infrastructure. Dynamic cluster architecture: SW-G Using Hadoop on Linux Bare-system, SW-G Using Hadoop on Linux on Xen, and SW-G Using DryadLINQ on Windows Server 2008 Bare-system, over XCAT Infrastructure on iDataplex Bare-metal Nodes (32 nodes). Monitoring & control infrastructure: Monitoring Interface, Pub/Sub Broker Network, Summarizer and Switcher over Virtual/Physical Clusters, XCAT Infrastructure and iDataplex Bare-metal Nodes.]
Demonstrate the concept of Science on Clouds on FutureGrid
Demonstrate the concept of Science on Clouds using a FutureGrid cluster
http://salsahpc.indiana.edu/b534
http://salsahpc.indiana.edu/b534projects