Yannis Ioannidis
MaDgIK Lab, University of Athens & ATHENA Research Center
Slide 2
Motivation
Relevant Systems & Languages
The ADP System
Optimal Dataflow Scheduling
Conclusions & Future Work
Slide 3
Slide 4
High-level Data Languages: languages to easily express data operations (semantics)
Big Data Processing: TB or PB of data (scientific, sensors, ...) (efficiency)
(Query) optimization: reconciling efficiency and semantics
Slide 5
[Figure: cloud architecture diagram. Labels: Q, A, Cloud, Processing Containers (C-k), Data Registry (Data1 ... Data-j), Operator Registry (Op1 ... Op-i)]
Slide 6
select name, department, salary
from employee e, dept d
where e.position = 'director' and e.did = d.id

[Figure: operator graphs for the query, first over employee and dept (select, join, project), then with employee split into employee 1 and employee 2 (partition, select, join, project, union)]
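The partitioned plan for the query above can be sketched in a few lines of Python. This is only an illustration of the idea (split employee, run select, join and project on each part, then union the results), not the system's actual executor. The function names, the round-robin partitioning, and the assumption that dept carries the columns id and department are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]

def partition(rows: List[Row], n: int) -> List[List[Row]]:
    # Round-robin split into n parts (an assumed partitioning strategy).
    return [rows[i::n] for i in range(n)]

def select_directors(rows: List[Row]) -> List[Row]:
    # select: e.position = 'director'
    return [r for r in rows if r["position"] == "director"]

def join_dept(emps: List[Row], depts: List[Row]) -> List[Row]:
    # join: e.did = d.id, implemented as a hash join on dept.id
    by_id = {d["id"]: d for d in depts}
    return [
        {**e, "department": by_id[e["did"]]["department"]}
        for e in emps
        if e["did"] in by_id
    ]

def project(rows: List[Row]) -> List[Row]:
    # project: name, department, salary
    return [
        {"name": r["name"], "department": r["department"], "salary": r["salary"]}
        for r in rows
    ]

def run_partitioned(employees: List[Row], depts: List[Row], n: int = 2) -> List[Row]:
    # Each partition could run in its own container; here they run in sequence.
    result: List[Row] = []
    for part in partition(employees, n):
        result.extend(project(join_dept(select_directors(part), depts)))  # union
    return result
```

Because select, join and project are applied independently to each part, the union of the partial results equals the result of the unpartitioned plan, up to row order.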
Slide 7
Query: graph of relational algebra operators
Optimality: response time or completion time
Environment: cluster of dedicated distributed / parallel hosts
Slide 8
Query: graph of arbitrary operators
Optimality: response time or completion time, and money
Environment: cloud of hosts (elasticity)
Slide 9
[Figure: dataflow graph with nodes numbered 1 to 5]
Graph of arbitrary operators
Non-relational data analytics
  Query log analysis
  Data mining
  Simulation model composition
User behavior analysis for European national libraries
  One of sixteen flows
Slide 10
select name, department, salary
from employee e, dept d
where haircolor(e.picture) = 'gray' and e.did = d.id

[Figure: operator graphs for the query, first over employee and dept (haircolor, join, project), then with employee split into employee 1 and employee 2 (partition, haircolor, join, project, union). Additional labels: Operator Registry, f]
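One way to picture an arbitrary operator such as haircolor being looked up in an operator registry is the minimal Python sketch below. Everything in it is an assumption made for illustration: the registry is a plain dictionary, `register` and `select_by_operator` are hypothetical helpers, and the `haircolor` body is a placeholder that reads a precomputed label instead of analysing an image.

```python
from typing import Callable, Dict, List

# Hypothetical operator registry: operator name -> callable.
OPERATOR_REGISTRY: Dict[str, Callable] = {}

def register(name: str) -> Callable:
    # Decorator that stores a function in the registry under `name`.
    def decorator(fn: Callable) -> Callable:
        OPERATOR_REGISTRY[name] = fn
        return fn
    return decorator

@register("haircolor")
def haircolor(picture: dict) -> str:
    # Placeholder: a real operator would analyse image data.
    return picture.get("hair", "unknown")

def select_by_operator(rows: List[dict], op_name: str, column: str, expected) -> List[dict]:
    # Generic select that applies a registered operator to one column,
    # e.g. haircolor(e.picture) = 'gray'.
    op = OPERATOR_REGISTRY[op_name]
    return [r for r in rows if op(r[column]) == expected]
```

The point of the sketch is that the plan treats haircolor as an opaque function: the optimizer can place it anywhere a select could go, but it cannot rely on relational algebra rules to reason about its cost.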
Virtualized IT resources offered as an on-demand service
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
Variety of charging and use policies
Slide 13
Cloud of hosts (elasticity)
Virtual resources (virtual hosts = containers)
Available on demand
Used for as much time as needed
Leased on a per-quantum pricing scheme
Illusion of infinite resources
Arbitrary # of choices of price/performance ratio
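The per-quantum pricing scheme and the resulting price/performance choices can be illustrated with a small Python sketch. It assumes that a partially used quantum is billed as a full quantum, which the slide does not state; the function names and the numbers in the usage note are likewise hypothetical.

```python
from typing import List

def lease_cost(busy_seconds: int, quantum_seconds: int, price_per_quantum: float) -> float:
    # Assumption: partially used quanta are charged in full (integer ceiling).
    quanta = -(-busy_seconds // quantum_seconds)
    return quanta * price_per_quantum

def schedule_cost(container_busy_seconds: List[int], quantum_seconds: int,
                  price_per_quantum: float) -> float:
    # Total money: each container is leased independently.
    return sum(lease_cost(s, quantum_seconds, price_per_quantum)
               for s in container_busy_seconds)

def schedule_time(container_busy_seconds: List[int]) -> int:
    # Completion time if all containers run in parallel from time zero.
    return max(container_busy_seconds, default=0)
```

For example, with a 3600-second quantum priced at 1.0, one container busy for 7200 seconds costs 2.0 and finishes in 7200 seconds, while four containers busy for 1900 seconds each cost 4.0 and finish in 1900 seconds. Faster schedules can cost more money, which is why both time and money appear in the optimality criterion.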
Slide 14
Slide 15
Hadoop: open source software for reliable, scalable, distributed computing. Won Jim Gray's Terabyte Sort Benchmark in 2008 (209 seconds)
Hive (Facebook): data warehouse built on top of Hadoop
Condor: designed for CPU-intensive applications
Pegasus: scientific workflows on the Grid
Cassandra (Facebook): P2P database with linear scalability
Slide 16
Map-Reduce (Google): Jim Gray's Terabyte Sort Benchmark in 68 seconds in 2009
PNUTS (Yahoo! Research): massively parallel & geographically distributed database system
Dryad (Microsoft Research): general-purpose distributed execution engine for coarse-grain data-parallel applications
Greenplum: massively parallel processing (MPP) database
Hadapt: SQL processing on top of Hadoop
Slide 17
SQL
Hive-QL
Pig-Latin: dataflow language
Datalog
Mashups: Yahoo! Pipes, MashQL
Slide 18
Hive-QL is based on SQL

    CREATE TABLE page_view(
      viewTime INT,
      userid BIGINT,
      page_url STRING,
      referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date