Yannis Ioannidis MaDgIK Lab University of Athens & ATHENA Research Center

Motivation Relevant Systems & Languages The ADP System Optimal Dataflow Scheduling Conclusions & Future work

High-level Data Languages Languages to easily express data operations Semantics Big Data Processing TB or PB of data (scientific, sensors, ) Efficiency (Query) optimization Reconciling efficiency and semantics

Op1 Op-i Data1 Data-j Q C-k A Cloud Processing Containers Data Registry Operator Registry

select name, department, salary from employee e, dept d where e.position = director and e.did = d.id employee dept select join project dept join partition project join employee 1 employee 2 partition project select partition union

Query:graph of relational algebra operators Optimality:response time or completion time Environment:cluster of dedicated distributed / parallel hosts

Query:graph of arbitrary operators Optimality:response time or completion time and money Environment:cloud of hosts (elasticity)

1 2 3 4 5 Graph of arbitrary operators Non-relational data analytics Query log analysis Data mining Simulation model composition User behavior analysis for European national libraries One of sixteen flows

select name, department, salary from employee e, dept d where haircolor(e.picture) = gray and e.did = d.id employee dept haircolor join project dept join partition project join employee 1 employee 2 partition project haircolor partition union haircolor Operator Registry f f f f f f

dept join partition project join employee 1 employee 2 partition project haircolor partition union Cloud

Virtualized IT resources offered as on-demand service Software as a Service (SaaS) Platform as a Service (PaaS) Infrastructure as a Service (IaaS) Variety of charging and use policies

Cloud of hosts (elasticity) Virtual resources (virtual hosts = containers) Available on demand Used for as much time needed Leased on a per quantum pricing scheme Illusion of infinite resources Arbitrary # of choices of price/performance ratio

Hadoop Open source software for reliable, scalable, distributed computing Won Jim Grays Terabyte Sort Benchmark in 2008 (209 seconds) Hive (Facebook) Data warehouse build on top of Hadoop Condor Designed for CPU intensive applications Pegasus Scientific workflows on the Grid Cassandra (Facebook) P2P database with linear scalability

Map-Reduce (Google) Jim Grays Terabyte Sort Benchmark in 68 seconds in 2009 PNUTS (Yahoo! Research) Massively parallel & geographically distributed database system Dryad (Microsoft Research) General-purpose distributed execution engine for coarse-grain data-parallel applications Greenplum Massively parallel processing (MPP) database Hadapt SQL processing on top of Hadoop

SQL Hive-QL Pig-Latin Dataflow language Datalog Mashups Yahoo! pipes MashQL

Hive-QL is based on SQL CREATE TABLE page_view( viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY( dt STRING, country STRING) STORED AS SEQUENCEFILE; INSERT OVERWRITE TABLE xyz_com_page_views SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date

Documents

Yannis Ioannidis MaDgIK Lab University of Athens & ATHENA Research Center